CN117280709A - Image restoration for an under-screen camera - Google Patents

Image restoration for an under-screen camera

Info

Publication number
CN117280709A
Authority
CN
China
Prior art keywords
image
sequence
exposure time
images
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280029102.9A
Other languages
Chinese (zh)
Inventor
李江
欧阳灵
解扬波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology Inc filed Critical Innopeak Technology Inc
Publication of CN117280709A publication Critical patent/CN117280709A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/90 - Dynamic range modification of images or parts thereof
    • G06T 5/94 - Dynamic range modification of images or parts thereof based on local image properties, e.g. for local contrast enhancement
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/60 - Image enhancement or restoration using machine learning, e.g. neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30201 - Face

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The present application is directed to image preprocessing that facilitates deep learning based image processing. An electronic device obtains a high dynamic range (HDR) setting comprising a sequence of exposure settings and controls an under-screen camera having the sequence of exposure settings to capture a sequence of HDR images. The HDR image sequence is combined into an input image. The electronic device processes the input image using an image processing network to generate an output image. The input image has an input image quality, and the output image has an output image quality greater than the input image quality. In some embodiments, the electronic device determines a luminance dynamic range requirement of the image data to be processed by the image processing network and determines the HDR setting, including the sequence of exposure settings, based on the luminance dynamic range requirement.

Description

Image restoration for an under-screen camera
Cross Reference to Related Applications
The present application claims priority to U.S. Provisional Patent Application No. 63/183,827, entitled "Neural Network Architecture for Image Restoration of an Under-Screen Camera," filed on May 4, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates generally to data processing techniques including, but not limited to, methods, systems, and non-transitory computer readable media for restoring images captured by an under-screen camera using neural networks and image processing algorithms.
Background
Under-screen cameras (UDCs) are widely used in consumer electronic devices (e.g., smartphones, tablets, notebooks, and televisions) to provide a full-screen user interaction experience. A UDC removes the bezel or notch and embeds the front-facing camera directly under the display pixels of the display panel. When the camera is enabled for image capture, the display pixels above it are disabled to minimize their interference with the camera. However, the inherent microstructure of the display pixels disposed above the camera still reduces the image quality of the UDC, because it results in low light transmittance and strong scattering and diffraction effects. Example image impairments of the UDC include a reduced signal-to-noise ratio, haze, glare, blurring, and light diffraction artifacts. It would be beneficial to have an effective and efficient mechanism for reducing or eliminating the image impairments caused by the display microstructure of the UDC.
Disclosure of Invention
Embodiments of the present application are directed to restoring images captured by a UDC using deep learning techniques. The image restoration model includes a masking neural network for identifying and emphasizing one or more defective areas of an input image captured by the UDC and a restoration neural network for restoring the defective input image. Each neural network of the image restoration model is optionally trained end-to-end, accelerating the inference speed of restoring the input image. In particular, such image restoration models are used to enhance UDC image quality and reduce artifacts including, but not limited to, noise, haze, blurring, and strong light diffraction artifacts. In some embodiments, the network is a lightweight, multi-mask network.
Additionally, in some embodiments, particularly when the image restoration model is used to restore high-resolution input images or runs on electronic devices with limited computing resources, a model pruning pipeline is applied to enhance the image restoration model and speed up inference. The model pruning pipeline uses an automatic machine learning (AutoML) method to select, from a plurality of pruned model candidates, the pruned model with the greatest inference accuracy. The selected pruned model meets the computational requirements of the electronic device. In an example, the model pruning pipeline is based on two rounds of network searches. A network search space is defined with a plurality of pruned model candidates, each of which is pruned over a plurality of training steps and has a low loss or a limited accuracy reduction (e.g., within a loss tolerance). Optionally, an alternative search space is generated based on the plurality of pruned model candidates and is used to fine-tune each pruned model candidate to identify the selected pruned model with the lowest training loss.
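By way of a non-patent illustration, the final selection step of such a pruning pipeline can be sketched as picking, among pruned candidates that stay within a loss tolerance and a compute budget, the candidate with the greatest accuracy; the names, metrics, and numbers below are assumptions, not details disclosed in this application.

```python
# Hypothetical sketch of the AutoML-style selection step of a model pruning
# pipeline: pick the pruned candidate with the highest validation accuracy
# that also fits the device's compute budget. Names and values are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PrunedCandidate:
    name: str
    mflops: float          # compute cost of the pruned model
    val_accuracy: float    # accuracy after fine-tuning on the alternative search space
    train_loss: float      # loss after the fine-tuning round

def select_pruned_model(candidates: List[PrunedCandidate],
                        mflops_budget: float,
                        loss_tolerance: float) -> Optional[PrunedCandidate]:
    """Keep candidates within the loss tolerance and compute budget,
    then return the one with the greatest inference accuracy."""
    feasible = [c for c in candidates
                if c.mflops <= mflops_budget and c.train_loss <= loss_tolerance]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.val_accuracy)

# Example usage with made-up candidates.
candidates = [
    PrunedCandidate("prune_30pct", 820.0, 0.912, 0.041),
    PrunedCandidate("prune_50pct", 560.0, 0.905, 0.047),
    PrunedCandidate("prune_70pct", 340.0, 0.861, 0.093),
]
best = select_pruned_model(candidates, mflops_budget=600.0, loss_tolerance=0.05)
print(best.name if best else "no feasible candidate")
```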
In one aspect, an image restoration method is implemented at an electronic device. The method includes obtaining an input image captured by an under-screen camera and applying a masking neural network to the input image to generate a mask identifying one or more defective areas of the input image. The masking neural network includes a sequence of convolutional layers. The method further includes combining (e.g., concatenating) the mask and the input image to provide a masked input image, and applying a restoration neural network to the masked input image to generate an output image. The restoration neural network includes a sequence of gating blocks, and each gating block includes two parallel convolutional layers for processing a respective gating input. The input image has an input image quality, and the output image has an output image quality greater than the input image quality.
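As a minimal sketch under the assumption that combining means channel-wise concatenation of a single-channel mask with an RGB input image (the shapes are illustrative only):

```python
# Hypothetical illustration of "combining (e.g., concatenating)" a single-channel
# mask with an RGB input image along the channel dimension.
import numpy as np

h, w = 256, 256                                             # assumed image size
input_image = np.random.rand(3, h, w).astype(np.float32)    # RGB input from the UDC
mask = np.random.rand(1, h, w).astype(np.float32)           # grayscale defect mask in [0, 1]

masked_input = np.concatenate([input_image, mask], axis=0)  # 4-channel masked input
print(masked_input.shape)  # (4, 256, 256) is fed to the restoration network
```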
In some embodiments, the input image quality is indicated by an input signal-to-noise ratio (SNR) of the input image, and the output image quality is indicated by an output SNR of the output image. The output SNR is greater than the input SNR.
In some embodiments, the masking neural network comprises a lightweight U-Net having at least a masking encoder network comprising a sequence of convolutional layers.
In some embodiments, the mask includes a grayscale map having a plurality of grayscale elements. Each grayscale element represents an intensity weight of an artifact, such as a diffraction pattern, at a respective pixel or pixel region, and each grayscale element has a respective grayscale value in the range [0,1].
In another aspect, an image processing method is implemented at an electronic device. The method includes obtaining a high dynamic range (HDR) setting comprising a sequence of exposure settings and controlling an under-screen camera having the sequence of exposure settings to capture a sequence of HDR images. The method also includes combining the HDR image sequence to generate an input image, and processing the input image using an image processing network to generate an output image. The input image has an input image quality, and the output image has an output image quality greater than the input image quality. In some embodiments, the method further includes obtaining a luminance dynamic range requirement of the image data to be processed by the image processing network. The method further includes determining the HDR setting, including the sequence of exposure settings, based on the luminance dynamic range requirement, the input image combined from the sequence of HDR images meeting the luminance dynamic range requirement.
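The application does not specify how the HDR image sequence is combined; as a hedged sketch, a simple exposure-normalized weighted merge is shown below, with the weighting scheme being an assumption rather than a disclosed algorithm.

```python
# Hypothetical merge of an HDR exposure bracket into one input image.
# Each capture is normalized by its exposure time and averaged with weights
# that favor well-exposed pixels; this specific scheme is an assumption.
import numpy as np

def merge_hdr(images, exposure_times, eps=1e-6):
    """images: list of HxWx3 arrays in [0, 1]; exposure_times: seconds per capture."""
    acc = np.zeros_like(images[0], dtype=np.float64)
    weight_sum = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Hat-shaped weight: trust mid-tone pixels more than near-black/near-white ones.
        weight = 1.0 - np.abs(img - 0.5) * 2.0
        acc += weight * (img / (t + eps))        # per-capture radiance estimate
        weight_sum += weight
    radiance = acc / (weight_sum + eps)
    return radiance / (radiance.max() + eps)     # rescale into [0, 1] as the input image

captures = [np.clip(np.random.rand(4, 4, 3) * s, 0, 1) for s in (0.25, 1.0, 4.0)]
input_image = merge_hdr(captures, exposure_times=[1 / 500, 1 / 125, 1 / 30])
print(input_image.shape, float(input_image.max()))
```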
In another aspect, some embodiments include an electronic device comprising one or more processors and memory having instructions stored thereon that, when executed by the one or more processors, cause the processors to perform any of the methods described above.
In yet another aspect, some embodiments include a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the processors to perform any of the methods described above.
These illustrative examples and embodiments are mentioned not to limit or define the disclosure, but to provide examples that aid in understanding it. Additional embodiments are discussed in the detailed description, where further description is provided.
Drawings
For a better understanding of the various embodiments described, reference should be made to the following detailed description taken in conjunction with the following drawings in which like reference numerals identify corresponding parts in the figures.
FIG. 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, according to some embodiments.
Fig. 2 is a block diagram illustrating an electronic device for processing content data (e.g., image data), in accordance with some embodiments.
FIG. 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, according to some embodiments.
FIG. 4A is an example neural network applied to processing content data in an NN-based data processing model in accordance with some embodiments, and FIG. 4B is an example node in a neural network in accordance with some embodiments.
Fig. 5 is a block diagram of an example camera module of an electronic device, according to some embodiments.
Fig. 6A is a block diagram of an example image restoration model for restoring an image captured by an electronic device, according to some embodiments, and fig. 6B is a block diagram of an example gating block applied in the image restoration model shown in fig. 6A, according to some embodiments.
FIG. 7 is a block diagram of another example image restoration model for restoring an input image acquired from image data captured by a UDC, in accordance with some embodiments.
Fig. 8 is a block diagram of an importance-based filter pruning process for a pruned image restoration model, according to some embodiments.
Fig. 9 is a flow chart of a process for preprocessing image data to be restored using an image processing network, in accordance with some embodiments.
Fig. 10A and 10B compare an example original image with an image restored from the example original image according to some embodiments.
Fig. 10C and 10D compare another example original image with an image restored from the example original image according to some embodiments.
FIG. 11 is a flowchart of an example image restoration method according to some embodiments.
Fig. 12 is a flowchart of an example image processing method according to some embodiments.
Like reference numerals designate corresponding parts throughout the several views of the drawings.
Detailed Description
Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives may be used without departing from the scope of the claims, and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented on a variety of types of electronic devices having digital video capabilities.
Embodiments of the present application are directed to restoring images in electronic devices (e.g., mobile handsets) having an under-screen camera (UDC) to reduce or eliminate imaging artifacts caused by the microstructure of the display panel. The image restoration model is multi-tasking compatible to combine multiple tasks in a single network and thereby make the model itself lightweight and efficient. Specifically, the image restoration model includes a masking neural network for identifying one or more defective areas on the input image and a restoration neural network for restoring the masked input image. In some embodiments, the image restoration model is further accelerated with an automatic pruning pipeline (automatic pruning pipeline) and may be executed on a GPU of an electronic device with limited computing resources, and may reduce inference time to below one second.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. For example, the one or more client devices 104 may be desktop computers 104A, tablet computers 104B, mobile handsets 104C, head-mounted displays (HMDs, also referred to as augmented reality (AR) glasses) 104D, or smart, multi-sensing, network-connected home devices (e.g., surveillance cameras 104E, smart television devices, drones). Each client device 104 may collect data or user input, execute a user application, and present output on its user interface. The collected data or user input may be processed locally at the client device 104 and/or remotely by the server 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104 and, in some embodiments, process data and user inputs received from the client devices 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 also includes a memory 106 for storing data related to the servers 102, the client devices 104, and the applications executing on the client devices 104.
The one or more servers 102 are used to enable real-time data communication with client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are used to implement data processing tasks that cannot, or preferably should not, be performed locally by the client devices 104. For example, the client device 104 includes a game console (e.g., the HMD 104D) executing an interactive online game application. The game console receives user instructions and transmits the user instructions and user data to the game server 102. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream for display on the game console and other client devices participating in the same game session as the game console. In another example, the client devices 104 include a networked monitoring camera 104E and a mobile handset 104C. The networked monitoring camera 104E collects video data and streams the video data to the monitoring camera server 102 in real time. While the video data is optionally pre-processed on the monitoring camera 104E, the monitoring camera server 102 processes the video data to identify motion events or audio events in the video data and shares information of those events with the mobile handset 104C, allowing the user of the mobile handset 104C to monitor, in real time and remotely, events occurring in the vicinity of the networked monitoring camera 104E.
The one or more servers 102, the one or more client devices 104, and the memory 106 are communicatively coupled to one another via one or more communication networks 108, which are the medium used to provide communication links between the devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination thereof. Optionally, the one or more communication networks 108 are implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, universal serial bus (USB), FIREWIRE, long term evolution (LTE), global system for mobile communications (GSM), enhanced data GSM environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established directly (e.g., using a 3G/4G connection to a wireless carrier), through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent dedicated whole-home control node), or through any combination thereof. Thus, the one or more communication networks 108 may represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) acquired by an application executing at the client device 104 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data. In these deep learning techniques, a data processing model is created based on one or more neural networks to process content data. The data processing models are trained with training data before they are used to process content data. After model training, the client device 104 obtains content data (e.g., captures video data via an internal camera) and processes the content data locally using a data processing model.
In some embodiments, model training and data processing is implemented locally at each individual client device 104. The client device 104 retrieves training data from one or more servers 102 or memories 106 and applies the training data to train the data processing model. Alternatively, in some embodiments, model training and data processing is implemented remotely from a server 102 (e.g., server 102A) associated with the client device 104. The server 102A retrieves training data from itself, another server 102, or the memory 106 and applies the training data to train the data processing model. The client device 104 obtains content data, sends the content data (e.g., in an application) to the server 102A for data processing using the trained data processing model, receives data processing results (e.g., recognized gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on gestures, or implements some other function based on the results. The client device 104 itself performs no or little data processing on the content data before sending the content data to the server 102A. Additionally, in some embodiments, data processing is implemented locally at the client device 104, while model training is implemented remotely at a server 102 (e.g., server 102B) associated with the client device 104. Server 102B retrieves training data from itself, another server 102, or memory 106 and applies the training data to train the data processing model. The trained data processing model is optionally stored in server 102B or memory 106. Client device 104 imports a trained data processing model from server 102B or memory 106, processes content data using the data processing model, and generates data processing results to be presented on a user interface or used to locally initiate some function (e.g., rendering a virtual object based on device gestures).
In some embodiments, a pair of AR glasses 104D (also referred to as HMDs) are communicatively coupled in the data processing environment 100. AR glasses 104D include a camera, microphone, speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. A camera and microphone are used to capture video and audio data from the scene of AR glasses 104D, while one or more inertial sensors are used to capture inertial sensor data. In some cases, the camera captures the gesture of the user wearing the AR glasses 104D and recognizes the gesture locally and in real-time using a two-stage gesture recognition model. In some cases, the microphone records ambient sound including a voice command of the user. In some cases, both video or still visual data captured by the camera and inertial sensor data measured by one or more inertial sensors are used to determine and predict device pose. Video, still image, audio, or inertial sensor data captured by AR glasses 104D is processed by AR glasses 104D and/or server 102 to recognize device gestures. Optionally, the server 102 and AR glasses 104D jointly apply deep learning techniques to recognize and predict device gestures. The device gestures are used to control the AR glasses 104D itself or interact with applications (e.g., gaming applications) executed by the AR glasses 104D. In some embodiments, the display of AR glasses 104D displays a user interface, and the recognized device gestures or predicted device gestures are used to render or interact with user-selectable display items (e.g., avatars) on the user interface.
Fig. 2 is a block diagram illustrating an electronic device 200 for processing content data (e.g., image data), in accordance with some embodiments. The electronic device 200 includes a client device 104 (e.g., the mobile handset 104C or the AR glasses 104D in Fig. 1). The electronic device 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic device 200 includes one or more input devices 210 (e.g., a keyboard, mouse, voice command input unit or microphone, touch screen display, touch-sensitive tablet, gesture-capture camera, or other input buttons or controls) that facilitate user input. Further, in some embodiments, the electronic device 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard. In some embodiments, the electronic device 200 includes one or more cameras (e.g., an under-screen camera 260), scanners, or photosensor units for capturing images (e.g., of a graphic sequence code printed on the electronic device). In addition, in some embodiments, the electronic device 200 includes an image signal processor (ISP) 203 for converting raw image signals captured by the under-screen camera 260 into output image data to be used by other functional modules of the electronic device 200.
The electronic device 200 also includes one or more output devices 212 capable of presenting user interfaces and displaying content, the one or more output devices 212 including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device (e.g., a global positioning system (global positioning system, GPS) or other geographic location receiver) for determining the location of the client device 104.
Memory 206 includes high-speed random access memory (e.g., DRAM, SRAM, DDR RAM, or other random access solid state memory devices); and optionally includes non-volatile memory such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206 optionally includes one or more storage devices remote from the one or more processing units 202. The memory 206 or alternatively the non-volatile memory within the memory 206 includes a non-transitory computer-readable storage medium. In some embodiments, memory 206 or a non-transitory computer readable storage medium of memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof:
An operating system 214 including programs for handling various basic system services and for performing hardware-related tasks;
a network communication module 216 for connecting the electronic device 200 to other devices (e.g., server 102, client device 104, or memory 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108 (e.g., the internet, other wide area networks, local area networks, metropolitan area networks, etc.);
a user interface module 218 for enabling presentation of information (e.g., graphical user interfaces of applications 224, widgets, websites and their web pages, and/or games, audio and/or video content, text, etc.) at electronic device 200 via one or more output devices 212 (e.g., display, speaker, etc.);
an input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected inputs or interactions;
a web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and their web pages (including a web interface for logging onto a user account associated with the client device 104 or another electronic device), controlling the client device or electronic device (if associated with the user account), and editing and viewing settings and data associated with the user account;
One or more user applications 224 for execution by the electronic device 200 (e.g., games, social networking applications, smart home applications, and/or other network-based or non-network-based applications for controlling another electronic device and viewing data captured by such devices);
model training module 226 for receiving training data and building a data processing model for processing content data (e.g., video, image, audio, or text data) collected or obtained by client device 104;
a data processing module 228 for processing the content data using the data processing model 240 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, wherein in some embodiments the data processing module 228 is associated with one of the user applications 224 to process the content data in response to user instructions received from the user application 224, and in an example the data processing module 228 is applied to process the content data; and
one or more databases 230 for storing data comprising at least one or more of:
o device settings 232, including common device settings (e.g., service layer, device model, storage capacity, processing power, communication power, etc.) for one or more of server 102 or client device 104;
User account information 234 (e.g., user name, security questions, account history data, user preferences, and predefined account settings) for one or more user applications 224;
network parameters 236 of one or more communication networks 108 (e.g., IP address, subnet mask, default gateway, DNS server, and hostname);
o training data 238 for training one or more data processing models 240;
a data processing model 240 for processing content data (e.g., video, image, audio, or text data) using a deep learning technique, wherein the data processing model 240 includes an image restoration model having a masking neural network and a restoration neural network, and the masking neural network and the restoration neural network include a sequence of convolutional layers and a sequence of gated blocks, respectively; and
content data and results 254, respectively acquired by and output to the client device 104 of the electronic device 200, wherein the content data is processed by the data processing model 240, locally at the client device 104 or remotely at the server 102, to provide the associated results to be presented on the client device 104.
In some embodiments, the data processing module 228 includes a UDC recovery module 229 for recovering images captured locally by the UDC 260 or acquired from another client device 104. The UDC restoration module 229 is configured to apply an image restoration model established based on a deep learning technique to restore images degraded by the microstructure of the display panel of the corresponding UDC 260.
Optionally, one or more databases 230 are stored in one of the server 102, the client device 104, and the memory 106 of the electronic device 200. Optionally, one or more databases 230 are distributed among multiple ones of the server 102, client devices 104, and memory 106 of the electronic device 200. In some embodiments, multiple copies of the data are stored on different devices, e.g., two copies of data processing model 240 are stored on server 102 and memory 106, respectively.
Each of the above identified elements may be stored in one or more of the aforementioned storage devices and applied to a set of instructions that perform the above described functions. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus, in various embodiments, various subsets of these modules may be combined or otherwise rearranged. In some embodiments, memory 206 optionally stores a subset of the modules and data structures identified above. Further, the memory 206 may optionally store additional modules and data structures not described above.
FIG. 3 is an example data processing system 300 for training and applying a neural-network-based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or text data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for building the data processing model 240 and a data processing module 228 for processing content data using the data processing model 240. In some embodiments, the model training module 226 and the data processing module 228 are both located on a client device 104 of the data processing system 300, while a training data source 304, which is different from the client device 104, provides training data 306 to the client device 104. The training data source 304 is optionally the server 102 or the memory 106. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 that provides the training data 306 is optionally the server 102 itself, another server 102, or the memory 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are located on the server 102 and the client device 104, respectively, and the server 102 provides the trained data processing model to the client device 104.
The model training module 226 includes one or more data preprocessing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to the type of content data to be processed. The training data 306 is of the same type as the content data, and a data preprocessing module 308 corresponding to that type is applied to process the training data 306. For example, the image preprocessing module 308A is configured to process image training data 306 into a predefined image format, e.g., to extract a region of interest (ROI) in each training image and crop each training image to a predefined image size. Alternatively, the audio preprocessing module 308B is configured to process audio training data 306 into a predefined audio format, e.g., to convert each training sequence to the frequency domain using a Fourier transform. The model training engine 310 receives the preprocessed training data provided by the data preprocessing module 308, further processes the preprocessed training data using an existing data processing model 240, and generates an output from each training data item. During this process, the loss control module 312 may monitor a loss function that compares the output associated with the respective training data item to the ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function until the loss function meets a loss criterion (e.g., the comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 for processing the content data.
In some embodiments, the model training module 226 provides supervised learning, in which the training data is fully labeled and includes a desired output (also referred to as ground truth) for each training data item. In contrast, in some embodiments, the model training module 226 provides unsupervised learning, in which the training data is not labeled. The model training module 226 is used to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 provides partially supervised learning, in which the training data is partially labeled.
The data processing module 228 includes a data preprocessing module 314, a model-based processing module 316, and a data post-processing module 318. The data preprocessing module 314 preprocesses the content data based on the type of the content data. The function of the data preprocessing module 314 is consistent with that of the preprocessing module 308 and converts the content data into a predefined content format acceptable as input to the model-based processing module 316. Examples of content data include one or more of video, image, audio, text, and other types of data. For example, each image is preprocessed to extract an ROI or cropped to a predefined image size, and an audio clip is preprocessed using a Fourier transform to be converted to the frequency domain. In some cases, the content data includes two or more types (e.g., video data and text data). The model-based processing module 316 processes the preprocessed content data using the trained data processing model 240 provided by the model training module 226. The model-based processing module 316 may also monitor error indicators to determine whether the content data is properly processed in the data processing model 240. In some embodiments, the data post-processing module 318 further processes the processed content data to present it in a preferred format or to provide other relevant information that may be derived from the processed content data.
Fig. 4A is an example neural network (NN) 400 applied to processing content data in an NN-based data processing model 240, and Fig. 4B is an example node 420 in the neural network 400, according to some embodiments. The data processing model 240 is built based on the neural network 400. The respective model-based processing module 316 applies the data processing model 240 comprising the neural network 400 to process content data that has been converted into a predefined content format. The neural network 400 includes a collection of nodes 420 connected by links 412. Each node 420 receives one or more node inputs and applies a transfer function to generate a node output from the one or more node inputs. When the node output is provided to one or more other nodes 420 via one or more links 412, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w_1, w_2, w_3, and w_4 according to the transfer function. In an example, the transfer function is a product of a nonlinear activation function and a linear weighted combination of the one or more node inputs.
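As a hedged restatement in standard notation (no explicit formula is given in this extract), the node output under such a transfer function is commonly written as:

```latex
% Assumed standard form of the node transfer function described above:
% a weighted combination of node inputs passed through a nonlinear activation.
y = \sigma\!\left( \sum_{i} w_i \, x_i \right)
% where x_i are the node inputs, w_i are the weights of the incoming links 412,
% and \sigma is a nonlinear activation function (e.g., ReLU or sigmoid).
```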
The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer that serves as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving input, an output layer 406 for providing output, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406. A deep neural network has a plurality of hidden layers 404 between the input layer 402 and the output layer 406. In the neural network 400, each layer is connected only to its immediately preceding layer and/or its immediately following layer. In some embodiments, layer 402 or layer 404B is a fully connected layer because each node 420 in layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between the two layers. Specifically, max pooling uses the maximum value of the two or more nodes in layer 404B to generate the node of the immediately following layer 406 that is connected to the two or more nodes.
In some embodiments, a convolutional neural network (CNN) is applied in the data processing model 240 to process content data (particularly video data and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, namely feed-forward neural networks that move data forward only from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers that convolve with a multiplication or dot product. Each node in a convolutional layer receives input from a receptive field associated with a previous layer (e.g., five nodes), and the receptive field is smaller than the entire previous layer and may vary based on the position of the convolutional layer in the convolutional neural network. The video data or image data is preprocessed into a predefined video/image format corresponding to the input of the CNN. The preprocessed video data or image data is abstracted by each layer of the CNN into a respective feature map. In these ways, video data and image data may be processed by the CNN for video and image recognition, classification, analysis, imprinting, or composition.
Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly text data and audio data). Nodes in successive layers of the RNN follow a time sequence such that the RNN exhibits temporally dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of RNNs include, but are not limited to, long short-term memory (LSTM) networks, fully recurrent networks, Elman networks, Jordan networks, Hopfield networks, bidirectional associative memory (BAM) networks, echo state networks, independent RNNs (IndRNNs), recursive neural networks, and neural history compressors. In some embodiments, RNNs may be used for handwriting or speech recognition. It should be noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both a CNN and an RNN) are applied to jointly process the content data.
The training process is a process of calibrating all of the weights w_i of each layer of the learning model using the training data set provided at the input layer 402. The training process typically includes two steps, forward propagation and backpropagation, which are repeated multiple times until a predefined convergence condition is met. In forward propagation, the set of weights for the different layers is applied to the input data and the intermediate results from the previous layers. In backpropagation, the error magnitude of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to reduce the error. The activation function is optionally a linear function, a rectified linear unit, a sigmoid function, a hyperbolic tangent function, or another type. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias term b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias term b for each layer.
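A minimal sketch of this forward-propagation and backpropagation cycle, written with PyTorch-style autograd; the network, data, and convergence threshold are placeholders rather than details from this application.

```python
# Hypothetical training loop illustrating forward propagation, loss measurement,
# and backpropagation-based weight (and bias) updates until a convergence condition.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # placeholder NN 400
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

inputs = torch.randn(64, 16)          # placeholder training data at the input layer
targets = torch.randn(64, 1)          # placeholder ground truth

loss_threshold = 1e-3                 # assumed predefined convergence condition
for step in range(1000):
    optimizer.zero_grad()
    outputs = model(inputs)           # forward propagation through all layers
    loss = loss_fn(outputs, targets)  # error magnitude of the output
    if loss.item() < loss_threshold:
        break
    loss.backward()                   # backpropagation of the error
    optimizer.step()                  # adjust the weights w_i and bias terms b
```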
Fig. 5 is a block diagram of an example camera module 500 of the electronic device 200, according to some embodiments. The electronic device 200 includes a UDC 260 hidden behind the display panel (e.g., immediately behind the set of display pixel microstructures). The UDC 260 includes an image sensing array for receiving optical signals and converting the optical signals into raw image signals 502. The camera module 500 converts the raw image signal 502 into image data 504 that is output to one or more output devices 506. The image data 504 optionally includes images or video clips, each image or video clip including a sequence of image frames, as does the original image signal 502. The camera module 500 includes an Image Signal Processor (ISP) 203 and a UDC recovery module 229 coupled to the ISP 203.
The ISP 203 is used to convert the raw image signal 502 into the image data 504 by implementing one or more of noise reduction, auto white balance and color correction, color interpolation, lens shading correction, defective pixel correction, gamma correction, local tone mapping, auto exposure, and auto focus. The image data 504 has a higher image quality than the raw image signal 502. The ISP 203 includes a camera control module 508, a camera image processing module 510, and a camera output control module 512. The camera control module 508 is coupled to the UDC 260 and is used to control the UDC 260 to capture the raw image signal 502 (e.g., digital images and video clips) with a plurality of camera settings 514. The camera image processing module 510 is coupled to the camera control module 508 and is configured to process the raw image signal 502 and output image data that is perceptually pleasing to human vision and similar to the real scene perceived by human vision. The camera output control module 512 is coupled to the camera image processing module 510 and is used to control the transmission of the image data 504 (e.g., the generated digital images and video clips), for example, to determine whether to output the image data 504 via the output devices 506. Examples of the output devices 506 include, but are not limited to, a display device, a memory, and a wired or wireless communication module (e.g., the network communication module 216).
In some embodiments, the plurality of camera settings 514 applied by the camera control module 508 includes one or more of the following: aperture, shutter speed (i.e., exposure time), ISO speed, camera mode, metering mode, focus mode, focus area, white balance, file format, drive mode, long-exposure noise reduction, high-ISO noise reduction, color space, image stabilization, and high dynamic range (HDR)/dynamic range optimizer (DRO). For example, f/1.8-f/5.6 is applied in low light or for a narrower depth of field (DOF), and f/8-f/16 is applied for a wider DOF. The shutter speed is set between 30 seconds and 1/4000 second depending on the scene. The ISO speed ranges from 100 to 3200 in entry-level cameras and from 100 to 6400 in higher-end cameras. The camera mode is optionally a manual camera mode or an aperture-priority mode. The metering mode is selected from a matrix metering mode, a multi-metering mode, and an evaluative metering mode depending on the camera model. The focus mode is AF-S for capturing a stationary object and AF-C for capturing a moving object. The focus area is a single point for capturing a stationary object and dynamic/zone for capturing a moving object. The white balance corresponds to an auto white balance mode that is turned on or off. The file format is optionally a raw file, a JPEG file, or a file in any other image format. The drive mode is single shot for capturing a stationary object or continuous shooting for capturing a moving object. Each of long-exposure noise reduction, high-ISO noise reduction, image stabilization, and HDR/DRO is optionally turned on or off. For example, image stabilization is turned on when the UDC 260 is handheld and turned off when the UDC 260 is mounted on a tripod. The color space is one of CIELUV, CIELAB, HSLuv, RGB, YCbCr/YUV, HSV/HSL, LCh, CMYK/CMY, and other commercial or proprietary color spaces. In an example, the exposure time and ISO speed of the UDC 260 are dynamically adjusted while the camera control module 508 controls the UDC 260 to capture the raw image signal 502. Additionally, in some embodiments, the plurality of camera settings 514 are provided to the camera image processing module 510 along with the raw image signal 502 to facilitate processing of the raw image signal 502.
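Purely for illustration, a snapshot of such camera settings 514 might be represented as a configuration object; the parameter names and values below are assumptions, not an API defined by this application.

```python
# Hypothetical representation of a subset of the camera settings 514 described above.
camera_settings = {
    "aperture": "f/1.8",              # low light / narrow depth of field
    "shutter_speed_s": 1 / 125,       # exposure time, scene dependent
    "iso_speed": 800,
    "camera_mode": "aperture_priority",
    "metering_mode": "matrix",
    "focus_mode": "AF-C",             # moving subject
    "focus_area": "dynamic",
    "auto_white_balance": True,
    "file_format": "raw",
    "drive_mode": "continuous",
    "long_exposure_nr": False,
    "high_iso_nr": True,
    "image_stabilization": True,      # handheld capture
    "hdr_dro": True,
    "color_space": "YCbCr",
}
```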
The UDC recovery module 229 is coupled to the ISP 203 and is used to reduce or eliminate the image quality degradation caused by the microstructure of the display panel disposed directly in front of the UDC 260. When the UDC 260 is not enabled for image capture, the microstructure of the display panel displays a portion of an image or video on the display panel. Conversely, when the UDC 260 is enabled for image capture, the microstructure of the display panel does not display that portion of the image or video on the display panel. In an example, the lens diameter of the UDC 260 is approximately 2 mm and corresponds to the microstructure of a few display pixels (e.g., fewer than 10 pixels). A state machine controls the camera control module 508 of the ISP 203 and the display module of the display panel to operate the UDC 260 and the microstructure of the display panel in synchronization with each other.
In some embodiments, the UDC recovery module 229 includes a UDC preprocessing module 518 and an image restoration module 520. The UDC preprocessing module 518 receives the input image signal 502' (i.e., digital images and videos captured by the UDC 260) and is operative to process the input image signal 502' at a preliminary level and at least partially restore the image quality of the input image signal 502'. In some embodiments, the UDC preprocessing module 518 obtains a luminance dynamic range requirement of the image data to be processed by the image restoration module 520. Based on the luminance dynamic range requirement, the UDC preprocessing module 518 determines the HDR settings 516, including the sequence of exposure settings, and provides the HDR settings 516 to the camera control module 508 in the ISP 203. Furthermore, in some embodiments, the UDC 260 is controlled with the exposure setting sequence of the HDR settings 516 to capture a raw image signal 502 comprising an HDR image sequence. The UDC preprocessing module 518 combines the HDR image sequence to generate an input image 522 that meets the luminance dynamic range requirement, and the image restoration module 520 further processes the input image 522.
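The application does not disclose a specific rule for deriving the exposure-setting sequence from the luminance dynamic range requirement; the EV-bracketing heuristic below is a hedged, illustrative sketch only.

```python
# Hypothetical derivation of an HDR exposure bracket from a required luminance
# dynamic range. The mapping from dynamic range (in stops) to bracket size and
# EV spacing is an illustrative heuristic, not the application's method.
import math

def exposure_sequence(required_stops: float, base_exposure_s: float,
                      sensor_stops: float = 10.0, ev_step: float = 2.0):
    """Return a list of exposure times covering `required_stops` of dynamic range."""
    extra_stops = max(0.0, required_stops - sensor_stops)
    n_captures = 1 + math.ceil(extra_stops / ev_step)
    # Center the bracket around the base exposure, spaced by ev_step stops.
    offsets = [(i - (n_captures - 1) / 2) * ev_step for i in range(n_captures)]
    return [base_exposure_s * (2.0 ** o) for o in offsets]

print(exposure_sequence(required_stops=14.0, base_exposure_s=1 / 125))
# e.g., three captures at roughly 1/500 s, 1/125 s, and 1/30 s
```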
The image restoration module 520 is coupled to the UDC preprocessing module 518 and is configured to apply a deep learning technique (e.g., image restoration model) to process the preprocessed image signal 522, e.g., restore perceived image quality to the same level as a normal camera. In some embodiments, the UDC recovery module 229 further includes a UDC post-processing module 524 coupled to the image recovery module 520. The UDC post-processing module 524 is used to process the restored image signal 526 output by the image restoration module 520. In some embodiments, the techniques applied to the UDC pre-processing module 518 and the UDC post-processing module 524 are not based on deep learning techniques. Examples of such techniques are filtering, decompressing, compressing, sharpening, etc. Instead, the image restoration module 520 applies a deep learning technique to restore the original image signal 502 based on the image restoration model.
In some embodiments, the UDC recovery module 229 is coupled between the camera control module 508 and the camera image processing module 510. The UDC pre-processing module 518 receives the original image signal 502 as an input image signal 502' and returns the recovered image signal 526, which is optionally post-processed, to the camera image processing module 510 for additional image processing. Alternatively, in some embodiments, the UDC recovery module 229 is coupled within the camera image processing module 510. The UDC pre-processing module 518 receives the image signal partially processed by the camera image processing module 510 as an input image signal 502' and returns the recovered image signal 526, optionally post-processed, to the camera image processing module 510 for further image processing. Alternatively, in some embodiments, the UDC recovery module 229 is coupled between the camera image processing module 510 and the camera output control module 512. The UDC preprocessing module 518 receives the image signal output by the camera image processing module 510 as an input image signal 502', and returns a restored image signal 526, which is optionally post-processed, to the camera output control module 512. The camera output control module 512 outputs the recovered image signals 526 as image data 504 to one or more output devices 506.
In addition, in some embodiments, the UDC recovery module 229 is inserted before, in the middle of, or after the camera image processing module 510, depending on the plurality of camera settings 514. The plurality of camera settings 514 includes a camera capture configuration having a snapshot mode, a preview mode, and a video mode. In some cases, in accordance with a determination that the UDC 260 is operating in the snapshot mode, the UDC recovery module 229 is coupled between the camera control module 508 and the camera image processing module 510. In some cases, in accordance with a determination that the UDC 260 is operating in the video mode or the preview mode, the UDC recovery module 229 is coupled between the camera image processing module 510 and the camera output control module 512. In still other cases, when a third-party application invokes the camera instead of the system's native application, the UDC recovery module 229 is coupled after the camera output control module 512.
Fig. 6A is a block diagram of an example image restoration model 600 for restoring images captured by the electronic device 200, according to some embodiments, and fig. 6B is a block diagram of an example gating block 650 applied in the image restoration model 600 shown in fig. 6A, according to some embodiments. As described above, the image restoration module 520 applies a deep learning technique (e.g., the image restoration model 600) to restore the optionally post-processed input image signal 502' and generate the restored image signal 526. The input image signal 502 'is associated with the original image signal 502 captured using the UDC 260 through the display panel's microstructure, and the display panel's microstructure reduces the quality of the input image signal 502'. The input image signal 502' includes an input image 602 and the restored image signal 526 includes an output image 604 that is restored from the input image 602. The image restoration model 600 includes a masking neural network 606 and a restoration neural network 608, and is used to restore the input image 602 to the output image 604.
The input image 602 has an input image quality and the output image 604 has an output image quality that is greater than the input image quality. In some embodiments, the input image quality is indicated by the input SNR of the input image 602 and the output image quality is indicated by the output SNR of the output image 604. The output SNR is greater than the input SNR. Alternatively, in some embodiments, the input image quality is indicated by at least a level of haze, glare, blur, or light diffraction artifacts. Thus, the image restoration model 600 improves the level of SNR or haze, glare, blur, or light diffraction artifacts of the input image 602 in restoring the input image 602 from the degradation caused by the microstructure of the display panel disposed in front of the lens of the UDC 260.
The image restoration module 520 obtains the input image 602 from the image data captured by the UDC 260. In some embodiments, the electronic device 200 includes both the image restoration module 520 and the UDC 260 that captured the image data. In some embodiments, the electronic device 200 includes the image restoration module 520 and is distinct from the UDC 260. The electronic device 200 receives the input image 602 from the UDC 260 via a wired or wireless communication link. Optionally, the electronic device 200 receives the input image 602 from the UDC 260 through the server 102. The image restoration module 520 applies the masking neural network 606 to the input image 602 to generate a mask 610 that identifies one or more defective areas on the input image 602. The masking neural network 606 includes a convolutional layer sequence 612. The input image 602 and the mask 610 are combined (e.g., concatenated) to provide a masked input image 614. The image restoration module 520 applies the restoration neural network 608 to the masked input image 614 to generate the output image 604. The restoration neural network 608 includes a sequence of gating blocks 650.
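A minimal PyTorch-style sketch of this two-stage flow (mask generation, channel-wise combination, restoration); the channel widths and layer counts are assumptions, and plain convolutions stand in for the gating blocks detailed with reference to Fig. 6B.

```python
# Hypothetical sketch of the image restoration model 600: a masking network
# produces a single-channel mask 610, which is concatenated with the input
# image 602 and fed to a restoration network that outputs the restored image 604.
# Channel widths and layer counts are assumptions for illustration.
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(            # a small sequence of convolutional layers
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # grayscale mask in [0, 1]
        )

    def forward(self, x):
        return self.layers(x)

class RestoreNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(            # stand-in for the sequence of gating blocks
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x):
        return self.layers(x)

mask_net, restore_net = MaskNet(), RestoreNet()
input_image = torch.rand(1, 3, 128, 128)               # degraded UDC capture (input image 602)
mask = mask_net(input_image)                            # mask 610 identifying defective areas
masked_input = torch.cat([input_image, mask], dim=1)    # masked input image 614
output_image = restore_net(masked_input)                # restored output image 604
print(output_image.shape)                               # torch.Size([1, 3, 128, 128])
```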
Referring to fig. 6B, in some embodiments, each gating block 650 includes at least two parallel convolutional layers 652 and 654 for processing a respective gating input 656. Each parallel convolutional layer 652 or 654 receives an input feature vector (e.g., the respective gating input 656) and generates a respective output feature vector 658. In an example, each parallel convolutional layer 652 or 654 includes a depthwise convolutional layer using 1 x 1 convolution blocks. The respective output feature vectors 658 are combined to generate an output feature vector 660 of the respective gating block 650. Furthermore, in some embodiments, the gating block 650 includes three or more parallel convolutional layers for generating three or more corresponding output feature vectors 658 to be combined into the output feature vector 660. Additionally or alternatively, in some embodiments, the gating block 650 includes at least two sets of parallel convolutional layers for generating two respective output feature vectors 658. The first set of parallel convolutional layers includes the first convolutional layer 652 and one or more serial convolutional layers 662, and the second set of parallel convolutional layers includes the second convolutional layer 654 and one or more serial convolutional layers 664. Additionally, in some embodiments, the gating block 650 includes three or more sets of parallel convolutional layers for generating three or more respective output feature vectors 658 that are combined into the output feature vector 660.
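The structure of the gating block 650 can be illustrated with a short sketch in PyTorch-style Python (the patent does not prescribe an implementation). The branch composition (a depthwise 3 x 3 convolution followed by a 1 x 1 pointwise convolution) follows the description above, while the choice of sigmoid gating and element-wise multiplication for combining the two output feature vectors 658 into the output feature vector 660 is only one common way to "combine" them and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class GatingBlock(nn.Module):
    """Sketch of a gating block 650: two parallel convolution branches whose
    output feature vectors are combined (here by element-wise multiplication,
    a common gating choice; the patent only says "combined")."""
    def __init__(self, channels: int):
        super().__init__()
        # Branch 1: depthwise 3x3 followed by a 1x1 (pointwise) block.
        self.branch_a = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
        )
        # Branch 2: depthwise 3x3 followed by a 1x1 block acting as a gate.
        self.branch_b = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, gating_input: torch.Tensor) -> torch.Tensor:
        out_a = self.branch_a(gating_input)   # output feature vector 658 (branch 1)
        out_b = self.branch_b(gating_input)   # output feature vector 658 (branch 2)
        return out_a * out_b                  # combined output feature vector 660
```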
Fig. 7 is a block diagram of another example image restoration model 700 for restoring an input image acquired from image data captured by the UDC 260, in accordance with some embodiments. The image restoration model 700 includes a masking neural network 606 and a restoration neural network 608, and is implemented by the image restoration module 520 to restore the input image 602 to the output image 604. The input image 602 is acquired from image data (e.g., the raw image signal 502) captured by the UDC 260 and degraded by the microstructure of the display panel. The input image 602 has a relatively low signal-to-noise ratio, haze effects, glare, blur, light diffraction artifacts, or a combination thereof.
In some embodiments, the masking neural network 606 includes a sequence of convolutional layers 612 coupled in series with each other. Two successive convolutional layers 612 are optionally immediately adjacent to each other or separated from each other by a respective neural network layer. For example, the masking neural network 606 includes a lightweight U-Net having at least a masking encoder network 702 that includes the convolutional layer sequence 612. The masking neural network 606 generates the mask 610. In some embodiments, the mask 610 includes a gray scale map having a plurality of gray scale elements. Each gray scale element represents the intensity of the diffraction pattern at a respective pixel or pixel region, and each gray scale element has a respective gray value in the range of [0, 1]. For example, in a saturated region where information is completely lost, the gray value of the corresponding gray scale element is 0. Thus, the mask 610 identifies one or more regions on the input image 602 that have artifacts.
In some embodiments, the masking neural network 606 is trained end-to-end using a pixel-wise loss, the pixel-wise loss being represented as an average over a log-scale tone mapping, as follows:

Lpixel = (1/N) Σx | Φ(ypred(x)) − Φ(ytrue(x)) |      (1)

where each mapping element corresponds to a difference between a logarithmic-scale pixel tone Φ(ypred) and a logarithmic-scale true tone Φ(ytrue), and N is the number of pixels. The masking neural network 606 extracts the pixel tone for each pixel. In some embodiments, the input image 602 is an HDR input image having a dynamic range much greater than that of a normal low dynamic range image. The logarithmic-scale tone mapping is applied to the mask 610 generated by the masking neural network 606 and to the corresponding real data. For each pixel value x, the corresponding logarithmic-scale pixel tone Φ(x) is represented as follows:

Φ(x) = log(1 + μ·x) / log(1 + μ)      (2)

where μ is a tone correction coefficient.
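As a concrete illustration of equations (1) and (2), the following sketch (in PyTorch-style Python, which the patent does not prescribe) computes the tone-mapped pixel-wise loss. The μ-law form of Φ and the absolute-difference average are reconstructions consistent with the description above, and the value of μ is a placeholder.

```python
import torch

MU = 5000.0  # tone correction coefficient μ (the value is an assumption)

def log_tone_map(x: torch.Tensor) -> torch.Tensor:
    """Log-scale tone mapping Φ(x) = log(1 + μx) / log(1 + μ), applied per
    pixel to compress the HDR dynamic range."""
    return torch.log1p(MU * x) / torch.log1p(torch.tensor(MU))

def pixel_wise_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Average of per-pixel differences between the tone-mapped prediction
    Φ(y_pred) and the tone-mapped ground truth Φ(y_true)."""
    return torch.mean(torch.abs(log_tone_map(y_pred) - log_tone_map(y_true)))
```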
In some embodiments, the masking neural network 606 includes an encoder-decoder network (e.g., a complete U-Net) in which all channels or selected channels of the encoded feature map are temporarily stored in each encoding stage and used in skip connections 701. The encoder-decoder network includes a series of encoding stages 702, a bottleneck network 704 coupled to the series of encoding stages 702, and a series of decoding stages 706 coupled to the bottleneck network 704. The series of decoding stages 706 includes the same number of stages as the series of encoding stages 702. In an example, the encoder-decoder network has four encoding stages 702 and four decoding stages 706. The bottleneck network 704 is coupled between the encoding stages 702 and the decoding stages 706. The series of encoding stages 702, the bottleneck network 704, and the series of decoding stages 706 successively process the input image 602 to generate the mask 610. In some embodiments, the encoding stages 702 include the convolutional layer sequence 612. In some embodiments, the decoding stages 706 include a second convolutional layer sequence that follows the convolutional layer sequence 612.
The series of encoding stages 702 includes an ordered sequence of encoding stages 702 and has an encoding scale factor. Each encoding stage 702 generates an encoded feature map having a feature resolution and a plurality of encoding channels. Across the encoding stages 702, the feature resolution is reduced and the number of encoding channels is increased according to the encoding scale factor. In an example, the encoding scale factor is 2. The bottleneck network 704 is coupled to the last encoding stage 702, further processes the encoded feature map of the last encoding stage, which has the total number of encoding channels, and generates an intermediate feature map. In an example, the bottleneck network 704 includes a first set of 3 x 3 CNNs and rectified linear units (ReLU), a second set of 3 x 3 CNNs and ReLU, a global pooling network, a bilinear upsampling network, and a set of 1 x 1 CNNs and ReLU. The encoded feature map of the last encoding stage is normalized (e.g., using a pooling layer) and fed to the first set of 3 x 3 CNNs and ReLU of the bottleneck network 704. The set of 1 x 1 CNNs and ReLU of the bottleneck network 704 outputs an intermediate feature map, which is provided to the decoding stages 706. The series of decoding stages 706 includes an ordered sequence of decoding stages 706 and has a decoding upsampling factor. Each decoding stage 706 generates a decoded feature map having a feature resolution and a plurality of decoding channels. Across the decoding stages 706, the feature resolution is increased and the number of decoding channels is reduced according to the decoding upsampling factor. In an example, the decoding upsampling factor is 2.
Each decoding stage 706 extracts a subset of feature maps selected from the total number of encoding channels of the encoded feature map of the corresponding encoding stage 702. The selected feature map subset is combined with the input feature map of the corresponding decoding stage 706 using a set of neural networks. Each respective decoding stage 706 and its corresponding encoding stage 702 are symmetrical with respect to the bottleneck network 704, i.e., separated from the bottleneck network 704 by the same number of decoding stages 706 or encoding stages 702.
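A minimal sketch of such a masking encoder-decoder (U-Net) is given below in PyTorch-style Python. The four-stage structure, the scale factor of 2, the skip connections, and the single-channel mask in [0, 1] follow the description above; the channel counts, kernel sizes, pooling, and upsampling choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class MaskUNet(nn.Module):
    """Minimal encoder-decoder for mask 610: four encoding stages that halve
    resolution and double channels (scale factor 2), a bottleneck, and four
    decoding stages with skip connections to the symmetric encoding stage.
    Input height and width are assumed divisible by 16."""
    def __init__(self, in_ch=3, base=16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]            # encoding channels
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.bottleneck = conv_block(chs[-1], chs[-1])
        self.decoders = nn.ModuleList()
        for c in reversed(chs):
            self.decoders.append(conv_block(prev + c, c))     # input = upsampled features + skip channels
            prev = c
        self.head = nn.Conv2d(chs[0], 1, 1)                   # single-channel gray mask

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                                   # stored for skip connection 701
            x = F.max_pool2d(x, 2)                            # halve feature resolution
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))              # combine with encoder features
        return torch.sigmoid(self.head(x))                    # gray values in [0, 1]
```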
The mask 610 and the input image 602 are combined into a masked input image 614 (not shown in fig. 7), and the recovery neural network 608 further processes the masked input image 614. In some embodiments, the input image 602 and the mask 610 are concatenated to provide the masked input image 614. The mask 610 is used to guide the recovery neural network 608 to learn to adaptively select features based on the intensity distribution of the diffraction pattern. In some embodiments, the recovery neural network 608 includes another U-Net having a recovery encoder network 708 and a recovery decoder network 710, and the recovery encoder network 708 includes a sequence of gating blocks 650. In some cases, each gating block 650 in the sequence of gating blocks 650 includes a depthwise convolutional layer using a 1 x 1 convolutional block. This keeps a first number of network parameters and a second number of floating-point operations per second (FLOPS) required to execute the recovery neural network 608 within respective threshold numbers. Thus, when the recovery neural network 608 is implemented on a mobile device, the network inference time is controlled within a tolerance with no or little loss of restored image quality.
More specifically, in some embodiments, the recovery neural network 608 includes an encoder-decoder network (e.g., a complete U-Net) in which all channels or selected channels of the encoded feature map are temporarily stored in each encoding stage 708 and used in skip connections 707. The encoder-decoder network includes a series of encoding stages 708, a bottleneck network 712 coupled to the series of encoding stages 708, and a series of decoding stages 710 coupled to the bottleneck network 712. The series of decoding stages 710 includes the same number of stages as the series of encoding stages 708. In an example, the encoder-decoder network has four encoding stages 708 and four decoding stages 710. The bottleneck network 712 is coupled between the encoding stages 708 and the decoding stages 710. The series of encoding stages 708, the bottleneck network 712, and the series of decoding stages 710 successively process the masked input image 614 to generate the output image 604. In some embodiments, the encoding stages 708 include a convolutional layer sequence 612. In some embodiments, the decoding stages 710 include a second convolutional layer sequence following the convolutional layer sequence 612.
The series of encoding stages 708 includes an ordered sequence of encoding stages 708 and has an encoding scale factor. Each encoding stage 708 generates an encoded feature map having a feature resolution and a plurality of encoding channels. Across the encoding stages 708, the feature resolution is reduced and the number of encoding channels is increased according to the encoding scale factor. In an example, the encoding scale factor is 2. The bottleneck network 712 is coupled to the last encoding stage 708, further processes the encoded feature map of the last encoding stage, which has the total number of encoding channels, and generates an intermediate feature map. In an example, the bottleneck network 712 includes a first set of 3×3 CNNs and rectified linear units (ReLUs), a second set of 3×3 CNNs and ReLUs, a global pooling network, a bilinear upsampling network, and a set of 1×1 CNNs and ReLUs. The encoded feature map of the last encoding stage is normalized (e.g., using a pooling layer) and fed to the first set of 3×3 CNNs and ReLUs of the bottleneck network 712. The set of 1×1 CNNs and ReLUs of the bottleneck network 712 outputs an intermediate feature map, which is provided to the decoding stages 710. The series of decoding stages 710 includes an ordered sequence of decoding stages 710 and has a decoding upsampling factor. Each decoding stage 710 generates a decoded feature map having a feature resolution and a plurality of decoding channels. Across the decoding stages 710, the feature resolution is increased and the number of decoding channels is reduced according to the decoding upsampling factor. In an example, the decoding upsampling factor is 2.
Each decoding stage 710 extracts a subset of feature maps selected from the total number of encoding channels of the encoded feature map of the corresponding encoding stage 708. The selected feature map subset is combined with the input feature map of the corresponding decoding stage 710 using a set of neural networks. Each respective decoding stage 710 and its corresponding encoding stage 708 are symmetrical with respect to the bottleneck network 712, i.e., separated from the bottleneck network 712 by the same number of decoding stages 710 or encoding stages 708.
In some embodiments, the recovery neural network 608 is trained end-to-end based on a pixel-wise loss L1, a multi-scale structural similarity (MS-SSIM) loss L2, and a perceptual loss L3 based on learned perceptual image patch similarity (LPIPS). The weighted sum of these losses is expressed as follows:

Loss = L1 + λ1·L2 + λ2·L3      (3)

where λ1 and λ2 are the weights of the MS-SSIM loss and the LPIPS-based perceptual loss, respectively, relative to the pixel-wise loss. In some embodiments, the weights λ1 and λ2 are empirically adjusted such that each loss term contributes equally in equation (3).
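For illustration, equation (3) can be evaluated as a simple weighted sum once the three loss terms are available (for example, from standard MS-SSIM and LPIPS implementations). The λ values in the sketch below are placeholders rather than the empirically tuned weights.

```python
import torch

def total_loss(l1_loss: torch.Tensor,
               ms_ssim_loss: torch.Tensor,
               lpips_loss: torch.Tensor,
               lambda1: float = 0.5,
               lambda2: float = 0.1) -> torch.Tensor:
    """Weighted sum of equation (3): Loss = L1 + λ1·L2 + λ2·L3.
    The λ values are placeholders; the description says they are tuned
    empirically so that each term contributes roughly equally."""
    return l1_loss + lambda1 * ms_ssim_loss + lambda2 * lpips_loss
```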
In some embodiments, the image restoration model 600 or 700 includes a plurality of weights associated with a respective number of filters of each layer. The image restoration model 600 or 700 maintains its weights in a single-precision floating-point (float32) format. The plurality of weights of the image restoration model 600 or 700 are quantized based on a precision setting of the electronic device 200, e.g., quantized into a signed 8-bit integer (int8), unsigned 8-bit integer (uint8), signed 16-bit integer (int16), or unsigned 16-bit integer (uint16) format. For example, the electronic device 200 runs the image restoration model 600 or 700 on a CPU, and the CPU of the electronic device 200 processes 32-bit data. The weights of the image restoration model 600 or 700 are not quantized, and the image restoration model 600 or 700 is provided directly to the electronic device 200. In another example, the electronic device 200 runs the image restoration model 600 or 700 on one or more GPUs, and the GPUs process 16-bit data. The weights of the image restoration model 600 or 700 are quantized to the int16 format. In another example, the electronic device 200 runs the image restoration model 600 or 700 on a DSP, and the DSP processes 8-bit data. The weights of the image restoration model 600 or 700 are quantized to the int8 format.
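The precision-dependent quantization described above can be sketched as follows. The symmetric linear quantization scheme and the mapping from device type to format are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

# Illustrative policy mapping a device precision setting to a weight format,
# following the examples in the text.
FORMATS = {
    "cpu_32bit": None,        # keep float32 weights, no quantization
    "gpu_16bit": np.int16,
    "dsp_8bit": np.int8,
}

def quantize_weights(weights: np.ndarray, dtype):
    """Symmetric linear quantization of float32 weights to a signed integer
    format; returns the integer weights and the scale needed to dequantize
    them at inference time."""
    if dtype is None:
        return weights, 1.0
    qmax = np.iinfo(dtype).max
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(dtype)
    return q, scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)     # example conv filter weights
q_w, scale = quantize_weights(w, FORMATS["dsp_8bit"])    # int8 for an 8-bit DSP
```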
In some embodiments, the server system 102 is configured to simplify the image restoration model 600 or 700 in response to receiving a model simplification request from the electronic device 200, and the model simplification request includes information associated with the precision setting of the electronic device 200. Alternatively, in some cases, the server system 102 is configured to quantize the image restoration model 600 or 700 into a plurality of model options based on a plurality of known precision settings commonly used by different client devices 104, and to select a target image restoration model 600 or 700 from the model options in response to receiving a model simplification request from the electronic device 200. The target image restoration model 600 or 700 with the quantized weights is provided to the requesting electronic device 200.
Fig. 8 is a block diagram of an importance-based filter pruning process 800 for pruning an image restoration model 802 (e.g., the model 600 of fig. 6A or the model 700 of fig. 7), in accordance with some embodiments. In some embodiments, the filter pruning process 800 is implemented at the server system 102 to prune the image restoration model 802 into a target NN model 804 and provide the target NN model 804 to the client device 104. In some embodiments, the filter pruning process 800 is implemented directly at the client device 104 to prune the image restoration model 802 into the target NN model 804. The image restoration model 802 has multiple layers 806, and each layer 806 has a corresponding number of filters 808. The image restoration model 802 is pruned into a plurality of pruned NN models 810 (e.g., 810A, 810B, 810C, ...). In an example, each of the plurality of pruned NN models 810 has a pruned number of filters 808 and the image restoration model 802 has a first number of filters. The difference between the pruned number and the first number is equal to a predefined difference or a predefined percentage of the first number. In another example, each of the plurality of pruned NN models 810 may be operated with a respective number of FLOPS that is equal to or less than a predefined number of FLOPS, and a respective subset of the filters 808 is removed from the image restoration model 802 to obtain the respective pruned NN model 810 corresponding to the respective number of FLOPS. After the plurality of pruned NN models 810 are generated, the target NN model 804 is selected from the plurality of pruned NN models 810 based on model selection criteria (e.g., by AutoML).
Specifically, for each pruned NN model 810, a respective different set of importance coefficients is assigned to each of the multiple layers 806 in the image restoration model 802. For example, a first set of importance coefficients (e.g., A1 and B1) is assigned to the first layer 806A, and a second set of importance coefficients (e.g., A2 and B2) is assigned to the second layer 806B. A third set of importance coefficients (e.g., A3 and B3) is assigned to the third layer 806C, and a fourth set of importance coefficients (e.g., A4 and B4) is assigned to the fourth layer 806D. An importance score I is determined for each filter 808 based on the respective set of importance coefficients of the respective layer 806 to which the respective filter 808 belongs. For example, the third layer 806C includes a filter 808A, and an importance score I is determined for the filter 808A based on the third set of importance coefficients A3 and B3 assigned to the third layer 806C. The filters 808 of the overall image restoration model 802 are ranked based on the importance score I of each filter 808. According to the ranking of the filters 808, a respective subset of the filters 808 is removed based on the importance scores I of the filters 808, allowing the image restoration model 802 to be pruned into the respective pruned NN model 810. Specifically, each of the plurality of pruned NN models 810 has a pruned number of filters 808 that satisfies the predefined difference, percentage, or number of FLOPS, and the highest-ranked filters 808 up to the pruned number are selected based on the importance score I of each filter 808 to generate the corresponding pruned NN model 810.
In some embodiments, the importance score I of each filter 808 is generated by combining the weights Wi of the respective filter 808 and the respective set of importance parameters of the respective layer 806 to which the respective filter 808 belongs. For each pruned NN model 810, if the different sets of importance parameters include a first importance parameter Al and a second importance parameter Bl for each layer 806, then, using the first importance parameter and the second importance parameter, the weight values Wi of each filter 808 in the corresponding layer 806 are mapped to Al·||Wi||2 (an L2-norm term) and Bl·||Wi||1 (an L1-norm term), which are combined (e.g., summed) to generate the importance score I of each filter 808. Each of the plurality of pruned NN models 810 has a pruned number of filters 808, and the highest-ranked filters 808 up to the pruned number are selected based on the importance score I of each filter 808 to generate the corresponding pruned NN model 810, while the lower-ranked filters 808 are removed.
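A sketch of the importance-based ranking and pruning is given below. Summing the Al-scaled L2-norm term and the Bl-scaled L1-norm term into a single score I, and the keep-ratio stopping rule, are assumptions used only to make the example concrete.

```python
import numpy as np

def importance_scores(layer_weights: np.ndarray, A: float, B: float) -> np.ndarray:
    """Importance score of each filter in one layer 806:
    I = A * ||W_i||_2 + B * ||W_i||_1, where W_i are the weights of filter i
    (combining the two norm terms by summation is an assumption)."""
    flat = layer_weights.reshape(layer_weights.shape[0], -1)   # one row per filter
    l2 = np.linalg.norm(flat, ord=2, axis=1)
    l1 = np.linalg.norm(flat, ord=1, axis=1)
    return A * l2 + B * l1

def prune(model_layers, coeffs, keep_ratio=0.7):
    """Rank all filters of the model by importance and keep the top fraction."""
    scores = []                                                # (layer index, filter index, score)
    for l, (w, (A, B)) in enumerate(zip(model_layers, coeffs)):
        for i, s in enumerate(importance_scores(w, A, B)):
            scores.append((l, i, s))
    scores.sort(key=lambda t: t[2], reverse=True)
    kept = scores[: int(len(scores) * keep_ratio)]             # highest-ranked filters survive
    return {(l, i) for l, i, _ in kept}

# Example: two layers with random weights and per-layer (A_l, B_l) pairs.
layers = [np.random.randn(16, 3, 3, 3), np.random.randn(32, 16, 3, 3)]
kept_filters = prune(layers, coeffs=[(1.0, 0.5), (0.8, 0.2)], keep_ratio=0.7)
```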
In some embodiments, the pruning process 800 includes a first range search and a second precision search. In the first range search, a pool of (Al, Bl) pairs is randomly selected over a wide range such that the filters have different importance scores I. The pruning criterion of the network is applied to rank the filters 808 by importance score I and remove low-ranked filters 808. The pool of (Al, Bl) pairs gives a search space S1 of different pruned network structures 810. For evaluation, each pruned architecture 810 in S1 is fine-tuned by several gradient steps, and the resulting loss is used as the criterion for the corresponding (Al, Bl) pair. After the pruning process 800 is implemented over the search space S1, one or more target NN models 804 are identified (e.g., as having less loss than the remaining pruned NN models 810). In the second precision search, the search space S1 is narrowed. The importance score I is optionally selected as a key of the pruning criterion, and the pool of (Al, Bl) pairs is learned and determined via a training process. For example, based on the (Al, Bl) pairs of the first range search, another network architecture search space S2 is generated using a regularized evolutionary algorithm (EA). Each architecture in S2 is fine-tuned by several gradient steps and evaluated, and its loss on the validation set is used as the criterion for the corresponding (Al, Bl) pair. Further (Al, Bl) pairs may be sampled until an optimized potential network structure is found.
Fig. 9 is a flowchart of an example process 900 for preprocessing image data to be restored using an image processing network (e.g., the image restoration model 600 or 700), according to some embodiments. In some cases, light needs to pass through the microstructures of a set of display pixels to reach the lens system of the UDC 260, making the light transmittance quite low. Under normal to low illumination conditions, the SNR of each frame captured by the UDC 260 is low. In contrast, under high illumination conditions, especially when there is a bright light source in the field of view of the UDC 260, the UDC 260 is susceptible to objectionable haze and diffraction pattern artifacts. As a result, the image quality of images taken by the UDC 260 is severely compromised under most lighting conditions, including high, normal, and low illumination. The image preprocessing process 900 is performed in the UDC preprocessing module 518 to enhance, in conjunction with the image processing network, the image quality of these images.
In some cases, each image captured by the UDC 260 may have a low SNR due to the low light transmittance. To improve the SNR, the respective exposure time is increased, while the respective exposure time is controlled to be less than a high exposure time limit (e.g., 100 ms). The high exposure time limit corresponds to a motion blur tolerance: if the exposure time of a single capture exceeds the exposure time limit, the UDC 260 cannot remain stable against hand trembling and produces a blurred image. Alternatively, in some embodiments, the image preprocessing process 900 is implemented based on a multi-frame image capture method for generating the input image 602 of the image restoration model 700 from a plurality of images 902. The plurality of images 902 extends the dynamic range of the input image 602. The electronic device 200 obtains the HDR settings 516 comprising the exposure setting sequence and controls the UDC 260 with the exposure setting sequence to capture the HDR image sequence 902. Each HDR image 902 has a shorter exposure time (e.g., ≤ 100 ms). The HDR image sequence 902 is combined into the input image 602. For example, rather than taking a single image with an extended exposure time of 500 ms, the UDC 260 takes a sequence 902 of five images, each with an exposure time of 100 ms, and merges the five images 902 into the input image 602. The input image 602 merged from the five images 902 taken with an exposure time of 100 ms receives the same number of photons as an input image 602 taken with an extended exposure time of 500 ms. When merging the image sequence 902, image processing techniques are applied to suppress the motion blur level of the input image 602 merged from the image sequence 902 to within the motion blur tolerance.
In contrast, in some cases, one or more bright light sources (e.g., the sun, a floodlight) are present within or near the field of view of the UDC 260 and cause light diffraction effects. The image taken by the UDC 260 contains objectionable light source diffraction patterns and strong haze artifacts, with large-area saturation, in the area around the one or more bright light sources. In some embodiments, the input image 602 is captured with a shortened exposure time to mitigate saturation of the diffraction patterns and haze artifacts. However, when the exposure time of the input image 602 is below a low exposure time limit (e.g., < 1 ms), the SNR performance of the darker areas of the input image 602 drops below an SNR margin. Alternatively, in some embodiments, the image sequence 902 is captured with different exposure times and the image sequence 902 is combined into the input image 602. The input image 602 includes one or more bright areas represented by pixels from one or more images 902 having the shortest exposure time, one or more dark areas represented by pixels from one or more images 902 having the longest exposure time, and a remaining area. Each pixel in the remaining area is represented by a weighted sum of a subset or all of the corresponding pixels of the image sequence 902.
The images of the image sequence 902 are aligned with each other before the image sequence 902 is combined to generate the input image 602. Optionally, each image in a first subset of the image sequence 902 is shifted laterally and/or vertically with respect to a second subset of the image sequence 902, and each image in the first subset is merged with the second subset. Optionally, each image in the first subset of the image sequence 902 is rotated with respect to the second subset of the image sequence 902, and each image in the first subset is merged with the second subset. Optionally, each image in the first subset of the image sequence 902 is divided into tiles, and each tile is shifted laterally and/or vertically with respect to the second subset of the image sequence 902 and merged with the second subset.
In an example, each image in the image sequence 902 is taken with the same exposure time t0. The image sequence 902 includes N images that are aligned and combined (e.g., with the same weight) to generate the input image 602. The input image 602 has a first SNR substantially equal to that of a single image taken with an exposure time of N·t0, while the input image is less susceptible to hand trembling and has a better motion blur level than the single image.
In another example, each image in the image sequence 902 has a respective exposure time, and the exposure times of the image sequence 902 are controlled independently of each other. A first subset of images 902A has first respective exposure times within a long exposure time range, and a second subset of images 902B has second respective exposure times within an intermediate exposure time range. Each second respective exposure time is shorter than any first respective exposure time within the long exposure time range. A third subset of images 902C has third respective exposure times within a short exposure time range. Each third respective exposure time is shorter than any exposure time within the intermediate exposure time range. The electronic device aligns the first subset of images 902A and the second subset of images 902B with the third subset of images 902C. After alignment, the first subset of images 902A, the second subset of images 902B, and the third subset of images 902C are combined with one another to generate the input image 602 to be further processed by the image processing network (e.g., the image restoration model 600 or 700). In some embodiments, pixels of one or more dark regions of the input image 602 are taken from the first subset of images 902A, and pixels of one or more bright regions are taken from the third subset of images 902C. Each pixel in a dark region has a brightness level below a first brightness threshold, and each pixel in a bright region has a brightness level above a second brightness threshold. The one or more remaining regions are different from, and do not overlap (e.g., are complementary to), the combination of the one or more bright regions and the one or more dark regions. Optionally, each pixel of the one or more remaining regions is a weighted combination of the corresponding pixels of the second subset of images 902B. Optionally, each pixel of the one or more remaining regions is a weighted combination of the corresponding pixels of all images in the image sequence 902.
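The region-based merge described above can be sketched as follows. The brightness thresholds, the equal blend weights, and the assumption that the frames are already aligned and normalized to a common radiance scale are illustrative choices.

```python
import numpy as np

def merge_hdr(long_exp, mid_exp, short_exp, dark_thr=0.1, bright_thr=0.9):
    """Merge aligned, exposure-normalized frames (each array shaped
    (N, H, W, 3), values in [0, 1]) into the input image 602: dark regions
    come from the long exposures, bright regions from the short exposures,
    and remaining pixels are a weighted blend (equal weights here) of the
    intermediate exposures."""
    mid = np.mean(mid_exp, axis=0)                   # weighted sum of subset 902B
    luma = np.mean(mid, axis=-1, keepdims=True)      # per-pixel brightness estimate
    dark = luma < dark_thr                           # below the first brightness threshold
    bright = luma > bright_thr                       # above the second brightness threshold
    out = np.where(dark, np.mean(long_exp, axis=0), mid)
    out = np.where(bright, np.mean(short_exp, axis=0), out)
    return out
```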
Referring to fig. 9, in some embodiments, the image sequence 902 includes (904) N images and a subset 902D of the image sequence includes M images. M and N are both positive integers, and M is less than N. A remaining image subset 902E includes the N-M remaining images in the image sequence 902. The subset 902D of the image sequence is captured with a first exposure time, and each image in the remaining subset 902E is captured with a unique and different exposure time that is less than the first exposure time. In some embodiments, the remaining subset of images 902E includes a second image captured with a second exposure time that is shorter than the first exposure time. Further, in some embodiments, the remaining subset of images 902E further includes a third image taken after the second image with a third exposure time that is shorter than the second exposure time. In an example, the exposure times of the N-M remaining images decrease sequentially. The images of the subset 902D are aligned and combined (906) with each other to generate a first image. For example, the images of the subset 902D are averaged on a pixel basis to generate the first image. The first image and the remaining image subset 902E are further aligned and combined (908) to generate a single HDR image (e.g., the input image 602). In some cases, only one or more bright regions of the remaining image subset 902E are extracted and incorporated into the first image, e.g., the one or more extracted bright regions replace corresponding regions of the first image.
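The two-stage merge of steps 906 and 908 can be sketched as follows. The brightness threshold used to detect near-saturated regions and the exposure-ratio scaling are assumptions for illustration; the frames are assumed to be aligned already.

```python
import numpy as np

def preprocess_sequence(frames, exposures, m, bright_thr=0.9):
    """Sketch of process 900: average the first M frames captured with the
    first (longest) exposure time into a first image (step 906), then replace
    its bright, near-saturated regions with the corresponding regions of the
    shorter-exposure remaining frames, scaled to the same radiance (step 908).
    `frames` has shape (N, H, W, 3) and `exposures` lists the N exposure times."""
    first = np.mean(frames[:m], axis=0)                   # merge subset 902D
    luma = np.mean(first, axis=-1, keepdims=True)
    bright = luma > bright_thr                            # saturated regions of the first image
    for frame, exp in zip(frames[m:], exposures[m:]):     # remaining subset 902E
        scaled = frame * (exposures[0] / exp)             # compensate the shorter exposure
        first = np.where(bright, scaled, first)
    return first                                          # single HDR input image 602
```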
The process 900 of preprocessing the image sequence 902 is implemented in the UDC preprocessing module 518 of the UDC recovery module 229. The image restoration module 520 further processes (910) the input image 602 using a deep learning based image processing network. The input image 602 has an input image quality, and the output image 604 output by the image restoration module 520 has an output image quality greater than the input image quality. In some embodiments, the input image quality is indicated by an input SNR of the input image 602 and the output image quality is indicated by an output SNR of the output image 604. The output SNR is greater than the input SNR.
Referring to fig. 3, the model training module 226 and the data processing module 228 are applied to complete the processing of the input image 602 based on the image processing network. In some embodiments, the model training module 226 is implemented at the server 102 to determine the image processing network based on training data 306, and the data processing module 228 is implemented at a client device 104 that is different from the server 102. The client device 104 obtains the image processing network from the server 102 and uses the image processing network to reduce diffraction artifacts, blurring, and noise. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both implemented at the electronic device 200 (e.g., the client device 104) to determine the image processing network based on the training data 306 and to reduce diffraction artifacts, blurring, and noise in the input image 602 using the image processing network.
Specifically, the image processing network is trained using training data 306 that includes a large number of image pairs, and each image pair includes a degraded image and a real image. The degraded images of the training data 306 include or simulate images taken by the UDC 260 that have diffraction artifacts, blurring, and noise. The image processing network is used to learn a series of linear and nonlinear operations that restore the perceptual viewing quality of the degraded image to match the perceptual viewing quality of the real image in each image pair. In some embodiments, the image processing network includes a deep convolutional neural network having a plurality of convolutional layers and activation functions. During training, the image processing network performs complex linear and nonlinear operations on the degraded image, and the weights w and/or biases b of the convolutional layers are adjusted to restore the degraded image and generate a restored image. An image quality comparison is performed between the restored image and the corresponding real image, and the loss is monitored to provide feedback on whether more training image pairs are needed to train the image processing network. Training of the image processing network is completed when the image quality comparison meets a predefined requirement (e.g., the loss is less than a loss threshold, or the loss is minimized).
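The training procedure described above follows a standard supervised loop over (degraded, real) image pairs. In the sketch below, the Adam optimizer, the L1 criterion, the learning rate, and the stopping rule are assumptions standing in for choices the text does not specify.

```python
import torch
import torch.nn as nn

def train_restoration_network(network, loader, epochs=10, loss_threshold=1e-3):
    """Generic supervised training loop over (degraded, ground-truth) image
    pairs; the optimizer, learning rate, and stopping rule are assumptions."""
    optimizer = torch.optim.Adam(network.parameters(), lr=1e-4)
    criterion = nn.L1Loss()
    for epoch in range(epochs):
        running = 0.0
        for degraded, real in loader:                 # each image pair of training data 306
            restored = network(degraded)              # linear and nonlinear ops on the degraded image
            loss = criterion(restored, real)          # compare the restored image with the real image
            optimizer.zero_grad()
            loss.backward()                           # adjust weights w and biases b
            optimizer.step()
            running += loss.item()
        if running / len(loader) < loss_threshold:    # training ends when the loss meets the requirement
            break
    return network
```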
In some embodiments, the image processing network corresponds to a brightness dynamic range requirement 912 of the image data (e.g., the input image 602). The HDR setting 516 includes a sequence of exposure settings (e.g., exposure times), and the HDR setting 516 is determined according to the brightness dynamic range requirement. In this way, the input image 602, which is combined from the image sequence 902, meets the brightness dynamic range requirement 912.
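One simple way to derive an exposure setting sequence from a brightness dynamic range requirement is to space exposures by stops, as in the hypothetical helper below; the halving policy and the 100 ms cap are assumptions tied to the exposure limits discussed above, not the patent's exact rule.

```python
def exposure_sequence(base_exposure_ms, required_stops, max_exposure_ms=100.0):
    """Derive an HDR exposure-setting sequence from a luminance dynamic range
    requirement expressed in stops (EV). Each successive exposure is halved,
    so N+1 exposures cover N stops; capping at max_exposure_ms respects the
    motion-blur-driven high exposure time limit."""
    n_frames = int(required_stops) + 1
    longest = min(base_exposure_ms, max_exposure_ms)
    return [longest / (2 ** i) for i in range(n_frames)]

# Example: a 6-stop requirement starting from a 100 ms longest exposure
settings = exposure_sequence(100.0, required_stops=6)
# -> [100.0, 50.0, 25.0, 12.5, 6.25, 3.125, 1.5625]
```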
In some embodiments, the image processing network includes a masking neural network 606 and a restoration neural network 608. A masking neural network is applied to the input image 602 to generate a mask 610 that identifies one or more degradation regions on the input image 602. Masking neural network 606 includes a sequence of convolutional layers. A recovery neural network 608 is applied to the mask 610 to generate the output image 604. The recovery neural network 608 includes a sequence of gating blocks 650, and each gating block 650 includes two parallel convolutional layers 652 and 654 for processing a respective gating input 656. Further, in some embodiments, referring to fig. 7, each of masking neural network 606 and recovery neural network 608 includes a respective U-Net having a decoder network and an encoder network. The encoder network 702 and the decoder network 706 of the masking neural network 606 are based on the convolutional layer sequence 612. The encoder network 708 and decoder network 710 of the recovery neural network 608 are based on the sequence of gating blocks 650. Alternatively, in some embodiments, the image processing network comprises a context aggregation network. Accordingly, the neural network structure is applied to restore image data or video data captured by the UDC 260 to high quality visual data in real time. Particularly when the image processing network (e.g., image restoration model 600 or 700) is executed on a mobile device having limited computing resources, the delay for restoring image data or video data is controlled below a delay threshold.
In some embodiments, referring to fig. 5, the HDR image sequence 902 consists of original images extracted from the input image signal 502' captured by an image sensor of the UDC 260, and the output image 604 (e.g., in the restored image signal 526) is provided to the image signal processor 203 of the electronic device 200 for further processing. Alternatively, in some embodiments, the HDR image sequence 902 consists of color images provided by the image signal processor 203 of the electronic device 200, and the output image 604 (e.g., in the restored image signal 526) is provided to the image signal processor 203 for further processing or output. For example, the output image 604 is provided to the camera image processing module 510 of the image signal processor 203 for further processing, or to the camera output control module 512 for output.
Fig. 10A and 10B compare an example original image 1000 and an image 1020 restored from the example original image 1000, according to some embodiments. Fig. 10C and 10D compare another example original image 1040 with an image 1060 restored from the example original image 1040, according to some embodiments. For privacy concerns, facial features are covered in fig. 10A-10D. The original images 1000 and 1040 are captured by the UDC 260 and optionally preprocessed based on the HDR settings. The image restoration module 520 processes the original images 1000 and 1040, or the preprocessed images 1000 or 1040, using the image restoration model 600 or 700 to suppress image degradation associated with the UDC 260. Haze effects exist in the original images 1000 and 1040 and are removed in the restored images 1020 and 1060. The store name 1002 is not discernable in the original image 1000 due to the haze effect, and is restored and discernable in the restored image 1020. The SNR and sharpness of the restored images 1020 and 1060 are improved compared to the corresponding original images 1000 and 1040. In some embodiments, the resolution of the original images 1000 and 1040 is also enhanced. Thus, the image restoration model 600 or 700 is a multitasking and lightweight neural network architecture that includes spatial attention mechanisms suitable for image restoration tasks (particularly for the UDC 260). Such an image restoration model 600 or 700 improves the image quality of the original images 1000 and 1040 by denoising, dehazing, super resolution, and diffraction removal, and the image restoration model 600 or 700 can be executed efficiently on edge devices (e.g., mobile devices with limited computing, power, and storage resources).
Fig. 11 is a flow chart of an example image restoration method 1100 according to some embodiments. For convenience, method 1100 is described as being implemented by electronic device 200 (e.g., mobile phone 104C). Method 1100 is optionally managed by instructions stored in a non-transitory computer-readable storage medium and executed by one or more processors of a computer system. Each of the operations shown in fig. 11 may correspond to instructions stored in a computer memory or a non-transitory computer readable storage medium (e.g., memory 206 in fig. 2). The computer readable storage medium may include a magnetic or optical disk storage device, a solid state storage device such as flash memory, or other non-volatile storage device. The instructions stored on the computer-readable storage medium may include one or more of source code, assembly language code, object code, or other instruction formats that are interpreted by one or more processors. Some of the operations in method 1100 may be combined and/or the order of some of the operations may be changed.
The electronic device 200 obtains (1102) an input image 602 from image data captured by the UDC 260 and applies (1104) a masking neural network 606 to the input image 602 to generate a mask 610 identifying one or more defective areas on the input image 602. The masking neural network 606 includes (1106) a convolutional layer sequence 612. The mask 610 and the input image 602 are combined (1108) to provide a masked input image 614. The electronic device 200 applies (1110) a restoration neural network 608 to the masked input image 614 to generate the output image 604. The restoration neural network 608 includes (1112) a sequence of gating blocks 650, and each gating block 650 includes (1114) two parallel convolutional layers 652 and 654 for processing a respective gating input 656. The input image 602 has (1116) an input image quality, and the output image 604 has an output image quality greater than the input image quality.
In some embodiments, referring to FIG. 7, masking neural network 606 includes (1118) a lightweight U-Net having at least a masking encoder network 702 that includes a convolutional layer sequence 612.
In some embodiments, the mask 610 includes (1120) a gray scale map having a plurality of gray scale elements, and each gray scale element represents an intensity of the diffraction pattern at a respective pixel or pixel region and has a respective gray value in the range of [0, 1].
In some embodiments, the electronic device 200 trains the masking neural network 606 end-to-end using a pixel-wise loss, and the pixel-wise loss is an average of a logarithmic-scale tone mapping. Each mapping element corresponds to a difference between a logarithmic-scale pixel tone and a logarithmic-scale true tone of the corresponding element. The logarithmic-scale pixel tone is extracted by the masking neural network 606.
In some embodiments, the mask 610 and the input image 602 are combined by concatenating the input image 602 and the mask 610 to provide the masked input image 614.
In some embodiments, each of the two parallel convolutional layers 652 and 654 includes a depthwise convolutional layer using a 1 x 1 convolutional block.
In some embodiments, referring to FIG. 7, the recovery neural network 608 includes (1122) a U-Net with a recovery encoder network 708 and a recovery decoder network 710. The recovery encoder network 708 includes a sequence of gating blocks 650.
In some embodiments, the recovery neural network 608 is trained end-to-end based on a weighted sum of a pixel-wise loss, a multi-scale structural similarity (MS-SSIM) loss, and a perceptual loss based on learned perceptual image patch similarity (LPIPS).
In some embodiments, the masking neural network 606 includes a plurality of first filters operated with a plurality of first weights and a plurality of first biases. The recovery neural network 608 includes a plurality of second filters operated with a plurality of second weights and a plurality of second biases. Each weight and each bias is quantized according to the precision setting of the electronic device 200.
In some embodiments, the electronic device 200 obtains a luminance dynamic range requirement of the image data to be processed by the masking neural network 606 and the restoration neural network 608. Based on the luminance dynamic range requirement, the electronic device 200 determines an HDR setting comprising a sequence of exposure settings. Furthermore, in some embodiments, the UDC 260 is controlled with the exposure setting sequence to capture a sequence of HDR images, and the HDR image sequence is combined to generate the input image 602 meeting the luminance dynamic range requirement.
In some embodiments, the input image 602 is an original image extracted from a signal captured by an image sensor of the UDC 260 and the output image 604 is provided to the image signal processor 203 of the electronic device 200 for further processing.
In some embodiments, the input image is a color image provided by the image signal processor 203 of the electronic device 200, and the output image 604 is provided to the image signal processor 203 for further processing or output.
In some embodiments, the input image quality is indicated by the input SNR of the input image 602 and the output image quality is indicated by the output SNR of the output image 604. The output SNR is greater than the input SNR.
It should be understood that the particular order of operations that have been described in fig. 11 is merely exemplary and is not intended to indicate that the order described is the only order in which operations may be performed. Those of ordinary skill in the art will recognize various methods of restoring an image. In addition, it should be noted that the details of the other processes described above with respect to fig. 5-8 also apply in a manner similar to the method 1100 described above with respect to fig. 11. For brevity, these details are not repeated herein.
Fig. 12 is a flow chart of an example image processing method 1200 according to some embodiments. For convenience, the method 1200 is described as being implemented by the electronic device 200 (e.g., the mobile telephone 104C). Method 1200 is optionally managed by instructions stored in a non-transitory computer readable storage medium and executed by one or more processors of a computer system. Each of the operations shown in fig. 12 may correspond to instructions stored in a computer memory or a non-transitory computer readable storage medium (e.g., memory 206 in fig. 2). The computer readable storage medium may include a magnetic or optical disk storage device, a solid state storage device such as flash memory, or other non-volatile storage device. The instructions stored on the computer-readable storage medium may include one or more of source code, assembly language code, object code, or other instruction formats that are interpreted by one or more processors. Some of the operations in method 1200 may be combined and/or the order of some of the operations may be changed.
The electronic device 200 obtains (1202) a High Dynamic Range (HDR) setting comprising a sequence of exposure settings and controls (1204) the UDC 260 with the sequence of exposure settings to capture a sequence of HDR images 902. The electronic device 200 combines (1206) the HDR image sequence 902 to generate an input image 602 and processes (1208) the input image 602 using an image processing network to generate an output image 604. The input image 602 has (1210) an input image quality, and the output image 604 has an output image quality that is greater than the input image quality.
In some embodiments, the electronic device 200 obtains (1212) a luminance dynamic range requirement 912 of the image data to be processed by the image processing network. Based on the luminance dynamic range requirement 912, the electronic device 200 determines (1214) the HDR setting 516 comprising the sequence of exposure settings, such that the input image 602 combined from the HDR image sequence 902 meets the luminance dynamic range requirement.
In some embodiments, the exposure setting sequence includes a first exposure time and a second exposure time that is shorter than the first exposure time. The HDR image sequence 902 includes a first plurality of consecutive images 902D and a second image subsequent to the first plurality of consecutive images. A first plurality of consecutive images 902D are taken with a first exposure time and a second image is taken with a second exposure time. Further, in some embodiments, the exposure setting sequence further includes a third exposure time that is shorter than the second exposure time. The HDR image sequence 902 further comprises a third image. A third image is taken with a third exposure time, and the third image follows the second image.
In some embodiments, the sequence of exposure settings includes (1216) a first exposure time and a sequence of decreasing exposure times shorter than the first exposure time. The HDR image sequence 902 includes (1218) a first plurality of consecutive images 902D and a second plurality of consecutive images 902E immediately following the first plurality of consecutive images 902D. The first plurality of consecutive images 902D are captured (1220) with a first exposure time and each of the second plurality of consecutive images 902E is captured with a respectively different exposure time of a decreasing sequence of exposure times. Further, in some embodiments, the HDR image sequence 902 is combined by aligning and merging (1222) the first plurality of consecutive images 902D to the first image and aligning and merging (1224) the first image and the second plurality of consecutive images 902E to the input image 602.
In some embodiments, the HDR image sequence 902 includes a first subset of images 902A having first respective exposure times, a second subset of images 902B having second respective exposure times, and a third subset of images 902C having third respective exposure times. Each second respective exposure time is shorter than the first respective exposure times and longer than the third respective exposure times. Further, in some embodiments, the input image 602 includes one or more bright regions, one or more dark regions, and one or more remaining regions that are different from the one or more bright regions and the one or more dark regions. The HDR image sequence 902 is combined to generate the input image 602 by taking pixels of the one or more dark regions from the first subset of images 902A, taking pixels of the one or more bright regions from the third subset of images 902C, and determining each pixel of the one or more remaining regions based on a weighted combination of the corresponding pixels of the second subset of images 902B. Additionally, in some embodiments, each pixel of the one or more remaining regions is determined based on a weighted combination of the corresponding pixels of all images in the HDR image sequence 902.
In some embodiments, the HDR image sequence 902 is an original image extracted from a signal captured by an image sensor of the UDC 260, and the output image 604 is provided to the image signal processor 203 of the electronic device 200 for further processing.
In some embodiments, the HDR image sequence 902 is a color image provided by the image signal processor 203 of the electronic device 200, and the output image 604 is provided to the image signal processor 203 for further processing or output.
In some embodiments, the electronic device 200 processes the input image 602 using an image processing network by applying a masking neural network 606 to the input image 602 to generate a mask 610 identifying one or more degradation regions on the input image 602 and applying a restoration neural network 608 to the mask 610 to generate the output image 604. Masking neural network 606 includes convolutional layer sequence 612. The recovery neural network 608 includes a sequence of gating blocks 650, each gating block 650 including two parallel convolutional layers 652 and 654 for processing a respective gating input 656. Furthermore, in some embodiments, each of masking neural network 606 and recovery neural network 608 includes a respective U-Net having an encoder network and a decoder network. The encoder network 702 and the decoder network 706 of the masking neural network 606 are based on the convolutional layer sequence 612. The encoder network 708 and decoder network 710 of the recovery neural network 608 are based on the sequence of gating blocks 650.
In some embodiments, the input image quality is indicated by an input SNR of the input image 602 and the output image quality is indicated by an output SNR of the output image 604, the output SNR being greater than the input SNR.
It should be understood that the particular order of operations that have been described in fig. 12 is merely exemplary and is not intended to indicate that the order described is the only order in which operations may be performed. Those of ordinary skill in the art will recognize various methods of processing images. In addition, it should be noted that the details of the other processes described above with respect to fig. 5-11 also apply in a manner similar to the method 1200 described above with respect to fig. 12. For brevity, these details are not repeated herein.
The terminology used in the description of the various described embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting. In the description of the various described embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
As used herein, the term "if" is optionally interpreted as "when" or "in response to a determination" or "in response to a detection" or "according to a determination" depending on the context. Likewise, the phrase "if a determination" or "if a [ condition or event ] is detected" is optionally interpreted in the context of "at the time of determination" or "in response to a determination" or "at the time of detection of [ condition or event ]" or "in response to detection of [ condition or event ]" or "in accordance with a determination of [ condition or event ] is detected".
The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of operation and the practical application, thereby enabling others skilled in the art to understand them.
Although the various figures show some logic stages in a particular order, the stages that are not order dependent may be reordered and other stages may be combined or split. While some reordering or other groupings are specifically mentioned, other ordering or groupings will be apparent to those of ordinary skill in the art, and thus the ordering and groupings described herein are not an exhaustive list of alternatives. Further, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.

Claims (16)

1. An image processing method implemented in an electronic device and comprising:
obtaining a High Dynamic Range (HDR) setting, the HDR setting comprising an exposure setting sequence;
controlling an under-screen camera with the exposure setting sequence to capture an HDR image sequence;
combining the HDR image sequence to generate an input image; and
processing the input image using an image processing network to generate an output image, wherein the input image has an input image quality and the output image has an output image quality that is greater than the input image quality.
2. The method of claim 1, further comprising:
acquiring the brightness dynamic range requirement of the image data to be processed by the image processing network; and
determining the HDR setting comprising the sequence of exposure settings according to the brightness dynamic range requirement, wherein the input image combined from the HDR image sequence meets the brightness dynamic range requirement.
3. The method according to claim 1 or 2, wherein:
the exposure setting sequence includes a first exposure time and a second exposure time shorter than the first exposure time;
the HDR image sequence comprises a plurality of first continuous images and a second image subsequent to the plurality of first continuous images; and
The plurality of first continuous images are captured with the first exposure time and the second image is captured with the second exposure time.
4. A method according to claim 3, wherein:
the exposure setting sequence further includes a third exposure time shorter than the second exposure time;
the HDR image sequence further comprises a third image; and
the third image is captured with the third exposure time, the third image being subsequent to the second image.
5. The method according to claim 1 or 2, wherein:
the exposure setting sequence includes a first exposure time and a decreasing exposure time sequence shorter than the first exposure time;
the HDR image sequence comprises a first plurality of consecutive images and a second plurality of consecutive images immediately following the first plurality of consecutive images; and
the first plurality of consecutive images is captured with the first exposure time and each of the second plurality of consecutive images is captured with the decreasing sequence of exposure times respectively different exposure times.
6. The method of claim 5, wherein combining the HDR image sequence further comprises:
aligning and merging the first plurality of consecutive images into a first image; and
aligning and merging the first image and the second plurality of consecutive images into the input image.
7. The method of claim 1 or 2, wherein the HDR image sequence comprises a first subset of images having a first respective exposure time, a second subset of images having a second respective exposure time, and a third subset of images having a third respective exposure time, each second respective exposure time being shorter than the first respective exposure time and longer than the third respective exposure time.
8. The method of claim 7, wherein:
the input image includes one or more bright regions, one or more dark regions, and one or more remaining regions different from the one or more bright regions and the one or more dark regions; and
combining the HDR image sequence to generate the input image further comprises:
employing pixels of the one or more dark regions from the first subset of images;
employing pixels of the one or more bright regions from the third subset of images; and
determining each pixel of the one or more remaining regions based at least on a weighted combination of corresponding pixels of the second subset of images.
9. The method of claim 8, wherein each pixel of the one or more remaining regions is determined based on a weighted combination of corresponding pixels of all images in the HDR image sequence.
10. The method of any of the preceding claims, wherein the HDR image sequence comprises raw images extracted from signals captured by an image sensor of the under-display camera, and the output image is provided to an image signal processor of the electronic device for further processing.
11. The method of any of claims 1-6, wherein the HDR image sequence comprises color images provided by an image signal processor of the electronic device, and the output image is provided to the image signal processor for further processing or output.
12. The method of any of the preceding claims, wherein processing the input image using the image processing network further comprises:
applying a masking neural network to the input image to generate a mask identifying one or more degradation regions on the input image, the masking neural network comprising a sequence of convolutional layers; and
applying a restoration neural network to the mask to generate the output image, the restoration neural network comprising a sequence of gating blocks, each gating block comprising two parallel convolutional layers for processing a respective gating input.
13. The method according to claim 12, wherein:
each of the masking neural network and the restoration neural network includes a respective U-Net having an encoder network and a decoder network;
the encoder network and the decoder network of the masking neural network are based on the sequence of convolutional layers; and
the encoder network and the decoder network of the restoration neural network are based on the sequence of gating blocks.
14. The method of any of the preceding claims, wherein the input image quality is indicated by an input signal-to-noise ratio (SNR) of the input image and the output image quality is indicated by an output SNR of the output image, the output SNR being greater than the input SNR.
15. An electronic device, comprising:
one or more processors; and
a memory having instructions stored thereon that, when executed by the one or more processors, cause the processors to perform the method of any of claims 1-14.
16. A non-transitory computer-readable medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the method of any of claims 1-14.
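The sketch below shows one possible reading of the method of claims 1 and 2 in Python. It is an editorial illustration only: the camera capture call, the restoration network callable, and the mapping from a brightness dynamic range requirement (expressed here in EV stops) to an exposure-time sequence are all assumptions, not taken from the specification.

```python
import numpy as np

def choose_exposure_settings(required_range_ev, longest_exposure_s=1 / 30, stops_per_frame=2.0):
    """Derive an exposure-time sequence whose combined coverage spans the
    required brightness dynamic range (given here in EV stops)."""
    num_frames = max(2, int(np.ceil(required_range_ev / stops_per_frame)) + 1)
    # Longest exposure first; each later frame is shorter by `stops_per_frame` stops.
    return [longest_exposure_s / (2.0 ** (stops_per_frame * i)) for i in range(num_frames)]

def capture_and_restore(camera, restoration_network, required_range_ev):
    """capture(...) and restoration_network(...) are hypothetical interfaces."""
    exposures = choose_exposure_settings(required_range_ev)        # HDR setting
    frames = [camera.capture(exposure_time=t) for t in exposures]  # HDR image sequence
    # Exposure-normalized average as a simple stand-in for the merge step.
    linear = [f.astype(np.float32) / t for f, t in zip(frames, exposures)]
    input_image = np.mean(linear, axis=0)
    return restoration_network(input_image)                        # higher-quality output image
```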
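A sketch of the exposure pattern and staged merge described in claims 3-6: several frames at a long first exposure time followed by progressively shorter exposures, with the long-exposure frames aligned and merged first. The alignment function is a placeholder assumption; a real pipeline would use feature- or flow-based registration.

```python
import numpy as np

def build_exposure_sequence(first_exposure_s, num_long_frames, decreasing_exposures_s):
    """Repeat the long first exposure, then append strictly shorter exposure times."""
    assert all(t < first_exposure_s for t in decreasing_exposures_s)
    return [first_exposure_s] * num_long_frames + list(decreasing_exposures_s)

def align(reference, image):
    # Placeholder: assumes the frames are already registered to the reference.
    return image

def merge_sequence(frames, exposures, num_long_frames):
    """Average the aligned long-exposure frames into a first image, then merge that
    image with the shorter exposures in the exposure-normalized (linear) domain."""
    long_frames = [f.astype(np.float32) for f in frames[:num_long_frames]]
    first_image = np.mean([align(long_frames[0], f) for f in long_frames], axis=0)

    stack = [first_image / exposures[0]]
    for f, t in zip(frames[num_long_frames:], exposures[num_long_frames:]):
        stack.append(align(first_image, f.astype(np.float32)) / t)
    return np.mean(stack, axis=0)   # combined input image for the image processing network
```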
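The region-based combination of claims 7-9, again as an illustrative sketch: dark pixels come from the long-exposure subset, bright pixels from the short-exposure subset, and the remaining pixels from a weighted combination of the mid-exposure subset (claim 9 would extend the weighting over the whole sequence). The brightness thresholds and the uniform weights are assumptions.

```python
import numpy as np

def fuse_by_region(long_img, mid_imgs, short_img, dark_thresh=0.1, bright_thresh=0.9):
    """Fuse exposure subsets per region; images are float arrays in [0, 1] of shape (H, W, 3)."""
    mid_blend = np.mean(mid_imgs, axis=0)                 # uniform weights over the second subset
    luma = mid_blend.mean(axis=-1, keepdims=True)         # rough per-pixel brightness map
    dark = luma < dark_thresh
    bright = luma > bright_thresh

    out = np.where(dark, long_img, mid_blend)             # dark regions from the long exposure
    out = np.where(bright, short_img, out)                # bright regions from the short exposure
    return out                                            # remaining regions keep the mid blend
```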
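A compact PyTorch sketch of the two-network structure of claims 12 and 13: a convolutional U-Net that predicts a degradation mask, and a restoration U-Net built from gating blocks, each gating block running two parallel convolutions over the same gating input. Concatenating the mask with the input image before the restoration network is one plausible reading of "applied to the mask"; channel counts, depths, and the omission of skip connections are arbitrary simplifications.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Two parallel convolutions: one produces features, the other a sigmoid gate."""
    def __init__(self, channels):
        super().__init__()
        self.feature_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.feature_conv(x)) * torch.sigmoid(self.gate_conv(x))

class PlainBlock(nn.Module):
    """Ordinary convolutional block used by the masking network."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

class TinyUNet(nn.Module):
    """Minimal encoder-decoder stand-in for the claimed U-Nets (skip connections omitted)."""
    def __init__(self, in_ch, out_ch, block):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), block(32), nn.MaxPool2d(2))
        self.decoder = nn.Sequential(nn.Upsample(scale_factor=2), block(32), nn.Conv2d(32, out_ch, 3, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def restore(input_image):
    """input_image: float tensor of shape (N, 3, H, W) with even H and W."""
    masking_net = TinyUNet(3, 1, PlainBlock)        # encoder/decoder built from plain conv layers
    restoration_net = TinyUNet(4, 3, GatedBlock)    # encoder/decoder built from gating blocks
    mask = torch.sigmoid(masking_net(input_image))  # mask of degraded regions on the input image
    return restoration_net(torch.cat([input_image, mask], dim=1))
```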
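Claim 14 measures image quality by signal-to-noise ratio; a conventional definition against a clean reference image is sketched below as an assumption, since the claims do not specify how the SNR is computed.

```python
import numpy as np

def snr_db(image, reference):
    """SNR of `image` relative to a clean `reference`, in decibels."""
    ref = reference.astype(np.float64)
    noise = image.astype(np.float64) - ref
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))
```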
CN202280029102.9A 2021-05-04 2022-05-04 Image restoration for an under-screen camera Pending CN117280709A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163183827P 2021-05-04 2021-05-04
US63/183,827 2021-05-04
PCT/US2022/027683 WO2022235809A1 (en) 2021-05-04 2022-05-04 Image restoration for under-display cameras

Publications (1)

Publication Number Publication Date
CN117280709A (en) 2023-12-22

Family

ID=83932522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280029102.9A Pending CN117280709A (en) 2021-05-04 2022-05-04 Image restoration for an under-screen camera

Country Status (2)

Country Link
CN (1) CN117280709A (en)
WO (2) WO2022235785A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116614637B (en) * 2023-07-19 2023-09-12 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8351720B2 (en) * 2008-04-24 2013-01-08 Hewlett-Packard Development Company, L.P. Method and system providing edge enhanced image binarization
JP2015033107A (en) * 2013-08-07 2015-02-16 ソニー株式会社 Image processing apparatus, image processing method, and electronic apparatus
WO2015124212A1 (en) * 2014-02-24 2015-08-27 Huawei Technologies Co., Ltd. System and method for processing input images before generating a high dynamic range image

Also Published As

Publication number Publication date
WO2022235809A1 (en) 2022-11-10
WO2022235785A1 (en) 2022-11-10

Similar Documents

Publication Publication Date Title
WO2020192483A1 (en) Image display method and device
CN109636754B (en) Extremely-low-illumination image enhancement method based on generation countermeasure network
WO2022042049A1 (en) Image fusion method, and training method and apparatus for image fusion model
US20200234414A1 (en) Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
CN113454981A (en) Techniques for multi-exposure fusion of multiple image frames based on convolutional neural network and for deblurring multiple image frames
US20230080693A1 (en) Image processing method, electronic device and readable storage medium
CN108364270B (en) Color reduction method and device for color cast image
CN113034384A (en) Video processing method, video processing device, electronic equipment and storage medium
KR102628898B1 (en) Method of processing image based on artificial intelligence and image processing device performing the same
CN116438804A (en) Frame processing and/or capturing instruction systems and techniques
US20230260092A1 (en) Dehazing using localized auto white balance
CN108574803B (en) Image selection method and device, storage medium and electronic equipment
CN112927162A (en) Low-illumination image oriented enhancement method and system
CN117280709A (en) Image restoration for an under-screen camera
US20230267587A1 (en) Tuning color image fusion towards original input color with adjustable details
WO2021024860A1 (en) Information processing device, information processing method, and program
CN113643202A (en) Low-light-level image enhancement method based on noise attention map guidance
WO2023229644A1 (en) Real-time video super-resolution for mobile devices
WO2023229590A1 (en) Deep learning based video super-resolution
WO2023229589A1 (en) Real-time video super-resolution for mobile devices
WO2023229591A1 (en) Real scene super-resolution with raw images for mobile devices
CN117994161B (en) RAW format weak light image enhancement method and device
WO2023177388A1 (en) Methods and systems for low light video enhancement
CN116862801A (en) Image processing method, device, electronic equipment and storage medium
WO2023167682A1 (en) Image processing with encoder-decoder networks having skip connections

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination