WO2023063979A1

WO2023063979A1 - Apparatus and method of automatic white balancing

Info

Publication number: WO2023063979A1
Application number: PCT/US2022/015648
Authority: WO
Inventors: Yi Fan; Hsilin Huang
Original assignee: Zeku, Inc.
Priority date: 2021-10-15
Filing date: 2022-02-08
Publication date: 2023-04-20

Abstract

Apparatuses and methods of automatic white balancing (AWB) are provided. The apparatus may include a processor and memory coupled with the processor and storing instructions. The processor is configured to generate a first candidate illuminant vector based on color statistical features of one frame. A second candidate illuminant vector corresponding to the frame may be generated based on spatial features. Temporal features may be extracted based on static features corresponding to M frames of the plurality of frames, consecutively captured from time t-(M-1) to time t, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame. A final illuminant vector corresponding to the frame may be generated based on the output feature vector corresponding to the frame. The frame may be rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

Description

APPARATUS AND METHOD OF AUTOMATIC WHITE BALANCING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims the benefit of priority to U.S. Provisional Application No. 63/256,366, entitled “METHOD AND SYSTEM OF AUTOMATIC WHITE BALANCING,” filed on October 15, 2021, the content of which is incorporated by reference herein in its entirety.

BACKGROUND

[0002] Embodiments of the present disclosure relate to apparatuses and methods of automatic white balancing (AWB).

[0003] An image/video capturing device, such as a camera or a camera array, can be used to capture an image/video or a picture of a scene. Cameras or camera arrays have been included in many handheld devices, especially since the advent of social media that allows users to upload pictures and videos of themselves, friends, family, pets, or landscapes on the internet with ease and in real-time. The lens, for example, may receive and focus light onto one or more image sensors that are configured to detect photons. When photons impinge on the image sensor, an image signal corresponding to the scene is generated and sent to an image signal processor (ISP). The ISP performs various operations associated with the image signal to generate one or more processed images of the scene that can then be outputted to a user, stored in memory, or outputted to the cloud.

SUMMARY

[0004] Embodiments of method and apparatus of automatic white balancing (AWB) are disclosed in the present disclosure.

[0005] According to one aspect of the present disclosure, an apparatus of AWB is provided. The apparatus may include a color constancy (CC) algorithm module, a convolutional neural network (CNN) module, and a prediction module. The CC algorithm module may be configured to generate a first candidate illuminant vector, corresponding to one of one or more frames, based on color statistical features of the frame. The CNN module may be configured to extract spatial features based on the frame and generate a second candidate illuminant vector corresponding to the frame based on the spatial features. The prediction module may be configured to estimate a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate illuminant vector. The frame may be rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

[0006] According to another aspect of the present disclosure, another apparatus of AWB is provided. The apparatus may include a processor and memory coupled with the processor and storing instructions. When executed by the processor, the instructions cause the processor to generate a first candidate illuminant vector, corresponding to one of a plurality of frames, based on color statistical features of the frame. Spatial features may be extracted based on the frame, and a second candidate illuminant vector corresponding to the frame may be generated based on the spatial features. Temporal features may be extracted based on static features corresponding to M frames of the plurality of frames, consecutively captured from time to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame. AT may be an integer equal to or greater than 2, and the static features of the frame captured at time t may include the first and second candidate illuminant vectors. A final illuminant vector corresponding to the frame may be generated based on the output feature vector corresponding to the frame. The frame may be rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

[0007] According to still another aspect of the present disclosure, a method of AWB is disclosed. The method may include processing one of one or more input frames based on color statistical features to obtain a first candidate illuminant vector corresponding to the frame; extracting spatial features based on the frame to obtain a second candidate illuminant vector corresponding to the frame; and estimating a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate luminant vector. The frame may be rectified to form a rectified frames based on the final illuminant vector corresponding to the frame.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate some embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure. Modifications and variations based on the drawings by a person skilled in the pertinent art without extra work may still fall into the scope of the present disclosure.

[0009] FIG. 1 illustrates an exemplary block diagram of a system having an image signal processor pipeline (ISP), according to some embodiments of the present disclosure.

[0010] FIG. 2 illustrates an exemplary block diagram of the ISP pipeline depicted in the system of FIG. 1, according to some embodiments of the present disclosure.

[0011] FIG. 3 illustrates an exemplary block diagram of a network architecture of a convolution neural network (CNN), according to some embodiments of the present disclosure. [0012] FIG. 4 illustrates an exemplary block diagram of a network architecture of a recurrent neural network (RNN), according to some embodiments of the present disclosure. [0013] FIG. 5 illustrates an exemplary diagram of a network architecture of a prediction module, according to some embodiments of the present disclosure.

[0014] FIG. 6 illustrates a flowchart of an exemplary method of automatic white balancing, according to some embodiments of the present disclosure.

[0015] Embodiments of the present disclosure will be described with reference to the accompanying drawings.

DETAILED DESCRIPTION

[0016] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications. [0017] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

[0018] In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures, or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

[0019] Various aspects of method and apparatus will now be described. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.

[0020] For ease of nomenclature, the term “camera” is used herein to refer to an image capture device or other data acquisition device. Such a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light field data. Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art. One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the present disclosure is not limited to cameras. Thus, the use of the term “camera” herein is intended to be illustrative and exemplary but should not be considered to limit the scope of the present disclosure. Specifically, any use of such a term herein should be considered to refer to any suitable data acquisition device.

[0021] In addition, the term “frame,” as used herein, may be defined as a data entity (stored, for example, in a file) containing a description of a state corresponding to a single captured sensor exposure in a camera. This state includes the sensor image, and other relevant camera parameters, specified as metadata. The sensor image may be either a raw image or a compressed representation of the raw image. The terms “frame” and “image” may be used interchangeably in the description below. The term “color constancy,” “color balancing,” or “automatic white balancing” may refer to skills, operations, or approaches to balance the color temperatures in an image/frame, and these terms may be used interchangeably in the description below.

[0022] Image enhancement is one of the most important computer vision applications. Image enhancement units are generally deployed in all cameras, such as mobile phones or digital cameras. Image enhancement is a challenging subject since its implementation consists of various units that perform various operations. The units may include, e.g., a super resolution unit, a denoising unit, an automatic white balancing (AWB) unit, a color balancing unit, and the like. Recently, deep learning neural networks have been widely deployed in image enhancement as they demonstrate significant accuracy improvements.

[0023] Light varies at different times of the day and under different illumination conditions, a scene may have different appearances. Accordingly, color temperature is introduced as a metric to describe the light appearances. The human visual system (a collaboration of human eyes and human brain) possesses an ability to perceive colors as relatively constant, despite the color temperatures. This feature of the human visual system is known as “color balancing” or “color constancy.” Cameras, unfortunately, cannot perform such a sophisticated function as humans do. In order to compensate for or remove color casts due to the color temperatures of the ambient light, a function of automatic white balancing (AWB) may be implemented to a digital camera to ensure that the camera can emulate the human visual system and arrive at “color constancy” or “color balancing” as humans do.

[0024] White balancing is a typical adjustment for the color constancy, which is adopted by the camera to balance the color temperatures in an image based on white. In operations, white balancing adds an opposite color to the image in an attempt to bring the color temperature back to neutral. Instead of whites appearing blue or orange in the image, the whites should appear white. There may be still, however, illumination effects on other colors than white. Multi-color balancing is thus introduced for improving the performance of white balancing by mapping target colors to corresponding ground truth colors.

[0025] An AWB unit is a mechanism employed in digital cameras to estimate scene illuminants and perform chromatic adaptations. Based on the estimated illuminants, color intensities of image frames can be adjusted. Consequently, without the user intervention, photographs or videos captured by the camera can appear natural in color and match what is perceived by the human visual system. The AWB unit may be implemented as a module in an image signal processor (ISP) of the digital camera. In some cases, the terms “AWB,” “color balancing,” and “color constancy” may imply that the colors are balanced based on white or multiple target colors, but in the present disclosure, “AWB,” “color balancing,” and “color constancy” may be used interchangeably to refer to operations of balancing the color temperatures.

[0026] The AWB methods for processing a single frame may be mainly categorized into different groups according to the computations they apply, such as static color constancy and learning-based color constancy. The static single-frame color constancy adopts the skills of low-level color statistics and assumptions, while the learning-based methods are configured to extract image features and train the system to learn to predict the illuminants based on the features. Among the learning-based methods, deep learning (DL) is widely used in view of its accuracy.

[0027] On the other hand, fewer methods for processing multiple frames (to achieve temporal color constancy) are introduced. One straightforward solution is to apply the singleframe methods to process each frame in a video independently and combine the processing results. In the operations, however, each video frame is treated independently, and correlation nature of the video frames is not considered. As a result, the rectified video frames cannot provide a satisfactory quality. As videos have become the most popular medium in the current cyber world, other approaches are proposed to address the issues.

[0028] In one approach, termed as temporal color constancy network (TCC-Net), a two-branch neural network architecture is adopted to perform illuminant estimation with respect to a frame sequence. Each branch in the network architecture consists of a convolutional neural network (CNN) backbone followed by a two-dimensional long-short term memory (2D LSTM) that performs 2D convolution operations. The first branch is configured to process an actual input sequence, while the second branch is to use a pseudo zoom-in procedure for simulating how a user can move a camera and thus impact frame shots to generate a simulated input sequence as an additional input sequence. These inputs, including the actual input sequence and the simulated input sequence, are respectively fed into each branch of the CNNs, the processing of which extracts and outputs 512-channel semantic features in each branch. The features are recursively processed by the 2D LSTMs in the two branches for outputting 128-channel features in each branch. Eventually, the 128-channel features are concatenated channel-wise and processed by a 1x1 convolution filter so as to generate a spatial illuminant map. A global illuminant vector can be generated based on the spatial illuminant map.

[0029] The TCC-Net, however, shows some shortcomings. For example, the TCC-Net relies upon the complex network architecture, requiring, e.g., the additional simulated input sequence and the two CNN-plus-2D LSTM branches, to improve the accuracy. Extra computation resources and memory storage, resulting from, e.g., the 2D convolution operations in the 2D LSTMs, may incur more costs and thus consume more power. Further, the TCC-Net is a system purely employing a learning-based neural network architecture. Therefore, for tackling different illumination scenes, a huge amount of training data is required to build a robust and reliable system. Once coping with an unknown or unfamiliar scenario, the system may not recognize it and may misinterpret it, which may lead the system to corruption. Furthermore, for a resource-constrained system, such as a mobile phone, the implementation of the TCC-Net may not appear suitable and feasible.

[0030] In view of the above and other issues, some embodiments of the present disclosure provide an AWB architecture that includes multiple modules, for example, a color constancy (CC) algorithm module, a CNN module, a recurrent neural network (RNN) module, and a prediction module. The AWB architecture may fuse the CNN module and the CC algorithm module to obtain a hybrid approach to complement each other. An input frame may be parallelly fed into and processed by the CC algorithm module and the CNN module, respectively, to generate a concatenated illuminant vector. The concatenated illuminant vector may be recursively (or recurrently) processed by the RNN module to output a feature illuminant vector. Through the prediction module, the feature illuminant vector may be transformed into a final illuminant vector. The final illuminant vector can be used to compensate for a current frame to produce a rectified frame, thereby outputting an image or a video with color constancy or color balancing. In another aspect, some embodiments of the present disclosure provide an AWB method implementing the AWB architecture.

[0031] The proposed solution of the present disclosure combines the advantages of the CC algorithm, the CNN, and the RNN to improve the robustness of the overall system. The CC algorithm in the present disclosure may refer to an algorithm that processes an input frame based on color statistics of pixels of the input frame. The CC algorithm is generally applicable and thus can provide a fair baseline. It uses fewer parameters, thereby having computation efficiency and easily being implemented. In addition, the CC algorithm can generate a fair prediction result. On the other hand, through training data, the CNN shows the strong ability to extract visual sematic features, from the input frame, which are beneficial for accurate predictions. The combination of the CC algorithm and the CNN can thus complement each other. The inclusion of the RNN can carry information and contexts in regard to the time domain. The RNN extracts temporal features with respect to previous frames and brings benefits regarding the correlation nature of the input frames to the predictions. Consistent with the present disclosure, it becomes unnecessary for the input frame to be pre-processed, as in the TCC-Net, to obtain the additional input sequence for the second branch. Additional details of the AWB architecture and method are provided below in connection with FIGs. 1-5.

[0032] FIG. 1 illustrates an exemplary block diagram of a system 100 having an image signal processor pipeline (ISP pipeline), according to some embodiments of the present disclosure. The ISP pipeline may also be referred to as an ISP, and the present disclosure may use “ISP pipeline” and “ISP” interchangeably. In some embodiments, system 100 may include an application processor (AP) 102, an ISP 104, a memory 106, and input-output devices 108. ISP 104 may include a controller 1042, an imaging sensor 1044, an automatic white balancing (AWB) unit 1046, a video generation unit 1048, and a local memory 1050. Input-output devices 108 may include user input devices 1082, display and audio devices 1084, and wireless communication devices 1086. In some embodiments, system 100 may be a device with an imaging capturing function, such as a smartphone or digital camera. The present disclosure, however, does not limit an application of the disclosed AWB architecture and method herein specifically to a smartphone or digital camera. The AWB proposed in the present disclosure may also be implemented to other apparatuses for other applications, such as image segmentation and classification.

[0033] In some embodiments, AP 102 may be a main application processor of system 100 and may host the operating system (OS) of system 100 and all the applications. AP 102 may be any kind of one or more general-purpose processors such as a microprocessor, a microcontroller, a digital signal processor, or a central processing unit, and other needed integrated circuits such as glue logic. The term “processor” may refer to a device having one or more processing units or elements, e.g., a central processing unit (CPU) with multiple processing cores. AP 102 may be used to control the operations of system 100 by executing instructions stored in memory 106, which can be in the same chip as AP 102 or in a separate chip from AP 102. AP 102 may also be configured to generate control signals and transmit the control signals to various portions of system 100 to control and monitor the operations of these portions. In some embodiments, AP 102 can run the OS of system 100, control the communications between a user and system 100, and control the operations of various applications. For example, AP 102 may be coupled with a communications circuitry and execute software to control the wireless communications functionality of system 100. In another example, AP 102 may be coupled to memory 106, ISP 104, and input-output devices 108 to control the processing and display of sensor data, e.g., image data or video data.

[0034] In some embodiments, ISP 104 may include software and/or hardware operatively coupled with AP 102, memory 106, and input-output devices 108. In certain instances, ISP 104 may include an image processing hardware, such as controller 1042, configured to couple (e.g., placed between) AP 102 with at least one of imaging sensor 1044, AWB unit 1046, video generation unit 1048, or local memory 1050. AP 102 may transmit control signals and/or other data to ISP 104 via, e.g., an internal bus to control the operations of ISP 104. Controller 1042 of ISP 104 may include a suitable circuitry that, when controlled by AP 102, performs functions not supported by AP 102, e.g., processing raw image data, extracting frame features, estimating illuminants or motion parameters, rectifying frames, performing feature matching, object identification, aligning frames, video generation, etc. In some embodiments, components, e.g., circuitry, of ISP 104 may be integrated on a single chip. In certain embodiments, controller 1042 of ISP 104 may include a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a processor, a microprocessor, a microcontroller, a digital signal processor, and other needed integrated circuits for its purposes.

[0035] Memory 106 in system 100 may be memory external to ISP 104. By contrast, local memory 1050 may be internal memory in ISP 104. Memory 106 may include randomaccess memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, solid state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by AP 102. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Local memory 1050 may include static random-access memory (SRAM), dynamic random-access memory (DRAM), a cache, and registers.

[0036] FIG. 1 illustrates exemplary blocks, modules, and units in system 100 that implements the AWB architecture proposed by the present disclosure. It can be understood that FIG. 1 is given merely for an illustrative purpose, but not in an intention to limit the present disclosure. In other embodiments, system 100 may include other blocks, modules, and/or units configured to perform other functions. For example, ISP 104 may include a motion estimation unit configured to process data captured by an inertial measurement unit (IMU) of input-output devices 108 to estimate motion parameters associated with images captured by imaging sensor 1044.

[0037] FIG. 2 illustrates an exemplary block diagram of the ISP 104 of FIG. 1, according to some embodiments of the present disclosure. As illustrated in FIG. 2, ISP 104 may include controller 1042, imaging sensor 1044, AWB unit 1046, video generation unit 1048, and local memory 1050, operatively coupled with one another. Each of controller 1042, imaging sensor 1044, AWB unit 1046, video generation unit 1048, and local memory 1050 may include suitable software and/or hardware configured to perform the functions of ISP 104. For example, AWB unit 1046 may be operatively coupled with imaging sensor 1044, video generation unit 1048, and local memory 1050 of ISP 104 and controlled, by controller 1042, to perform the color constancy functions disclosed in the present disclosure. Controller 1042 herein may refer to a device having one or more processing units configured to process image- related operations, including an image processing unit, an image processing engine, an image signal processor, and a digital signal processor.

[0038] In some embodiments of the present disclosure, imaging sensor 1044 may include one or more digital cameras that are configured to capture a static image or a video consisting of a plurality of frames from different angles, positions, or perspectives. One or more frames may be captured using at least one of imaging sensor 1044 or one or more external cameras (not shown). The external cameras may be part of system 100 but located external to ISP 104 or external to system 100. ISP 104 may store the image or the plurality of frames in local memory 1050 and/or send them for storage in memory 106 of system 100. The frame(s) captured by external camera(s) may be sent to AP 102, ISP 104, memory 106, or local memory 1050 and be stored and processed with those captured using imaging sensor 1044 together.

[0039] Controller 1042 and/or AP 102 may activate AWB unit 1046 to process one or more frames to achieve the color balancing as disclosed in the present disclosure. In some embodiments of the present disclosure, AWB unit 1046 may include a color constancy (CC) algorithm module 202, a CNN module 204, an RNN module 206, and a prediction module 208, as depicted in FIG. 2. Consistent with the present disclosure, AWB unit 1046 may combine CC algorithm module 202 and CNN module 204 as so to complement each other. An input frame or each of an input frame sequence may be parallelly fed into and processed by CC algorithm module 202 and CNN module 204, respectively, to generate a concatenated illuminant vector. The term “input frame sequence” or “multiple frames” as used in the present disclosure may refer to a set of AT consecutive frames captured by an imaging sensor at consecutive time t-(M- 7), ... , t-1, and /, where A7 is an integer equal to or greater than 2. Meanwhile, the term

“parallelly” may refer to the same input, unlike the TCC-Net, being fed into CC algorithm module 202 and CNN module 204.

[0040] The concatenated illuminant vector may be recursively or recurrently processed by RNN module 206 and transformed into a feature illuminant vector through RNN module 206. Further, a final illuminant vector may be generated by prediction module 208. In some embodiments, AWB unit 1046 may further include rectification module 210 configured to process a current frame based on the final illuminant vector to generate a rectified frame, thereby outputting an image or a video with color balancing based on the rectified frame.

[0041] As described above, AWB unit 1046 may fuse CC algorithm module 202 and CNN module 204 to complement each other, as shown in FIG. 2. Each frame of the input frame sequence may be parallelly fed into and processed by CC algorithm module 202 and CNN module 204, respectively, to generate the concatenated illuminant vector. Through this manner, the pre-processing of the actual input sequence to obtain the additional input sequence for the second branch in TCC-Net can be accordingly avoided.

[0042] In some embodiments, the frame or the input frame sequence may be provided through an upstream module in the ISP pipeline. The frames may not be color balanced. The proposed solution of the present disclosure may estimate and output the final illuminant vector having a length of 3, each corresponding to one primary color channel: red, green, and blue. One or more downstream modules in the ISP pipeline may utilize the final illuminant vector to rectify a current frame to arrive at color balancing.

[0043] CC algorithm module 202 may apply a non-neural-network-based color constancy algorithm (i.e., non-training CC algorithm or CC algorithm for short in the below description) that is generally applicable to provide a baseline. The CC algorithm may use low- level color statistical features, for which CC algorithm module 202 can generalize the input scenarios fairly. Compared to CNN module 204, CC algorithm module 202 uses fewer parameters and thus has the computation efficiency. Therefore, CC algorithm module 202 can be easily implemented in hardware, software, firmware, or any combination thereof. In some embodiments, if implemented in software, the CC algorithm may be stored in or encoded as instructions or codes in local memory 1050 or memory 106. For example, controller 1042 may be configured to perform the AWB operations by executing the instructions stored in local memory 1050 or memory 106. In other embodiments, AP 102 may be configured to generate control signals and transmit the control signals to controller 1042 in AWB unit 1046 to initiate the AWB operations associated with AWB unit 1046.

[0044] In some embodiments, the CC algorithm may include Gray World algorithm. In principle, the Gray World algorithm assumes that an average reflectance of a scene with rich colors is achromatic, i.e., gray, and accordingly adjusts an average input pixel value to be gray. The Gray World algorithm may output a first candidate illuminant vector having a length of 3, respectively corresponding to red, green, and blue in three color-channels for subsequent processing. The averages of the three color-channels of Gray World algorithm can be expressed as:

^K n _ lyiv avg ~ _N >k=l ' k

and

where rk, gk, and bk are a red value, a green value, and a blue value respectively corresponding to a A¹¹¹ pixel of a frame, having N pixels, in a corresponding color.

[0045] Accordingly, the assumption of the Gray World algorithm, i.e., the average reflectance of a scene being achromatic, can be expressed as:

[0046] In some embodiments of the present disclosure, G_avg may be further used as a baseline for normalization, and R_avg, Gavg, and B_avg as computed from Equations (l)-(3) may be compared with respect to G_avg and used to estimate the first candidate illuminant vector, corresponding to the three color-channels, for each input frame as:

[0047] Equation (5) may represent the color statistical features of the input frame based on the Gray World algorithm. Meanwhile, in the present disclosure, the Gray World algorithm is merely an exemplary algorithm that can be implemented to CC algorithm module 202 to obtain the color statistical features. In other embodiments, CC algorithm module 202 may include another non-neural-network-based algorithm instead of the Gray World algorithm, such as White Patch, Shades-of-Gary, or Gray Edge.

[0048] Overall, based on the non-training strategy and low-level color statistics, the Gray World algorithm involves much fewer parameters and computations. Consequently, the Gray World algorithm can provide relatively simple implementation, reduce computation overhead, and produce fairly satisfactory results, thereby providing a baseline for generalizing input scenarios. In dealing with a scene dominated by a particular color (such as in shooting a blue sky, a green tree, and a red apple), however, the Gray World algorithm alone may not be sufficient. That is, the frames that capture the scene may be majorly occupied by objects having dominant colors. For example, a frame consisting of only green grass may be misinterpreted as gray by the Gray World algorithm. As a result, in a practical AWB system that intends to cover a wide spectrum of real scenarios, CC algorithm module 202 may need some compensation. Accordingly, the present disclosure provides some embodiments that add other modules, e.g., deep learning-based modules, into AWB unit 1046 to enhance the accuracy and performance of the system.

[0049] Consistent with some embodiments of the present disclosure, AWB unit 1046 may include CNN module 204 arranged in parallel with CC algorithm module 202 and configured to receive the same frame fed to CC algorithm module 202 for processing to produce a second candidate illuminant vector. In one instance, the second candidate illuminant vector may include a length of 3 corresponding to the three color-channels including red, green, and blue.

[0050] CNN module 204 may apply a CNN algorithm to implement an artificial neural network architecture configured to perform convolution operations to emphasize relevant image features and thus analyze visual imagery. That is, the CNN algorithm may be configured to distinguish meaningful features from an image. CNN module 204 may implement the CC algorithm and may be configured to obtain raw pixel data, train a model, and extract features from the pixels of the frame. Consequently, CNN module 204 can extract higher-level representations to classify the image from a higher perspective as a whole.

[0051] Although the data training may bring the computation complexity to CNN module 204, CNN module 204 is superior with the ability to extract spatial semantic features based on data-driven techniques, thereby arriving at more accurate predictions. For example, in the case of a green grass image as previously described, CNN module 204 may be trained to extract a semantic meaning of the captured frame to be grass and associate the grass with the color green. By contrast, for a case of input data unfamiliar to CNN module 204, however, CC algorithm module 202 can step in to provide a fair and interpretable baseline prediction to complement CNN module 204. Accordingly, the combination of CC algorithm module 202 and CNN module 204 in AWB unit 1046, as shown in FIG. 2, can complement each other in various scenarios.

[0052] In some embodiments, CNN module 204 may include a low-dimensional design space consisting of simple regular networks, named RegNetY where “Y” in “RegNetY” represents a version. The RegNetY network may correspond to a convolutional neural network (CNN) design space with a simple and regular model having parameters. It thus provides network designs that are suitable for a wide range of floating-point (FLOP) operation regimes. The network designs generated by the RegNetY design space are relatively simple, regular, and interpretable in its structure, which can be considered as its particular advantages. [0053] FIG. 3 illustrates an exemplary block diagram of a network architecture of a CNN including a RegNetY network, according to some embodiments of the present disclosure. For simplicity of illustration, FIG. 3 merely shows a high-level concept of the RegNetY network. The RegNetY network may include a stem 32, a body 34, and a head 36. Stem 32 (not shown) may include a stride-two 3x3 convolution with 3 input channels and 32 output channels. Head 36 (now shown) may include a 1x1 convolution with 3 output channels followed by average pooling. The term “convolution” herein may refer to a convolution matrix, a convolution filter, a mask, or a kernel applied to an image frame. Further, 3x3 or 1x1 denotes a size of the convolution matrix. Depending on the contents of an applied convolution matrix, a convolution may be performed with respect to image to blur, sharpen, emboss, edge detect, or cause other effects on the image. The amount of movement between two applications of the convolution matrix on the image is referred to as a stride. The average pooling may be used to calculate an average value for each portion (i.e., each patch) of the image in order to summarize features of the image.

[0054] As shown in FIG. 3, body 34 may include a plurality of stages 340. Each stage 340 may include a plurality of blocks 342. Each block 342 may include a 1x1 convolution 3422, a 3x3 group convolution 3424, a Squeeze-and-Excite block 3426, a final 1x1 convolution 3428, and an accumulator 3430, as depicted in FIG. 3. Squeeze-and-Excite block 3426 is based on squeeze-and-excitation and may be configured to perform dynamic channel-wise feature recalibration to improve the representative quality of a CNN. Each block 342 may be associated with parameters including a width w_;, bottleneck ratio b_h and group width g, for the group convolution, where i denotes an i^th block, and denotes the number of blocks. The first block in each stage may use a stride-two convolution, while the rest of the blocks may include a stride-one convolution. The RegNetY network may include one rectified linear unit (ReLU) (not shown) following each convolution. Considering that the input frame sequence may involve various scenes so as to generate data with huge variations, applying batch normalization thus may not appear suitable. Therefore, in some embodiments of the present disclosure, batch normalization may not be applied to the proposed RegNetY network.

[0055] In designing architecture for mobile video applications, a model size and computation efficiency may become key factors. In some embodiments, to further reduce the computation cost of CNN module 204, the input frame sequence (i.e., the original input frames) may be downsized. In one instance, the bilinear interpolation may be applied, before the input frame sequence is fed into CNN module 204, to obtain a smaller version of size 224 pixels in height, 224 pixels in width, and 3 channels in depth. The downsizing can reduce the computation resources required by CNN module 204. Based on the down-sampled inputs,

CNN module 204 can predict a second candidate illuminant vector, expressed as:

[0056] In order to target mobile regimes of computing resources, according to some embodiments, the RegNetY configuration may include a RegNetY-600MF (MF stands for Mega-FLOPs). In some embodiments, to meet the strict resource constraints of typical mobile ISP systems, the model size and computations of RegNetY-600MF in CNN module 204 may be reduced as RegNetY-200MF, or even a more lightweight model. In other embodiments, CNN module 204 may also include other configurations, such as residual neural network (ResNet), visual geometry group (VGG), or mobile neural network (MobileNet).

[0057] In some embodiments, AWB unit may further include a concatenation operator

212, as shown in FIG. 2. Through concatenation operator 212, the first candidate illuminator vector of length 3 from CC algorithm module 202 and the second candidate illuminant vector of length 3 from CNN module 204 may be concatenated to form a concatenated illuminant vector having a length of 6 and can be expressed as:

[0058] In some embodiments, AWB unit 1046 may further include RNN module 206, and the concatenated illuminant vector obtained from Equation (7) can be inputted into RNN module 206 for processing. RNN module 206 may be implemented to AWB unit 1046 to capture temporal image features between different frames. In applications for the static images, according to some embodiments, RNN module 206 may be omitted or a recurrent branch of RNN module 206 may be set zero to deactivate recurrent operations of RNN module 206.

[0059] Temporal color constancy (TCC) uses multiple temporal images (such as frames in a video) to perform the illuminant estimation. Compared to the single-frame color constancy schemes, the TCC takes additional temporal information inherent in the input sequence; therefore, the TCC is naturally suitable for processing videos. In some embodiments of the present disclosure, AWB unit 1046 may include RNN module 206 to capture temporal characteristics of the input frame sequence.

[0060] In contrast, CNN module 204 is used to extract the image features with respect to static data, while RNN module 206 is better suited to analyze the temporal characteristics for sequential data (such as videos). Moreover, CNN module 204 is a feedforward neural network, while RNN module 206 may be configured to feed results back into itself. A previous state can help RNN module 206 better predict a future state.

[0061] An RNN may include a hidden state h and an optional output that operates on a variable-length input sequence x. At each time step I, the hidden state of the RNN h_t is updated by a non-linear activation function.

[0062] In some embodiments of the present disclosure, RNN module 206 may include a one-dimensional gated recurrent unit (GRU). The term “one dimension” herein may refer to a one-dimensional input x to the GRU. Compared to the 2D LSTMs in the TCC-Net, the ID GRU significantly reduces the requirements for parameters and computations. Operations of the ID GRU may be defined by: r_t = a(W_irx_t + b_ir + W_hrh_t-1 + b_ftr) (8); z_t = a(W_izx_t + b_iz + W_hzh_t-1 + b_hz) (9); n_t = tanh(W_inx_t + b_in + r_t * (W_hnh_t-1 + b_hn)) (10); and h_t = (1 - z_t) * h_t-1 + z_t * n_t (11), where h_t is the hidden state at time /, x_t is the input sequence at time t. h_t-i is the hidden state at time t-1. Further, r_t, z_t, and n_t are a reset gate, an update gate, and a new gate at time /, respectively. W and b represent weights and biases, respectively, c represents the sigmoid activation function, * represents the Hadamard product, and tanh represents a hyperbolic tangent function. The sigmoid activation function may be used to transform an input into a value between 0.0 and 1.0.

[0063] FIG. 4 illustrates an exemplary block diagram of a network architecture of an RNN module including a ID GRU that implements Equations (8)-(l 1), according to some embodiments of the present disclosure. As shown in FIG. 4, the update gate z_t may be configured to determine a ratio of a new hidden state n_t that will contribute to the hidden state at time t (i.e., the gate ht, also, the output) with respect to a previous hidden state at time t-1 (i.e., the gate h_t-i) . z_tmay comprise a value of a range between [0.0, 1.0], On the other hand, the reset gate r_t may be configured to determine whether a previous hidden state at time t-1 (i.e., the gate h_t-i) can be ignored in calculating the new hidden state n>. In some embodiments, the ID GRU may be configured to receive the concatenated illuminant vector (i.e., xi) having a length of 6 from concatenation operator 212 for processing based on Equations (8)-(l 1). In one instance, the initial hidden state of the ID GRU may be set to zero. The ID GRU may include 32 hidden units for the recurrent branch and 32 output units.

[0064] Two feature vectors may be outputted from RNN module 206, each having a length of 32. One of the two feature vectors is an output feature vector of a length of 32, and the other is a hidden feature vector that is fed back into RNN module 206 as an additional input for processing in a recurrent manner. The output feature vector may be fed into prediction module 208 to generate the final illuminant vector to process the current frame. That is, the hidden states extracted by RNN module 206 at frame //may be used as an input for processing It+i in a recurrent manner. In some embodiments, RNN module 206 may be configured to generate the output feature vector, corresponding to the current frame, based on static features of the current frame and M-l preceding frames, where M is an integer equal to or greater than 2. The static features of the M-l frames may include M-l concatenated illuminant vectors that include corresponding first candidate illuminant vectors and corresponding second candidate illuminant vectors. Generally speaking, the static features may include the color statistical features from CC algorithm module 202 and the spatial features from CNN module 204.

[0065] As a result, the final illuminant vector, obtained based on the output feature vector, may inherently contain temporal information in regard to the M images and can be expressed as:

where function /(.) applies a frame It and preceding (A/-1) frames including

It-(M-2), ... , and It-i to estimate the final illuminant vector C_t = | l_r, l_g, l_b | E R³ for It.

[0066] In some embodiments, RNN module 206 may be coupled with local memory 1050 and configured to store previous states for analyzing temporal data. For example, to generate the output feature vector based on the M frames, the M-l previous states (such as the hidden feature vectors corresponding to M-l preceding frames), may be cached in local memory 1050. Alternatively, RNN module 206 may have a network architecture to generate a current hidden state configured to summarize M-l previous frames as a whole. In other embodiments, RNN module 206 may include other model configurations, such as vanilla RNN or long-short term memory (LSTM).

[0067] AWB unit may include prediction module 208 configured to predict the final illuminant vector based on the output feature vector from RNN module 206. In some embodiments, prediction module 208 may include a fully connected layer. The fully connected layer is feedforward neural networks configured to perform discriminative learning so as to learn non-linear weights of the features that can identify an object class. [0068] FIG. 5 illustrates an exemplary diagram of a network architecture of a prediction module including a fully connected layer, according to some embodiments of the present disclosure. Consistent with some embodiments, the fully connected layer may include an input layer 52 having 32 input units xl, x2, ..., x32, one or more hidden layers 54, and an output layer 56 having 3 output units ol, o2, and o3. The fully connected layer may be followed by a sigmoid activation function. The outputs from the output layer 56 to the sigmoid activation function may be transformed into a value between 0.0 and 1.0. FIG. 5 shows two of the one or more hidden layers 54 for exemplary purpose but is not used to limit a number of the hidden layers 54.

[0069] As described above, the output feature vector of the length of 32 from RNN module 206 may be fed into a respective input unit of the input layer 52 of prediction module 208, and prediction module 208 may be configured to process the output feature vector of the length of 32 to produce the final illuminant vector, each element of the final illuminant vector corresponding to a respective output unit of the output layer 56. The output unit may be configured to output elements of the final illuminant vector, corresponding to red, green, and blue.

[0070] In some embodiments, AWB unit 1046 may further include rectification module 210 configured to receive the final illuminant vector and the current frame and to compensate the current frame based on the final illuminant vector to produce a rectified frame. In the video applications, through video generation unit 1048 of ISP 104, a plurality of the rectified frames can be obtained and stacked in a timing order associated with the rectified frames to obtain a video with color constancy. In some embodiments, rectification module 210 may be implemented to AWB unit 1046 as shown in FIG. 2, while, in other embodiments, video generation unit 1048 may include a rectification module instead. In those cases, the final illuminant vector may be outputted from AWB unit 1046 and fed to video generation unit 1048 for processing to obtain the rectified frame(s).

[0071] In another aspect of the present disclosure, a method of automatic white balancing (AWB) is provided. FIG. 6 illustrates a flowchart of an exemplary method of automatic white balancing, according to some embodiments of the present disclosure.

[0072] As shown in FIG. 6, the method may proceed to S602 and S604. At S602, one frame or each frame of an input frame sequence may be processed to obtain a first candidate illuminant vector based on color statistical features. As described above, CC algorithm module 202 may apply a non-neural -network-based CC algorithm (or CC algorithm in short) that can provide a fair baseline. In some embodiments, the CC algorithm may include the Gray World algorithm. In principle, the Gray World algorithm assumes that an average reflectance of a scene with rich colors is achromatic, i.e., gray, and accordingly adjusts an average input pixel value to be gray. The Gray World algorithm may output the first candidate illuminant vector having length 3, respectively corresponding to red, green, and blue in three color-channels for later processing.

[0073] Overall, based on the non-training strategy and low-level color statistics, the Gray World algorithm involves much fewer parameters and computations. The Gray World algorithm can generally produce fairly satisfactory results and provide a relatively simple implementation, thereby providing a fair baseline for generalizing scenarios.

[0074] At S604, the same input frame(s) may be processed (e.g., by CNN module 204 as illustrated in FIG. 3) and extracted for spatial features to obtain a second candidate illuminant vector. CNN module 204 may apply CNN algorithms to implement artificial neural networks that perform convolution operations used to emphasize relevant features and thus analyze visual imagery. The CNN algorithms can distinguish meaningful features in an image. As a result, CNN module 204 can extract higher-level representations from the input frames to classify the image as a whole. CNN module 204 may be configured to obtain raw pixel data, train the model, and extract features from pixels of the frame. As a result, the combination of the first illuminant vector and the second illuminant vector can complement each other in various scenarios.

[0075] In some embodiments, CNN module 204 may include a RegNetY network. In order to target mobile regimes of computing resources, RegNetY configuration in a default model architecture may include a RegNet-600MF. In designing architecture for mobile video applications, a model size and computation efficiency can be key factors. Therefore, to further reduce the computation cost of CNN module 204, in application, the input sequence (the original input frames) may be downsized. In one instance, the bilinear interpolation may be applied before the input sequence is fed into CNN module 204 to obtain a smaller version of size 224 pixels in height, 224 pixels in width, and 3 channels in depth. This downsizing operation can reduce the computation resources required by CNN module 204.

[0076] In some embodiments, as shown in FIG. 6, the processes at S602 and S604 may be performed in parallel such that the first candidate illuminant vector and the second candidate illuminant vector may be obtained substantially at the same time for later concatenation to reduce the computation latency.

[0077] The method may proceed to S606. In some embodiments, the first candidate illuminant vector and the second candidate illuminant vector may be concatenated to form a concatenated illuminant vector. Through concatenation operator 212 shown in FIG. 2, the first candidate illuminator vector of length 3 from CC algorithm module 202 and the second candidate illuminant vector of length 3 from CNN module 204 may be concatenated (or stacked) to form a vector of length 6.

[0078] The method may proceed to S608. In some embodiments, the concatenated illuminant vector may be processed by RNN module as illustrated in FIG. 4 to obtain an output feature vector and a hidden feature vector. RNN module 206 may be configured to capture temporal features between different frames. In some embodiments, RNN module 206 may include a one-dimensional gated recurrent unit (GRU). In some embodiments, the ID GRU may receive the concatenated illuminant vector of length 6 for processing based on Equations (8)-(l 1) as listed above and output two feature vectors, each having a length of 32. One of the two feature vectors is the output feature vector of a length of 32, and the other is the hidden feature vector that is fed back into the ID GRU as an additional input in a recurrent manner for processing the next frame to obtain a next output feature vector.

[0079] The method may proceed to S610. The output feature vector may be processed (e.g., by prediction module 208 as shown in FIG. 5) to obtain a final illuminant vector. Prediction module 208 may be configured to predict the final illuminant vector based on the output feature vector from RNN module 206. In some embodiments, prediction module 208 may include a fully connected layer followed by a sigmoid activation function. As described above, the output feature vector of length 32 from RNN module 206 may be fed into a respective input unit of prediction module 208, and prediction module 208 may process the output feature vector to output the final illuminant vector, each element of the final illuminant vector corresponding to a respective output unit.

[0080] The method may proceed to S612. A current frame may be rectified based on the final illuminant vector obtained from prediction module 208. In some embodiments, rectification module 210 may be configured to obtain the final illuminant vector and the current frame and to compensate the current frame based on the final illuminant vector to produce a rectified frame. For the video applications, a plurality of the rectified frames may be processed and stacked in a timing order to obtain a video of color balancing.

[0081] The present disclosure provides an AWB architecture that includes multiple modules including CNN module 204 and CC algorithm module 202. The AWB architecture may fuse CNN module 204 and CC algorithm module 202 to obtain a hybrid approach and complement each other. The input frame may be parallelly fed into and processed by CNN module 204 and CC algorithm module 202, respectively, to generate a concatenated illuminant vector. In some embodiments, the concatenated illuminant vector may be recurrently processed by RNN module 206 to output the feature vectors. Through prediction module 208, the feature vectors may be transformed into the final illuminant vector. The final illuminant vector can be used to compensate for the current frame to produce the rectified frame, thereby outputting an image or a video with color balancing.

[0082] The proposed solution of the present disclosure combines the advantages of the CC algorithm, the CNN, and the RNN to improve the robustness of the overall system. The CC algorithm is generally applicable and thus can provide a fair baseline. It also uses fewer parameters and thus has computation efficiency. In addition, the CC algorithm is easily being implemented and generates a fair prediction result. On the other hand, through training data, the CNN shows strong the ability to extract visual sematic features that are beneficial for accurate predictions. The combination of CC algorithm module 202 and CNN module 204 can have the benefit of complementing each other. In some embodiments, the inclusion of RNN module 206 can bring temporal information and contexts. RNN module 206 extracts the features from the previous frames to assist in the prediction of the final illuminant vector. Meanwhile, it turns out to be unnecessary for the input frame to be pre-processed, as in the TCC-Net, for generating an additional input sequence for the second branch.

[0083] In evaluating the proposed approaches, TCC benchmark dataset is used, including 600 real-world sequences in the dataset, being captured through a mobile phone camera with a 1824x1368 resolution. The length of these sequences ranges from 3 to 17 frames. Ground truth illuminant vectors are collected using a gray surface calibration object placed in the scene. The training process is constructed to minimize an angular error loss function, e, which is defined as: = acos

where E is the predicted illuminant vector (or the final illuminant vector), and E_gt is the ground truth illuminant vector. “ .” denotes an inner product, and || || denotes the Euclidean norm. Equation (13) implies that the smaller the error, the more accurate a system is when estimating the correct illuminant vector for a scene.

[0084] The angular error as defined in Equation (13) can be used as an accuracy benchmark for evaluating the proposed approaches. Meanwhile, the number of multiply- accumulate (MAC) operations per frame and the total number of model parameters are also used to evaluate computation cost, complexity, and/or device memory usage of the proposed approaches. The MAC operations compute a product of two numbers and add the product to an accumulator. The number of MACs per frame may be used to quantify computation cost. While the total number of model parameters in a neural network dictates what size of memory is required for storage of the network, the MACs can be used as a proxy to estimate the computation complexity of a neural network. The errors between the proposed approaches and the ground truths may be minimized using a gradient descent algorithm.

[0085] Table 1 compares the quantitative measures of the proposed approach of the present disclosure and the TCC-Net. The proposed approach is based on a RegNetY-600MF configuration in CNN module and ID GRU in RNN module. The angular error is computed per sequence, and then averaged over a total of 200 sequences of the test images provided by the benchmark as described above. Assuming a standard input frame size of 224 pixels in height, 224 pixels in width, and 3 channels in depth, the number of MAC operations are counted for each system. The number of parameters is also counted for each system for reference.

Table 1. Quantitative Measures Comparison between the TCC-Net and the Approach Disclosed Herein

[0086] As shown in Table 1, the proposed approach achieves 11% smaller of the mean angular error, thus more accurate than the TCC-Net in estimating the correct illuminants. The proposed approach uses 5 times fewer of the MAC operations and 3 times smaller of the model parameters. Fewer MAC operations imply that the proposed approach may use less power, and fewer model parameters imply that the solution disclosed herein may occupy less device memory. The power efficiency and compactness make the proposed approach more attractive for mobile devices and applications.

[0087] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non- transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as system 100 in FIG. 1. By way of example, and not limitation, such computer-readable media can include random-access memory (RAM), read-only memory (ROM), EEPROM, compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, solid state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0088] According to one aspect of the present disclosure, a method of AWB is provided. The apparatus of AWB may include a color constancy (CC) algorithm module, a convolutional neural network (CNN) module, and a prediction module. The CC algorithm module may be configured to generate a first candidate illuminant vector, corresponding to one of one or more frames, based on color statistical features of the frame. The CNN module may be configured to extract spatial features based on the frame and generate a second candidate illuminant vector corresponding to the frame based on the spatial features. The prediction module may be configured to estimate a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate illuminant vector. The frame may be rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

[0089] In some embodiments, the one or more frames may include a plurality of frames. The apparatus may further include a recurrent neural network (RNN) module configured to extract temporal features based on static features corresponding to AT frames of the plurality of frames, consecutively captured from time

to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame. AT may be an integer equal to or greater than 2, and the static features of the frame captured at time t may include the first and second illuminant vectors. In some embodiments, the prediction module may be configured to generate the final illuminant vector corresponding to the frame based on the output feature vector corresponding to the frame.

[0090] In some embodiments, the RNN module may be further configured to receive the hidden feature vector associated with time /, and generate a hidden feature vector associated with time t+1 and an output feature vector corresponding to a next frame, captured at time t+1, based on the hidden feature vector associated with time t.

[0091] In some embodiments, the prediction module may include a fully connected layer, and the fully connected layer may include an output layer. The output layer may be configured for outputting three illuminant elements, respectively corresponding to red, green, and blue. The final illuminant vector may include the three illuminant elements.

[0092] In some embodiments, the fully connected layer may further include an input layer coupled with the RNN module and configured to receive the output feature vector to generate, at the output layer, the final illuminant vector of a length of 3 based on the output feature vector.

[0093] In some embodiments, the RNN module may include a one-dimensional gated recurrent unit (GRU) configured to receive the first candidate illuminant vector and the second candidate illuminant vector to generate the hidden feature vector associated with time t and the output feature vector corresponding to the frame.

[0094] In some embodiments, the CC algorithm module may be configured to calculate a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features, and set the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

[0095] In some embodiments, the apparatus may further include a concatenation operator configured to concatenate the first candidate illuminant vector and the second candidate illuminant vector to form a concatenated illuminant vector.

[0096] In some embodiments, each of the first candidate illuminant vector and the second candidate illuminant vector may include a length of 3 corresponding to three colorchannels of red, green, and blue. The concatenated illuminant vector may include a length of 6.

[0097] In some embodiments, the apparatus my further include a rectification module configured to rectify the frame based on the final illuminant vector to form the rectified frame. [0098] In some embodiments, the CNN module may include a RegNetY network. The RegNetY network may include a body that may have a plurality of stages, and each of the plurality of stages may include a plurality of blocks. Each of the plurality of blocks may include a 1x1 convolution, a 3x3 group convolution, a Squeeze-and-Excite block, and a 1x1 convolution in series.

[0099] In some embodiments, the frame may be downsized with respect to heights and widths of the frame to form a downsized frame. The CNN module may be further configured to process the downsized frame based on color statistical features of the downsized frame to generate the second candidate illuminant vector corresponding to the frame.

[0100] According to another aspect of the present disclosure, another apparatus of AWB is provided. The apparatus may include a processor and memory coupled with the processor and storing instructions. When executed by the processor, the instructions cause the processor to generate a first candidate illuminant vector, corresponding to one of a plurality of frames, based on color statistical features of the frame. Spatial features may be extracted based on the frame, and a second candidate illuminant vector corresponding to the frame may be generated based on the spatial features. Temporal features may be extracted based on static features corresponding to M frames of the plurality of frames, consecutively captured from time to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame. AT may be an integer equal to or greater than 2, and the static features of the frame captured at time t may include the first and second candidate illuminant vectors. A final illuminant vector corresponding to the frame may be generated based on the output feature vector corresponding to the frame. The frame may be rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

[0101] In some embodiments, the processor may be further configured to receive the hidden feature vector associated with time /, and generate a hidden feature vector associated with time t+1 and an output feature vector corresponding to a next frame, captured at time t+1, based on the hidden feature vector associated with time t.

[0102] In some embodiments, the processor may be further configured to calculate a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features, and set the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

[0103] In some embodiments, the processor may be further configured to downsize the frame with respect to heights and widths of the frame to form a downsized frame, and process the downsized frame based on the color statistical features to generate the second candidate illuminant vector corresponding to the frame.

[0104] According to still another aspect of the present disclosure, a method of AWB is disclosed. The method may include processing one of one or more frames based on color statistical features to obtain a first candidate illuminant vector corresponding to the frame; extracting spatial features based on the frame to obtain a second candidate illuminant vector corresponding to the frame; and estimating a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate illuminant vector. The frame may be rectified to form a rectified frames based on the final illuminant vector corresponding to the frame.

[0105] In some embodiments, the one or more frames may include a plurality of frames. the method further may include extracting temporal features based on static features corresponding to M frames of the plurality of frames, consecutively captured from time t-(M- 1) to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame. /W may an integer equal to or greater than 2, and the static features of the frame captured at time t may include the first and second illuminant vectors. The method may further include generating the final illuminant vector corresponding to the frame based on the output feature vector corresponding to the frame.

[0106] In some embodiments, the method may further include calculating a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features; and setting the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

[0107] In some embodiments, the method may further include downsizing the frame with respect to heights and widths of the frame to form a downsized frame; and processing the downsized frame to generate the second candidate illuminant vector corresponding to the frame.

[0108] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

[0109] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.

[0110] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.

[oni] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments but should be defined only in accordance with the following claims and their equivalents.

Claims

- 27 - CLAIMS WHAT IS CLAIMED IS

1. An apparatus of automatic white balancing (AWB), comprising: a processor; and memory coupled with the processor and storing instructions that, when executed by the processor, cause the processor to: generate a first candidate illuminant vector, corresponding to one of a plurality of frames, based on color statistical features of the frame; extract spatial features based on the frame and generate a second candidate illuminant vector corresponding to the frame based on the spatial features; extract temporal features based on static features corresponding to M frames of the plurality of frames, consecutively captured from time

to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame, M being an integer equal to or greater than 2, and the static features of the frame captured at time t comprising the first and second candidate illuminant vectors; and generate a final illuminant vector corresponding to the frame based on the output feature vector corresponding to the frame, wherein the frame is rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

2. The apparatus of claim 1, wherein the processor is further configured to: receive the hidden feature vector associated with time /; and generate a hidden feature vector associated with time t+1 and an output feature vector corresponding to a next frame, captured at time t+1, based on the hidden feature vector associated with time t.

3. The apparatus of claim 1, wherein the processor is further configured to: calculate a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features; and set the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

4. The apparatus of claim 1, wherein the processor is further configured to: downsize the frame with respect to heights and widths of the frame to form a downsized frame; and process the downsized frame based on the color statistical features to generate the second candidate illuminant vector corresponding to the frame.

5. A method of automatic white balancing, comprising: processing one of one or more frames based on color statistical features to obtain a first candidate illuminant vector corresponding to the frame; extracting spatial features based on the frame to obtain a second candidate illuminant vector corresponding to the frame; and estimating a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate illuminant vector, wherein the frame is rectified to form a rectified frames based on the final illuminant vector corresponding to the frame.

6. The method of claim 5, wherein: the one or more frames comprise a plurality of frames; and the method further comprises: extracting temporal features based on static features corresponding to M frames of the plurality of frames, consecutively captured from time

to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame, AT being an integer equal to or greater than 2, and the static features of the frame captured at time t comprising the first and second illuminant vectors; and generating the final illuminant vector corresponding to the frame based on the output feature vector corresponding to the frame.

7. The method of claim 5, further comprising: calculating a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features; and setting the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

8. The method of claim 5, further comprises: downsizing the frame with respect to heights and widths of the frame to form a downsized frame; and processing the downsized frame to generate the second candidate illuminant vector corresponding to the frame.

9. An apparatus of automatic white balancing (AWB), comprising: a color constancy (CC) algorithm module configured to generate a first candidate illuminant vector, corresponding to one of one or more frames, based on color statistical features of the frame; a convolutional neural network (CNN) module configured to extract spatial features based on the frame and generate a second candidate illuminant vector corresponding to the frame based on the spatial features; and a prediction module configured to estimate a final illuminant vector corresponding to the frame based on the first candidate illuminant vector and the second candidate illuminant vector, wherein the frame is rectified to form a rectified frame based on the final illuminant vector corresponding to the frame.

10. The apparatus of claim 9, wherein: the one or more frames comprise a plurality of frames; the apparatus further comprises a recurrent neural network (RNN) module configured to extract temporal features based on static features corresponding to AT frames of the plurality of frames, consecutively captured from time

to time /, to generate a hidden feature vector associated with time t and an output feature vector corresponding to the frame, AT being an integer equal to or greater than 2, and the static features of the frame captured at time t comprising the first and second illuminant vectors; and the prediction module is configured to generate the final illuminant vector corresponding to the frame based on the output feature vector corresponding to the frame.

11. The apparatus of claim 10, wherein the RNN module is further configured to: receive the hidden feature vector associated with time /; and generate a hidden feature vector associated with time t+1 and an output feature vector corresponding to a next frame, captured at time t+1, based on the hidden feature vector associated with time t.

12. The apparatus of claim 10, wherein: the prediction module comprises a fully connected layer; and the fully connected layer comprises an output layer, the output layer being configured for outputting three illuminant elements, respectively corresponding to red, green, and blue, based on the output feature vector, and the final illuminant vector comprising the three illuminant elements.

13. The apparatus of claim 12, wherein: the fully connected layer further comprises an input layer coupled with the RNN module and configured to receive the output feature vector to generate, at the output layer, the final illuminant vector of a length of 3 based on the output feature vector.

14. The apparatus of claim 10, wherein the RNN module comprises a onedimensional gated recurrent unit (GRU) configured to receive the first candidate illuminant vector and the second candidate illuminant vector to generate the hidden feature vector associated with time t and the output feature vector corresponding to the frame.

15. The apparatus of claim 9, wherein the CC algorithm module is configured to: calculate a red value, a green value, and a blue value respectively corresponding to pixels of the frame in a corresponding color to obtain the color statistical features; and set the red value, the green value, and the blue value to be equal to generate the first candidate illuminant vector corresponding to the frame.

16. The apparatus of claim 9, further comprising: a concatenation operator configured to concatenate the first candidate illuminant vector and the second candidate illuminant vector to form a concatenated illuminant vector.

17. The apparatus of claim 16, wherein: each of the first candidate illuminant vector and the second candidate illuminant vector comprises a length of 3 corresponding to three color-channels of red, green, and blue; and the concatenated illuminant vector comprises a length of 6.

18. The apparatus of claim 9, further comprising: - 31 - a rectification module configured to rectify the frame based on the final illuminant vector to form the rectified frame.

19. The apparatus of claim 9, wherein: the CNN module comprises a RegNetY network, the RegNetY network comprising a body comprising a plurality of stages, and each of the plurality of stages comprising a plurality of blocks; and each of the plurality of blocks comprises a 1x1 convolution, a 3x3 group convolution, a Squeeze-and-Excite block, and a 1x1 convolution in series.

20. The apparatus of claim 9, wherein: the frame is downsized with respect to heights and widths of the frame to form a downsized frame; and the CNN module is further configured to process the downsized frame based on color statistical features of the downsized frame to generate the second candidate illuminant vector corresponding to the frame.