CN114503541A - Apparatus and method for efficient regularized image alignment for multi-frame fusion - Google Patents

Apparatus and method for efficient regularized image alignment for multi-frame fusion

Info

Publication number
CN114503541A
CN114503541A (application CN202080054064.3A)
Authority
CN
China
Prior art keywords
reference image
motion vector
image
electronic device
tiles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080054064.3A
Other languages
Chinese (zh)
Inventor
Ruiwen Zhen
J. W. Glotzbach
H. R. Sheikh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 16/727,751 (US11151731B2)
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN114503541A publication Critical patent/CN114503541A/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/95 Computational photography systems, e.g. light-field imaging systems
    • H04N23/951 Computational photography systems, e.g. light-field imaging systems by using two or more images to influence resolution, frame rate or aspect ratio
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/223 Analysis of motion using block-matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/32 Determination of transform parameters for the alignment of images, i.e. image registration using correlation-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/646 Circuits for processing colour signals for image enhancement, e.g. vertical detail restoration, cross-colour elimination, contour correction, chrominance trapping filters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/73 Colour balance circuits, e.g. white balance circuits or colour temperature control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10141 Special mode during image acquisition
    • G06T2207/10144 Varying exposure
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • G06T2207/20201 Motion blur correction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • G06T2207/20208 High dynamic range [HDR] image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method includes: receiving a reference image and a non-reference image; dividing the reference image into a plurality of tiles; determining, using an electronic device, a motion vector map based on coarse-to-fine motion vector estimation; and generating an output frame using the motion vector map together with the reference image and the non-reference image.

Description

Apparatus and method for efficient regularized image alignment for multi-frame fusion
Technical Field
The present disclosure relates generally to image capture systems. More particularly, the present disclosure relates to an apparatus and method for regularized image alignment for multi-frame fusion.
Background
In multi-frame fusion, aligning each non-reference frame with a selected reference frame is a critical step. If this step is of low quality, it directly affects the subsequent image blending steps and may lead to an insufficient level of blending or even to ghosting artifacts. Global image registration using a global transformation matrix is a common and efficient way to achieve alignment, but it can only reduce misalignment due to camera motion, and sometimes a reliable solution cannot be found at all when matching features are absent. In High Dynamic Range (HDR) applications, this situation occurs frequently because the input frames are underexposed or overexposed. An alternative is to find dense correspondences between frames using methods such as optical flow. Although these methods produce high-quality alignment, they carry a significant computational cost, which presents serious challenges on mobile platforms.
Disclosure of Invention
Technical solution
A method includes: receiving a reference image and a non-reference image; dividing the reference image into a plurality of tiles; determining, using an electronic device, a motion vector map based on coarse-to-fine motion vector estimation; and generating an output frame using the motion vector map together with the reference image and the non-reference image.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which like reference numbers represent like parts:
FIG. 1 illustrates an example network configuration including an electronic device in accordance with this disclosure;
FIGS. 2A and 2B illustrate an example process for efficient regularization image alignment using a multi-frame fusion algorithm according to this disclosure;
FIG. 3 illustrates coarse-to-fine tile-based motion vector estimation according to an example of this disclosure;
FIGS. 4A, 4B, 4C, and 4D illustrate example outlier removal according to this disclosure;
FIGS. 5A and 5B illustrate example structure-preserving refinements in accordance with this disclosure;
FIG. 6 illustrates an example reduction of ghosting artifacts for HDR applications in accordance with this disclosure;
FIG. 7 illustrates an example reduction of blending problems for MBR applications according to this disclosure; and
FIG. 8 illustrates an example method of efficient regularized image alignment for multi-frame fusion according to this disclosure.
Detailed Description
The present disclosure provides an apparatus and method for regularized image alignment for multi-frame fusion.
In a first embodiment, a method includes: receiving a reference image and a non-reference image; dividing the reference image into a plurality of tiles; determining, using an electronic device, a motion vector map based on coarse-to-fine motion vector estimation; and generating an output frame using the motion vector map together with the reference image and the non-reference image.
In a second embodiment, an electronic device includes at least one sensor and at least one processor. The at least one processor is configured to: receive a reference image and a non-reference image; divide the reference image into a plurality of tiles; determine a motion vector map using coarse-to-fine motion vector estimation; and generate an output frame using the motion vector map together with the reference image and the non-reference image.
In a third embodiment, a non-transitory machine-readable medium contains instructions that, when executed, cause at least one processor of an electronic device to: receive a reference image and a non-reference image; divide the reference image into a plurality of tiles; determine a motion vector map using coarse-to-fine motion vector estimation; and generate an output frame using the motion vector map together with the reference image and the non-reference image.
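As a rough illustration of the claimed steps, the following Python sketch performs a single-level version of the method: it divides the reference image into tiles, searches a small neighborhood of the non-reference image for each tile, builds a motion vector map, and then generates an output frame from the map and both images. The tile size, search radius, L2 matching cost, and simple averaging blend are illustrative assumptions only, not values or choices prescribed by this disclosure.

```python
import numpy as np

def estimate_motion_vector_map(ref, non_ref, tile=16, radius=4):
    """Brute-force search: one (dy, dx) motion vector per reference tile."""
    h, w = ref.shape
    rows, cols = h // tile, w // tile
    mv_map = np.zeros((rows, cols, 2), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            y, x = r * tile, c * tile
            ref_tile = ref[y:y + tile, x:x + tile].astype(np.float32)
            best, best_dy, best_dx = np.inf, 0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + tile > h or xx + tile > w:
                        continue
                    cand = non_ref[yy:yy + tile, xx:xx + tile].astype(np.float32)
                    cost = np.sum((ref_tile - cand) ** 2)  # L2 distance (same exposure)
                    if cost < best:
                        best, best_dy, best_dx = cost, dy, dx
            mv_map[r, c] = (best_dy, best_dx)
    return mv_map

def generate_output_frame(ref, non_ref, mv_map, tile=16):
    """Warp each non-reference tile by its motion vector, then average with the reference."""
    h, w = ref.shape
    warped = np.copy(ref)
    rows, cols, _ = mv_map.shape
    for r in range(rows):
        for c in range(cols):
            y, x = r * tile, c * tile
            dy, dx = mv_map[r, c]
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and yy + tile <= h and xx + tile <= w:
                warped[y:y + tile, x:x + tile] = non_ref[yy:yy + tile, xx:xx + tile]
    return ((ref.astype(np.float32) + warped.astype(np.float32)) / 2).astype(ref.dtype)
```

In the embodiments described below, the search is hierarchical (coarse to fine), the matching cost changes with exposure, the motion vectors pass through outlier rejection and structure-guided refinement, and the blend is label-map driven rather than a simple average.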
Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
Before beginning the detailed description below, it may be advantageous to set forth definitions of certain terms and phrases used throughout this patent document. The terms "transmit," "receive," and "communicate," as well as derivatives thereof, encompass both direct and indirect communication. The terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation. The term "or" is inclusive, meaning and/or. The phrase "associated with," as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like.
Further, various functions described below may be implemented or supported by one or more computer programs, each of which is formed from computer-readable programming code and embodied in a computer-readable medium. The terms "application" and "program" refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or portions thereof adapted for implementation by suitable computer readable program code. The phrase "computer readable program code" includes any type of computer code, including source code, object code, and executable code. The phrase "computer readable medium" includes any type of medium capable of being accessed by a computer, such as Read Only Memory (ROM), Random Access Memory (RAM), a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), or any other type of memory. Non-transitory computer readable media exclude wired, wireless, optical, or other communication links to transmit transitory electrical or other signals. Non-transitory computer readable media include media that permanently store data and media capable of storing data and later overwriting data, such as rewritable optical disks or erasable memory devices.
As used herein, terms and phrases such as "have," "may have," "include," or "may include" a feature (such as a value, function, operation, or component such as a part) indicate the presence of the feature and do not exclude the presence or addition of other features. Also, as used herein, the phrases "A or B," "at least one of A and/or B," or "one or more of A and/or B" may include all possible combinations of A and B. For example, "A or B," "at least one of A and/or B," or "one or more of A and/or B" may indicate all of (1) including at least one A, (2) including at least one B, or (3) including at least one A and at least one B. Furthermore, as used herein, the terms "first" and "second" may modify various components regardless of importance and do not limit the components. These terms are only used to distinguish one component from another. For example, a first user device and a second user device may indicate different user devices from each other, regardless of the order or importance of the devices. A first component may be termed a second component, and vice versa, without departing from the scope of the present disclosure.
It will be understood that, when an element (such as a first element) is referred to as being (operatively or communicatively) "coupled" or "connected" with/to another element (such as a second element), it can be coupled or connected with/to the other element directly or via a third element. In contrast, it will be understood that, when an element (such as a first element) is referred to as being "directly coupled" or "directly connected" with/to another element (such as a second element), no other element (such as a third element) intervenes between the element and the other element.
As used herein, the phrase "configured (or set) to" may be used interchangeably with the phrases "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the circumstances. The phrase "configured (or set) to" does not essentially mean "specifically designed in hardware to." Rather, the phrase "configured to" may mean that a device can perform an operation together with another device or parts. For example, the phrase "a processor configured (or set) to perform A, B, and C" may refer to a general-purpose processor (such as a CPU or an application processor) or a dedicated processor (such as an embedded processor) that performs the operations by executing one or more software programs stored in a memory device.
The terms and phrases used herein are provided only to describe some embodiments of the present disclosure and not to limit the scope of other embodiments of the present disclosure. It should be understood that the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. All terms and phrases used herein, including technical and scientific terms and phrases, have the same meanings as commonly understood by one of ordinary skill in the art to which the embodiments of the present disclosure belong, unless otherwise defined. It will be further understood that terms and phrases, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. In some cases, the terms and phrases defined herein may be interpreted to exclude embodiments of the present disclosure.
Examples of an "electronic device" according to embodiments of the present disclosure may include at least one of: a smartphone, a tablet personal computer (PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop computer, a notebook computer, a workstation, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device (such as smart glasses, a head-mounted device (HMD), electronic clothes, an electronic bracelet, an electronic necklace, an electronic accessory, an electronic tattoo, a smart mirror, or a smart watch). Definitions for other words and phrases may be provided throughout this patent document. Those of ordinary skill in the art should understand that in many, if not most, instances such definitions apply to prior as well as future uses of such defined words and phrases.
None of the description in this application should be read as implying that any particular element, step, or function is an essential element which must be included in the claim scope. The scope of patented subject matter is defined only by the claims.
Modes for the invention
Fig. 1 through 8 will be discussed below and various embodiments of the present disclosure will be described with reference to these figures. It is to be understood, however, that the present disclosure is not limited to those embodiments, and that all variations and/or equivalents or alternatives to those embodiments also fall within the scope of the present disclosure. The same or similar reference numbers may be used throughout the specification and drawings to refer to the same or like elements.
Speed and memory requirements motivate the construction of a simple algorithm that trades off computational cost against correspondence quality. A coarse-to-fine alignment over a four-level Gaussian pyramid of the input frames is first performed to find correspondences between image tiles. Thereafter, an outlier rejection step and a subsequent quadratic structure-preserving constraint are applied to reduce the image content distortion introduced by the previous step.
In multi-frame fusion, aligning each non-reference frame with a selected reference frame is a critical step. If this step is of low quality, it directly affects the subsequent image blending steps and may lead to an insufficient level of blending or even to ghosting artifacts. Global image registration algorithms that use a global transformation matrix are a common and efficient way to achieve alignment, but these algorithms only reduce misalignment caused by camera motion and sometimes cannot find a reliable solution in the absence of matching features (which occurs frequently in HDR applications because the input frames are under- or over-exposed). An alternative is to find dense correspondences between frames, such as with optical flow. Although this approach yields high-quality alignment, its relatively high computational cost presents significant challenges on mobile platforms. One or more embodiments of the present disclosure provide a simple algorithm, driven by speed and memory requirements, that trades off computational cost against correspondence quality. A coarse-to-fine alignment over a four-level Gaussian pyramid of the input frames is first performed to find correspondences between image tiles. Thereafter, an outlier rejection step and a subsequent quadratic structure-preserving constraint are applied to reduce the image content distortion introduced by the previous step. The effectiveness and efficiency of the approach have been demonstrated on a large number of input frames for HDR and motion blur reduction (MBR) applications. One or more embodiments of the present disclosure provide algorithms that can align multiple images/frames in the presence of camera motion or small object motion without introducing significant image distortion. Such alignment is an essential component in the pipeline of any multi-frame blending algorithm, such as high dynamic range imaging and motion blur reduction, both of which fuse several images captured at different exposure/ISO settings.
Fig. 1 illustrates an example network configuration 100 including an electronic device according to this disclosure. The embodiment of the network configuration 100 shown in fig. 1 is for illustration only. Other embodiments of the network configuration 100 may be used without departing from the scope of this disclosure.
According to one or more embodiments of the present disclosure, an electronic device 101 is included in a network configuration 100. The electronic device 101 may include at least one of a bus 110, a processor 120, a memory 130, an input/output (I/O) interface 150, a display 160, a communication interface 170, or sensors 180. In some embodiments, the electronic device 101 may exclude at least one of these components or may add at least one other component. The bus 110 includes circuitry for interconnecting the components 120-180 and for passing communications (such as control messages and/or data) between these components.
The processor 120 includes one or more of a Central Processing Unit (CPU), an Application Processor (AP), or a Communication Processor (CP). The processor 120 is capable of controlling at least one of the other components of the electronic device 101 and/or performing operations or data processing related to communications. In some embodiments, the processor 120 is a Graphics Processing Unit (GPU). For example, the processor 120 may receive image data captured by at least one camera during a capture event. The processor 120 is capable of processing the image data using, among other things, tile-based label maps (as discussed in more detail below) to generate an HDR image of a dynamic scene.
The memory 130 may include volatile and/or non-volatile memory. For example, the memory 130 may store commands or data related to at least one other component of the electronic device 101. Memory 130 may store software and/or programs 140 in accordance with one or more embodiments of the present disclosure. Programs 140 include, for example, a kernel 141, middleware 143, an Application Programming Interface (API) 145, and/or application programs (or "applications") 147. At least a portion of the kernel 141, the middleware 143, or the API 145 may be referred to as an Operating System (OS).
The kernel 141 may control or manage system resources (such as the bus 110, the processor 120, or the memory 130) for performing operations or functions implemented in other programs (such as the middleware 143, the API 145, or the application 147). The kernel 141 provides an interface that allows the middleware 143, API 145, or application 147 to access various components of the electronic device 101 to control or manage system resources. Applications 147 include one or more applications for image capture, as discussed below. These functions may be performed by a single application or may be performed by multiple applications that each perform one or more of these functions. The middleware 143 can act as a relay, allowing the API 145 or application 147 to communicate with, for example, the kernel 141. Multiple applications 147 may be provided. Middleware 143 can control work requests received from applications 147, such as by assigning priority to at least one of the plurality of applications 147 for use of system resources of electronic device 101, such as bus 110, processor 120, or memory 130. The API 145 is an interface that allows the application 147 to control functions provided by the kernel 141 or the middleware 143. For example, the API 145 includes at least one interface or function (such as a command) for document control, window control, image processing, or text control.
The I/O interface 150 functions as an interface capable of communicating command or data inputs from a user or other external device to other components of the electronic device 101, for example. I/O interface 150 may also output commands or data received from other components of electronic device 101 to a user or other external device.
The display 160 includes, for example, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, an Organic Light Emitting Diode (OLED) display, a quantum dot light emitting diode (QLED) display, a micro-electro-mechanical systems (MEMS) display, or an electronic paper display. The display 160 may also be a depth aware display, such as a multi-focus display. The display 160 is capable of displaying various content (such as text, images, video, icons, and/or symbols) to a user. The display 160 may include a touch screen and may receive touch, gesture, proximity, or hover input made using, for example, an electronic pen or a user body part.
The communication interface 170 is, for example, capable of establishing communication between the electronic device 101 and an external device, such as the first external electronic device 102, the second external electronic device 104, or the server 106. For example, the communication interface 170 may be connected with the network 162 or 164 through wireless or wired communication, thereby communicating with an external electronic device. The communication interface 170 may be a wired or wireless transceiver or any other component for sending and receiving signals, such as images.
The electronic device 101 further includes one or more sensors 180 capable of metering physical quantities or detecting activation states of the electronic device 101 and converting the metered or detected information into electrical signals. For example, the one or more sensors 180 may include one or more buttons for touch input, one or more cameras, gesture sensors, gyroscopes or gyroscope sensors, barometric pressure sensors, magnetic sensors or magnetometers, acceleration sensors or accelerometers, grip sensors, proximity sensors, color sensors (such as red, green, blue (RGB) sensors), biophysical sensors, temperature sensors, humidity sensors, lighting sensors, Ultraviolet (UV) sensors, Electromyogram (EMG) sensors, electroencephalogram (EEG) sensors, Electrocardiogram (ECG) sensors, Infrared (IR) sensors, ultrasonic sensors, iris sensors, or fingerprint sensors. Sensor(s) 180 may also include an inertial measurement unit, which may include one or more accelerometers, gyroscopes, and other components. The sensor(s) 180 may further comprise a control unit for controlling at least one sensor comprised therein. Any of these sensor(s) 180 may be located within the electronic device 101.
The first external electronic device 102 or the second external electronic device 104 may be a wearable device or an electronic device-mountable wearable device (such as an HMD). When the electronic device 101 is mounted in the electronic device 102 (such as an HMD), the electronic device 101 may communicate with the electronic device 102 through the communication interface 170. The electronic device 101 may be directly connected with the electronic device 102 to communicate with the electronic device 102 without involving a separate network. The electronic device 101 may also be an augmented reality wearable device (such as eyeglasses) that includes one or more cameras.
For example, the wireless communication can use at least one of the following as cellular communication protocols: long Term Evolution (LTE), long term evolution-advanced (LTE-a), fifth generation wireless system (5G), millimeter wave or 60GHz wireless communication, wireless USB, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Universal Mobile Telecommunications System (UMTS), wireless broadband (WiBro), or global system for mobile communications (GSM). The wired connection may include, for example, at least one of a Universal Serial Bus (USB), a high-definition multimedia interface (HDMI), a recommended standard 232(RS-232), or Plain Old Telephone Service (POTS). Network 162 includes at least one communication network, such as a computer network (e.g., a Local Area Network (LAN) or a Wide Area Network (WAN)), the Internet, or a telephone network.
The first and second external electronic devices 102, 104 and the server 106 may each be the same or different type of device as the electronic device 101. According to certain embodiments of the present disclosure, the server 106 includes a set of one or more servers. Also, all or part of the operations performed on electronic device 101 may be performed on another electronic device or on a plurality of other electronic devices (such as electronic devices 102 and 104 or server 106), in accordance with some embodiments of the present disclosure. Further, according to some embodiments of the present disclosure, when electronic device 101 should automatically or upon request perform a certain function or service, instead of or in addition to performing the function or service on itself, electronic device 101 may request another device (such as electronic devices 102 and 104 or server 106) to perform at least some of the functions associated therewith. Another electronic device, such as electronic devices 102 and 104 or server 106, can perform the requested function or additional functions and communicate the results of the execution to electronic device 101. The electronic device 101 may provide the requested function or service by processing the received result as it is or in addition. To this end, cloud computing, distributed computing, or client-server computing techniques, for example, may be used. Although fig. 1 shows the electronic device 101 as including a communication interface 170 to communicate with an external electronic device 104 or server 106 via the network 162, the electronic device 101 may be standalone without requiring separate communication functionality, according to some embodiments of the present disclosure.
The server 106 may optionally support the electronic device 101 by performing or supporting at least one of the operations (or functions) implemented on the electronic device 101. For example, the server 106 may include a processing module or processor that supports the processor 120 implemented in the electronic device 101.
Although fig. 1 shows one example of a network configuration 100 including an electronic device 101, various changes may be made to fig. 1. For example, network configuration 100 may include any number of each component in any suitable arrangement. In general, computing and communication systems have a wide variety of configurations, and fig. 1 is not intended to limit the scope of the present disclosure to any particular configuration. Moreover, although FIG. 1 illustrates an operating environment in which the various features disclosed in this patent document may be used, these features may be used in any other suitable system.
Fig. 2A and 2B illustrate an example process for efficient regularization image alignment using a multi-frame fusion algorithm according to this disclosure. For ease of explanation, the process 200 shown in fig. 2A is described as being performed using the electronic device 101 shown in fig. 1. However, the process 200 shown in fig. 2A may be used in any suitable system in conjunction with any other suitable electronic device.
In process 200, multiple image frames of a scene are captured at different exposures and processed to generate a fused output. The frames are blended, with additional chrominance processing, to reduce ghosting and blurring in the image.
Process 200 involves capturing a plurality of image frames 205. In the example shown in FIG. 2A, two image frames are captured and processed, although more than two image frames may be used. Each image frame 205 is captured at a different exposure, such as when one of the image frames 205 is captured using an automatic exposure ("auto-exposure") or other longer exposure and the other of the image frames 205 is captured using a shorter exposure (as compared to the auto or longer exposure). An auto-exposure generally refers to an exposure that is determined automatically by a camera or other device, with little or no user input. In some embodiments, the user is allowed to specify an exposure mode (such as portrait, landscape, sports, or other mode), and the auto-exposure can be generated based on the selected exposure mode without any other user input. Each exposure setting is typically associated with different camera settings, such as a different aperture, shutter speed, and camera sensor sensitivity. Shorter-exposure image frames are typically darker, lack image detail, and have more noise than auto-exposure or other longer-exposure image frames. Thus, a shorter-exposure image frame may include one or more under-exposed regions, while an auto-exposure or other longer-exposure image frame may include one or more over-exposed regions. In some embodiments, the short-exposure frame may have only a shorter exposure time but a higher ISO to match the total image brightness of the auto-exposure or long-exposure frame. Note that while often described below as involving the use of an auto-exposure image frame and at least one shorter-exposure image frame, embodiments of the present disclosure may be used with any suitable combination of image frames captured using different exposures.
In some cases, during a capture operation, the processor 120 may control a camera of the electronic device 101 to enable rapid capture of the image frames 205, such as in a burst mode. The capture request that triggers the capture of the image frame 205 represents any command or input indicating a need or expectation to capture an image of a scene using the electronic device 101. For example, the capture request may be initiated in response to the user pressing a "soft" button presented on display 160 or the user pressing a "hard" button. In the illustrated example, two image frames 205 are captured in response to a capture request, although more than two images may be captured. The image frames 205 may be generated in any suitable manner, such as simply capturing each image frame by a camera or capturing multiple initial image frames using a multi-frame fusion technique and combining them into one or more of the image frames 205.
During the next operation, one image frame 205 may be used as a reference image frame and another image frame 205 may be used as a non-reference image frame. Depending on the situation, the reference image frame may represent an auto-exposure or other longer exposure image frame, or the reference image frame may represent a shorter exposure image frame. In some embodiments, an auto-exposure or other longer exposure image frame may be used as a reference image frame by default, as doing so allows for greater use of image frames with more image detail in generating a composite or final image of a scene. However, as described below, there are situations where it is undesirable to do so (such as due to the build up of image artifacts), in which case a shorter exposure image frame may be selected as the reference image frame.
In a pre-processing operation 210, the raw image frames are pre-processed in some manner to provide part of the image processing. For example, the pre-processing operation 210 may perform a white balance function to change or correct the color balance in the raw image frames. The pre-processing operation 210 may also reconstruct full-color image frames from the incomplete color samples contained in the raw image frames using a mask (such as a color filter array (CFA) mask).
As shown in FIG. 2A, the image frames 205 are provided to an image alignment operation 215. The image alignment operation aligns the image frames 205 and produces aligned image frames. For example, the image alignment operation 215 may modify the non-reference image frame so that particular features in the non-reference image frame align with corresponding features in the reference image frame. In this example, one of the aligned image frames may represent an aligned version of the reference image frame and another of the aligned image frames may represent an aligned version of the non-reference image frame. Alignment may be needed to compensate for misalignment caused by the electronic device 101 moving or rotating between image capture events, which causes objects in the image frames 205 to move or rotate slightly (as is common with handheld devices). The image frames 205 may be aligned both geometrically and photometrically. In some embodiments, the image alignment operation 215 may align the image frames using global Oriented FAST and Rotated BRIEF (ORB) features and local features from a block search, although other implementations of the image registration operation could also be used. Note that the reference image frame here may or may not be modified during alignment, and the non-reference image frame may represent the only image frame that is modified during alignment.
As part of the pre-processing operation 210, histogram matching may be performed on the aligned non-reference image frame. Histogram matching matches the histogram of the non-reference image frame to the histogram of the reference image frame, such as by applying a suitable transfer function to the aligned image frame. For example, histogram matching may operate to make the brightness levels of the two aligned image frames substantially equal. Histogram matching may involve increasing the brightness of the shorter-exposure image frame to substantially match the brightness of the auto-exposure or other longer-exposure image frame, although the reverse may occur instead. This also results in the generation of pre-processed aligned image frames associated with the aligned image frames. More details regarding the image alignment operation 215 will be described in connection with FIG. 2B.
Thereafter, the aligned image frames are blended in a blending operation 220. The blending operation 220 blends or otherwise combines pixels from the image frames based on the label map(s) to produce at least one final image of the scene. The final image generally represents a blend of the image frames, where each pixel in the final image is extracted from either the reference image frame or the non-reference image frame (depending on the corresponding value in the label map). Once the appropriate pixels are extracted from the image frames and the image is formed with them, additional image processing operations may occur. Ideally, the final image has little or no artifacts and has improved image detail, even in areas where at least one of the image frames 205 was over-exposed or under-exposed.
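A minimal sketch of the per-pixel, label-map-driven selection described above is shown below. Grayscale frames and a binary label map are assumed for simplicity; the actual blending operation 220 may be considerably more elaborate.

```python
import numpy as np

def blend_with_label_map(ref, non_ref_aligned, label_map):
    """Per-pixel selection: where label_map is nonzero, take the reference pixel;
    otherwise take the aligned non-reference pixel."""
    return np.where(label_map.astype(bool), ref, non_ref_aligned)

# Hypothetical usage: the label map might mark pixels where the reference is well exposed
# or where the two frames disagree (ghost risk), so those pixels come from the reference.
# final = blend_with_label_map(reference_frame, warped_non_reference, label_map)
```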
The blended frames are then subjected to post-processing operations 225. Post-processing operation 225 may perform any processing on the blended image to complete fused output image 230.
FIG. 2B illustrates an example local image alignment operation 215 according to this disclosure. When multiple frames are captured, the camera does not remain perfectly stationary. Slight movement of the camera between the captured images causes the images to be misaligned, so the images must be aligned.
Global alignment determines how much the camera has moved between each of these frames. A motion model is assigned to the image based on the estimated camera movement, and the image is corrected relative to the reference frame using that model. This approach captures only camera motion, and the model is not ideal. Even when only camera motion is estimated for a single frame, the assumption may not hold for a real camera, since there are always moving objects in the scene, and there can be other secondary effects such as depth-related distortion caused by motion. The result is only an approximation of the camera motion. Even when two images appear to be aligned, features in the globally aligned images may still not be aligned correctly.
Some alignment approaches start with a simple global alignment and then allow an alignment adapted to the local content. One way to achieve this is with optical flow methods. Optical flow methods attempt to find, for every location in the non-reference image, the corresponding location in the reference image. If this is performed for every pixel, a full map can be generated that reflects how much movement has occurred. This approach is time-consuming and expensive and is impractical for mobile electronic devices. Optical flow also runs into complications when there is little actual image content (such as a flat wall or sky), where the output will be very noisy or simply wrong.
Other alignment approaches aim to obtain the quality of optical flow in the alignment without its cost, and they add more regularization to ensure that objects are properly aligned in the final image. Small object motions are also a factor in the correction, as in the case of camera movement, even though objects in the image move independently of the camera. Small motion tends to accompany the camera and appears as camera motion, while large motion looks like actual scene motion and falls outside the alignment determination.
The image alignment operations may include a histogram matching operation 235, a coarse to fine tile-based motion vector estimation operation 240, an outlier removal operation 245, and a structure-guided refinement operation 250. The image alignment operation 215 receives the input reference frame 255 and the non-reference frame 260 that have been processed in the pre-processing operation 210. During subsequent operations, one image frame may be used as a reference image frame 255 and another image frame may be used as a non-reference image frame 260. Reference image frame 255 may represent an auto-exposure or other longer exposure image frame, or reference image frame 255 may represent a shorter exposure image frame, as the case may be. In some embodiments, an auto-exposure or other longer exposure image frame may be used as the reference image frame 255 by default, as doing so allows for greater use of image frames with more image detail in generating a composite or final image of the scene. As described below, it may be undesirable to do so in some situations (such as due to the establishment of image artifacts), in which case a shorter exposure image frame may be selected as the reference image frame 255.
The image alignment operation 215 includes a histogram matching operation 235. Histogram matching occurs first, so that multiple images or frames with different capture configurations reach the same brightness level. Histogram matching is needed for the subsequent motion vector search. The purpose of histogram matching is to take two images that may not have been exposed in the same manner and adjust them accordingly. Comparing an under-exposed image with a correctly exposed image involves one image that is darker than the other, and the darker image may need to be adjusted toward the correctly exposed image before the two images can be compared properly. A short exposure time with a high gain, used to achieve the same overall brightness as a longer exposure with a low gain, will also require histogram matching. Such pictures should look the same but will still differ slightly, and histogram matching normalizes the differences between the images so that they look as similar as possible.
Histogram matching occurs first to bring multiple images or frames with different capture configurations to the same brightness level. This histogram matching supports the search for motion vectors; it is not necessary for sparse features such as Oriented FAST and Rotated BRIEF (ORB). The histogram of the non-reference frame 260 is compared to the histogram of the reference frame 255, and the histogram of the non-reference frame 260 is transformed to match the histogram of the reference frame 255. The transformed histogram of the non-reference frame may be used later in the image alignment operation 215 and may be used in the image blending operation 220 or the post-processing operation 225.
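A minimal sketch of cumulative-histogram matching for 8-bit frames is shown below. The transfer function maps each non-reference intensity level to the reference level with the closest cumulative distribution value; this particular construction is a common technique assumed here for illustration rather than the exact transform used by the histogram matching operation 235.

```python
import numpy as np

def match_histogram(non_ref, ref):
    """Map non-reference intensities so their CDF matches the reference CDF (uint8 images)."""
    ref_hist = np.bincount(ref.ravel(), minlength=256).astype(np.float64)
    src_hist = np.bincount(non_ref.ravel(), minlength=256).astype(np.float64)
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    # For each source level, pick the first reference level whose CDF reaches the same value.
    transfer = np.searchsorted(ref_cdf, src_cdf).clip(0, 255).astype(np.uint8)
    return transfer[non_ref]
```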
Tile-based motion vector estimation decomposes the image into a plurality of tiles and attempts to find a motion map, similar to a flow field. The goal is to find a motion vector for each tile that locates the tile's content in another frame. The result is a two-dimensional (2D) map of motion vectors for the reference frame. The tile-based motion vectors generate uniformly distributed features. Sparse features such as ORB, by contrast, are sparse and sometimes biased toward local regions. This distribution characteristic reduces the likelihood of registration failure.
Motion vectors for tiles can be determined between the reference frame 255 and the non-reference frame 260 by the coarse-to-fine tile-based motion vector estimation (MV estimation) 240. A tile-based search for motion vectors can generate uniformly distributed features, whereas sparse features such as ORB are sparse and sometimes biased toward local regions. This distribution characteristic reduces the likelihood of registration failure. The search for motion vectors (features) can be performed in a coarse-to-fine scheme to reduce the search radius, which in turn reduces processing time. The choice of tile size and search radius ensures that most common cases can be covered. A sub-pixel search improves the alignment accuracy, and normalized cross-correlation is employed to be robust to poor histogram matching or different noise levels. The output of MV estimation 240 is a motion vector for each of the tiles into which the reference frame 255 is divided. The set of motion vectors for the reference frame 255 and the non-reference frame 260 is output for subsequent use in the image alignment operation 215, in the image blending operation 220, or in the post-processing operation 225. MV estimation 240 may use a hierarchical implementation to narrow the search radius and thereby obtain faster results. MV estimation 240 may search for motion vectors at the sub-pixel level for higher accuracy. MV estimation 240 may search for the motion vector of each tile to provide enough matching features for a more robust motion vector estimation. MV estimation 240 may use an L2 distance search for images with the same exposure; for images with different exposures, the L2 distance may be replaced by normalized cross-correlation. More details regarding MV estimation 240 will be described in connection with FIG. 3.
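The normalized cross-correlation cost mentioned above can be sketched as follows. The negative sign turns it into a cost that can be minimized as a drop-in replacement for the L2 distance in a tile search; this drop-in form is an assumption made for illustration.

```python
import numpy as np

def ncc_cost(ref_tile, cand_tile, eps=1e-6):
    """Negative normalized cross-correlation: lower is a better match. Mean subtraction
    and normalization make it robust to residual brightness/contrast differences left
    over after imperfect histogram matching or different noise levels."""
    a = ref_tile.astype(np.float64) - ref_tile.mean()
    b = cand_tile.astype(np.float64) - cand_tile.mean()
    return -float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))
```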
The outlier removal operation 245 can receive the locally aligned non-reference frame 260 and can determine whether any outliers were generated during the motion vector estimation 240. Outlier removal removes motion vectors that do not match the global motion. For example, a person moving across a frame may produce motion vectors in significantly different directions, and this motion will be marked as an outlier.
Outlier rejection can reject undesirable motion vectors (motion vectors on featureless (flat) areas and large-motion areas) in some multi-frame fusion modes. In multi-frame High Dynamic Range (HDR) applications, motion vectors are unreliable over featureless regions due to the lack of features. In other multi-frame applications (such as multi-frame noise suppression), motion vectors on large moving objects are not constrained during the search. Both cases may result in distortion of the non-reference frame image. This is generally not a problem because the blending operation 220 can reject inconsistent pixels between the reference image 255 and the warped non-reference image 265, but in some fusion modes (such as HDR), the image content in the non-reference image 260 is unique and may appear in the final composite image. Outlier removal 245 can reject erroneous motion vector outliers to avoid possible image distortion, resulting in a more robust image. More details regarding the outlier removal operation 245 will be described in connection with FIG. 4.
The structure-guided refinement operation 250 can protect the image structure from problems such as warping distortion. A quadratic optimization problem is set up to preserve structures in the image. For example, if a building with straight edges appears in the image, the edges of the building should not be distorted. The refinement operation 250 maintains such objects in the scene, such as the edges of buildings.
Structure-guided mesh warping may be used in the multi-frame fusion mode. The refinement operation 250 can constrain featureless regions and motion regions to follow a global transformation while deforming the remaining image regions according to the searched features. Compared to ordinary optical flow, the quadratic optimization can be solved in closed form, and the processing time is feasible on mobile platforms. The refinement operation 250 can pose the constraints as a quadratic optimization problem and solve it using linear equations, yielding faster results. The refinement operation 250 can add similarity constraints and global constraints to reduce image content distortion for enhanced structure preservation. The refinement operation can output the warped non-reference frame 265 to the image blending operation 220 or the post-processing operation 225. More details regarding the structure-guided refinement operation 250 will be described in connection with FIG. 5.
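One way to read the closed-form quadratic formulation is as a linear least-squares solve over the mesh vertex positions: a data term pulls vertices toward positions suggested by the searched motion vectors, a global term pulls them toward the global-transformation prediction, and a similarity term preserves the spacing of neighboring vertices. The sketch below is an assumed formulation with illustrative weights, not the exact constraint set of this disclosure.

```python
import numpy as np

def refine_mesh(grid_xy, affine_xy, target_xy, has_target,
                w_data=1.0, w_global=0.2, w_smooth=0.5):
    """grid_xy/affine_xy/target_xy: (R, C, 2) original, affine-predicted, and
    feature-suggested vertex positions; has_target: (R, C) bool, False where the
    motion vector was rejected. Returns refined vertices from a least-squares solve."""
    R, C, _ = grid_xy.shape
    n = R * C
    out = np.zeros((R, C, 2))

    def index(r, c):
        return r * C + c

    for dim in range(2):  # x and y are decoupled, so solve them independently
        rows_A, rows_b = [], []

        def add_eq(coeffs, rhs, weight):
            row = np.zeros(n)
            for i, a in coeffs:
                row[i] = a
            rows_A.append(np.sqrt(weight) * row)
            rows_b.append(np.sqrt(weight) * rhs)

        for r in range(R):
            for c in range(C):
                i = index(r, c)
                if has_target[r, c]:  # data term: follow the searched feature
                    add_eq([(i, 1.0)], target_xy[r, c, dim], w_data)
                add_eq([(i, 1.0)], affine_xy[r, c, dim], w_global)  # global constraint
                if c + 1 < C:  # similarity term: keep the original grid spacing
                    add_eq([(i, 1.0), (index(r, c + 1), -1.0)],
                           grid_xy[r, c, dim] - grid_xy[r, c + 1, dim], w_smooth)
                if r + 1 < R:
                    add_eq([(i, 1.0), (index(r + 1, c), -1.0)],
                           grid_xy[r, c, dim] - grid_xy[r + 1, c, dim], w_smooth)

        A, b = np.vstack(rows_A), np.array(rows_b)
        out[..., dim] = np.linalg.lstsq(A, b, rcond=None)[0].reshape(R, C)
    return out
```

Because the system is linear, the solve is a single closed-form step rather than an iterative optimization, which is what keeps the processing time feasible on mobile platforms.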
Although fig. 2A and 2B illustrate an example of a process for efficient regularized image alignment using a multi-frame fusion algorithm, various changes may be made to fig. 2A and 2B. For example, while shown as a particular sequence of steps, various operations shown in fig. 2A and 2B may overlap, occur in parallel, occur in a different order, or occur any number of times. Also, the specific operations shown in fig. 2A and 2B are merely examples, and each of the operations shown in fig. 2A and 2B may be performed using other techniques.
FIG. 3 illustrates coarse-to-fine tile-based motion vector estimation 240 according to an example of the present disclosure. In particular, FIG. 3 illustrates how the motion vector for each tile partitioned from the reference frame 255 can be generated quickly and accurately using the coarse-to-fine tile-based motion vector estimation 240 shown in FIG. 2B. For ease of explanation, the generation of motion vectors is described as being performed using the electronic device 101 of FIG. 1, but any other suitable electronic device in any suitable system may be used.
The reference frame 325 and the non-reference frame 330 correspond to the reference frame 255 and the non-reference frame 260, respectively, shown in FIG. 2B. Frames 325 and 330 depict a person sitting on a window sill in an office building while waving the left hand. Behind the person is a window with an urban landscape that includes many natural objects (trees, clouds, and so on) and man-made objects (other buildings, roads, and so on). The hand is raised between the reference frame 325 and the non-reference frame 330, and the arm movement causes a smaller shift in the overall posture of the person. The motion vector map is able to accurately detect these different motions to obtain enhanced local alignment.
The coarse-to-fine tile-based motion vector estimation starts from the high-resolution image and reduces it to lower resolutions in steps 305a-305d. Frames 325 and 330 are then broken into tiles at each step. Beginning at the low-resolution step 305a, a tile 310 from the reference frame 325 is searched for in the other frame, such as the non-reference frame 330. In other words, the reference image frame is reduced in resolution and then divided into tiles, and, starting at the low-resolution step 305a, the motion of each tile in the reference image is found in the non-reference image. Doing so allows a large area to be searched at low resolution, covering more content.
Motion is found at the low-resolution first step 305a, after which the search moves to the higher-resolution second step 305b. The tile 310 from the low-resolution step 305a is used as a search area 315 in the second step 305b. The search area 315 is divided into different tiles 320, which are searched for a match with the current tile 310 at the different steps 305a-305d. Once the tile 310 is found in the second step 305b, the search moves up again to the next higher-resolution third step 305c.
In the third step 305c, the tile in which tile 310 was found to match is used as the search area 315 for the third step 305c. The electronic device 101 may divide the search area 315 in the third step 305c into a plurality of tiles, which are then searched for tile 310. Once tile 310 is found among the tiles of the third step 305c, the search moves up to the highest-resolution fourth step 305d.
The electronic device 101 may search, in the highest-resolution fourth step 305d, the area of the tile 320 identified in the third step 305c. Once tile 310 matches a tile 320 in the fourth step 305d, tile 310 is considered located in the non-reference image, and a motion vector can be derived that describes how far the location of tile 320 is from the location of tile 310 in the reference image 325. The motion vector 340 can be marked in the motion vector map 335.
The respective motion vector maps 335 for each resolution show the motion vectors 340 based on color and intensity of color. The color indicates the direction of the motion vector, and the intensity of the color indicates the amount of movement within the specified direction. A stronger or darker color indicates significant movement between image frames.
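The color-coded map can be rendered with a standard HSV mapping, where hue encodes direction and value encodes magnitude. The sketch below assumes OpenCV's 0-179 hue range and is only one possible way to visualize a motion vector map such as 335.

```python
import numpy as np
import cv2

def visualize_mv_map(mv_map):
    """Render a (rows, cols, 2) motion vector map: hue = direction, value = magnitude."""
    dy, dx = mv_map[..., 0].astype(np.float64), mv_map[..., 1].astype(np.float64)
    mag = np.hypot(dx, dy)
    hue = (np.degrees(np.arctan2(dy, dx)) % 360.0) / 2.0  # OpenCV hue spans 0-179
    hsv = np.zeros(mv_map.shape[:2] + (3,), dtype=np.uint8)
    hsv[..., 0] = hue.astype(np.uint8)
    hsv[..., 1] = 255
    hsv[..., 2] = np.clip(255.0 * mag / max(mag.max(), 1e-6), 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```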
Although the search area 315 appears in the upper-left corner of each respective step 305a-305d, the search area may be located at any tile of the finer-resolution steps 305b-305d. The example shown in FIG. 3 indicates that the movement of tile 310 in the upper-left corner is negligible in the non-reference frame, so the corresponding motion vector 340 in the motion vector map 335 will have very low intensity, if any color at all.
In the example shown in FIG. 3, an auto-exposure or other longer-exposure image frame is used as the reference image frame 325 during the coarse-to-fine tile-based motion vector estimation 240, and any regions of the image frame 325 containing motion may be replaced with corresponding regions of the non-reference image frame 330. Note that there may be various situations in which one or more saturated regions in a longer-exposure image frame are partially obscured by at least one moving object in a shorter-exposure image frame.
The motion vector estimation 240 may perform a coarse-to-fine alignment over a four-level Gaussian pyramid of the input frames. The electronic device 101 may divide the reference image frame into a plurality of tiles. At each pyramid level, the electronic device 101 may search, for each tile in the reference image frame, for the corresponding tile within a neighborhood of the non-reference image frame 330. The tile size and search radius may vary from level to level (305a-305d).
The electronic device 101 may evaluate multiple hypotheses when upsampling motion vectors from coarser levels to avoid boundary problems. For images with the same exposure, the electronic device 101 may search for the motion vector 340 by minimizing the L2 norm distance. For differently exposed images, the search is performed by maximizing the normalized cross-correlation.
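One plausible form of the multiple-hypothesis upsampling is sketched below: each fine-level tile considers the doubled motion vector of its co-located coarse tile plus those of two nearby coarse tiles, and keeps whichever hypothesis gives the lowest matching cost before the local refinement search. The specific neighbor choice, the factor of two, and the cost-function interface are assumptions made for illustration.

```python
import numpy as np

def candidate_seeds(coarse_mv, r, c):
    """Scaled motion-vector hypotheses for fine-level tile (r, c): the co-located coarse
    tile and the horizontally and vertically adjacent coarse tiles on the nearer side."""
    R, C, _ = coarse_mv.shape
    cr, cc = min(r // 2, R - 1), min(c // 2, C - 1)
    nr = int(np.clip(cr + (1 if r % 2 else -1), 0, R - 1))
    nc = int(np.clip(cc + (1 if c % 2 else -1), 0, C - 1))
    seeds = {tuple(2.0 * coarse_mv[cr, cc]),
             tuple(2.0 * coarse_mv[nr, cc]),
             tuple(2.0 * coarse_mv[cr, nc])}
    return [np.array(s, dtype=np.float32) for s in seeds]

def best_seed(ref_tile, non_ref, y, x, seeds, cost_fn):
    """Pick the hypothesis whose shifted tile gives the lowest matching cost."""
    t, (h, w) = ref_tile.shape[0], non_ref.shape
    best_cost, best = np.inf, seeds[0]
    for s in seeds:
        yy, xx = int(round(y + s[0])), int(round(x + s[1]))
        if 0 <= yy and 0 <= xx and yy + t <= h and xx + t <= w:
            cost = cost_fn(ref_tile, non_ref[yy:yy + t, xx:xx + t])
            if cost < best_cost:
                best_cost, best = cost, s
    return best
```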
The search method described above generates only pixel-level alignments. To obtain motion vectors with sub-pixel accuracy, the electronic device 101 may fit a quadratic function near the pixel-level minimum and compute the sub-pixel minimum directly.
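One common way to realize such a fit, shown here only as an assumption about the exact form, is a separable one-dimensional parabola through the cost at the integer minimum and its two neighbors along each axis:

import numpy as np

def subpixel_minimum(cost, x, y):
    # cost: 2-D matching-cost surface; (x, y): integer location of its minimum.
    def parabola_offset(c_minus, c_zero, c_plus):
        denom = c_minus - 2.0 * c_zero + c_plus
        return 0.0 if abs(denom) < 1e-12 else 0.5 * (c_minus - c_plus) / denom
    dx = dy = 0.0
    if 0 < x < cost.shape[1] - 1:
        dx = parabola_offset(cost[y, x - 1], cost[y, x], cost[y, x + 1])
    if 0 < y < cost.shape[0] - 1:
        dy = parabola_offset(cost[y - 1, x], cost[y, x], cost[y + 1, x])
    return x + dx, y + dy   # sub-pixel minimum location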
Fig. 4A-4D illustrate example outlier removal for non-reference images used in generating HDR images of dynamic scenes in accordance with the present disclosure. In particular, fig. 4A-4D illustrate how the image outlier removal 245 shown in fig. 2B may be used to generate an aligned non-reference image for use in generating an HDR image of a dynamic scene. For ease of explanation, the generation of the outlier-removed non-reference image 260 herein is described as being performed using the electronic device 101 of FIG. 1, but any other suitable electronic device in any suitable system may be used.
Image 405 shows a non-reference image prior to local alignment. The background of the image appears as a featureless area. The hands appear blurred due to the movement and the arms appear to swing.
Motion vectors in "depth-free regions" (such as saturated regions or featureless regions) may be unreliable. Depth-free regions are detected by comparing the gradient magnitude accumulation within a tile to a predefined threshold. A "large motion vector" may cause significant image distortion if used directly for image warping.
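A per-tile test of this kind could be sketched as follows. The use of Sobel gradients, the normalization by tile area, and the threshold value are illustrative assumptions, and the convention here is that a tile whose accumulated gradient magnitude is too small is treated as lacking enough detail for a reliable motion vector.

import cv2
import numpy as np

def is_low_detail_tile(tile_img, threshold=4.0):
    # Accumulate the gradient magnitude over the tile and compare it to a threshold.
    gx = cv2.Sobel(tile_img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(tile_img, cv2.CV_32F, 0, 1)
    grad_accum = float(np.mean(np.sqrt(gx * gx + gy * gy)))
    return grad_accum < threshold   # True -> motion vector considered unreliable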
The electronic device 101 may use the motion vectors from the motion vector estimation 240 to calculate a global geometric transformation that can be described by an affine matrix. Although the transformation is described here by an affine matrix, any type of global geometric transformation may be applied. The affine matrix preserves straight lines in the two-dimensional image. A motion vector is referred to as a "large motion vector" if its distance from the prediction of the global affine matrix exceeds a threshold. If this threshold is too small, the fine alignment result approaches the global registration; as the threshold is increased, the strength of the local alignment increases correspondingly.
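The following sketch shows one way the global affine fit and the "large motion vector" test could be performed; the use of cv2.estimateAffine2D and the 8-pixel residual threshold are illustrative choices, not values taken from this disclosure.

import cv2
import numpy as np

def flag_large_motion_vectors(src_pts, dst_pts, threshold=8.0):
    # src_pts: tile centers in the reference image, shape (N, 2), float32.
    # dst_pts: the same centers shifted by their motion vectors, shape (N, 2).
    affine, _ = cv2.estimateAffine2D(src_pts, dst_pts)    # robust 2x3 affine fit
    ones = np.ones((len(src_pts), 1), np.float32)
    predicted = np.hstack([src_pts, ones]) @ affine.T     # where the affine sends each center
    residual = np.linalg.norm(dst_pts - predicted, axis=1)
    is_large = residual > threshold    # far from the global model -> outlier
    return is_large, affine

Vectors flagged as outliers would then be dropped from the motion vector map (or replaced by the affine prediction), mirroring the threshold trade-off described above.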
In image 410, local alignment has been performed but outlier removal has not yet been applied. Although the arm appears well aligned, with only slight misalignment of the fingers and arm, the background is significantly distorted. The straight edges of the road are broken because the local alignment process struggles where the image lacks detail. When an edge is detected, the outlier removal process straightens it, as shown in image 415, although image 415 still contains artifacts due to motion.
After outlier removal within the depth-free regions, the electronic device 101 may apply outlier removal within the motion regions. As can be seen in image 420, the motion artifacts in the arm and fingers are corrected. Because the fingers are small, the motion may not be corrected completely; there is a trade-off between the accuracy of small details such as the fingers and computational complexity.
Fig. 5A and 5B illustrate an example structure-guided refinement operation 255 according to this disclosure. In particular, fig. 5A illustrates a quadratic mesh of the final image generated using the structure-guided refinement 250 illustrated in fig. 2B. Fig. 5B shows an example pre-refinement image and a post-refinement image generated using the structure preserving refinement operation 255 shown in fig. 2B.
The electronic device 101 may preserve the image structure by applying quadratic constraints to the mesh vertices 520. The quadratic constraint can be defined by the following equation.
E = Ep + λ1Eg + λ2Es (1)
The local alignment term Ep may represent the error for each feature point 515 in the non-reference frame 510 (represented by a bilinear combination of the vertices 520) after it is warped to align with the corresponding feature point in the reference image frame. The feature points 515 are the centers of the tiles at the finest scale, and the corresponding feature points are the feature points in the reference image frame 505 shifted by the calculated motion vectors.
The similarity term Es represents the error in the similarity of the triangle coordinates 525 (formed by three vertices) after warping, meaning that the shape of each triangle should remain similar before and after warping for Es to stay low. The global constraint term Eg may push the "depth-free regions" and the "large motion regions" toward the global affine transformation.
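To make the roles of the three terms concrete, one plausible expansion is written below in LaTeX; the exact norms and the weights λ1 and λ2 are assumptions for illustration, not expressions quoted from this disclosure. Here V(p_i) denotes the bilinear combination of the four warped vertices enclosing tile center p_i, m_i its motion vector, A the global affine matrix, Ω the set of vertices in depth-free or large-motion regions, v_j a mesh vertex, v'_j its warped position, and S(·,·) the position of a vertex reconstructed from the other two vertices of its triangle under a similarity transform.

\begin{aligned}
E   &= E_p + \lambda_1 E_g + \lambda_2 E_s,\\
E_p &= \sum_{i} \bigl\| V(\mathbf{p}_i) - (\mathbf{p}_i + \mathbf{m}_i) \bigr\|^2,\\
E_g &= \sum_{j \in \Omega} \bigl\| \mathbf{v}'_j - A\,\mathbf{v}_j \bigr\|^2,\\
E_s &= \sum_{k} \bigl\| \mathbf{v}'_k - \mathcal{S}\!\left(\mathbf{v}'_{k_1}, \mathbf{v}'_{k_2}\right) \bigr\|^2.
\end{aligned}

Because each term is quadratic in the warped vertex positions, minimizing E over those positions is a linear least-squares problem.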
As can be seen in the pre-refinement image, artifacts 535 are still present. The bulge on the knuckle area of the hand should not appear in the image; it is an artifact of local alignment. There is also an artifact 535 on the sill edge at the bottom of the window. In the refined image, these "bulge" artifacts 535 are corrected.
Although fig. 3 to 5B illustrate various examples of the image alignment operation, various changes may be made to fig. 3 to 5B. For example, fig. 3-5B are merely intended to illustrate examples of the types of results that may be obtained using the approaches described in this disclosure. Images of scenes can vary widely, and the results obtained using the approaches described in this patent document can likewise vary widely depending on the circumstances.
Fig. 6 and 7 illustrate example enhancement results 600, 700 of the image alignment operation 215 according to the present disclosure. The embodiments of results 600 and results 700 shown in fig. 6 and 7 are for illustration only. Fig. 6 and 7 do not limit the scope of the present disclosure to any particular result of the image alignment operation.
Fig. 6 illustrates an example ghosting artifact correction 600 for a ghosting artifact 615 in accordance with the present disclosure. The images capture one side of a person's head with a background seen through a window. Several buildings are visible in the background, and above the buildings light appears to be reflected in the window. Ghosting artifacts appear around the person's hair in the globally aligned image 605. As shown in the locally aligned image 610, the ghosting artifacts can be significantly reduced (if not completely corrected) using the local alignment method of this disclosure.
FIG. 7 illustrates an example blending problem correction 700 according to this disclosure. The blending problems that remain in the globally aligned image 705 can be seen in the trees of the globally aligned blend map 715 and the globally aligned image 705. The locally aligned blend map 720 retains more interpretable detail in the trees, and the locally aligned image 710 has fewer blending problems than the globally aligned image 705.
FIG. 8 illustrates an example method 800 for efficient regularized image alignment for multi-frame fusion in accordance with this disclosure. For ease of explanation, the method 800 shown in fig. 8 is described as involving performing the process 200 shown in fig. 2A using the electronic device 101 shown in fig. 1. However, the method 800 shown in fig. 8 may be used in any suitable system in conjunction with any other suitable electronic device.
In operation 805, the electronic device 101 may receive a reference image and a non-reference image. "Receive" in this context may refer to capturing the images using an image sensor, receiving them from an external device, or loading them from memory. The reference image and the non-reference image may be captured at different exposures using the same lens, at different exposures using different lenses, at different resolutions using different lenses, and so on. In other words, the reference image and the non-reference image may capture the same subject at different resolutions, exposures, offsets, and the like.
In operation 810, the electronic device 101 may divide the reference image into a plurality of tiles. The reference image is divided into tiles so that each tile can be searched for in one or more non-reference frames. The tiles may all have the same size, meaning that the image may be divided evenly in both the horizontal and vertical directions.
In operation 815, the electronic device 101 may determine a motion vector map with local alignment for each tile using a Gaussian pyramid of the non-reference image. To determine the motion vector map, the electronic device 101 may divide a lower-resolution frame of the Gaussian pyramid of the non-reference image into a plurality of search tiles. Thereafter, for each of the plurality of tiles in the reference image, the electronic device 101 may locate, from the plurality of search tiles, a matching tile that corresponds to that tile and determine a low-resolution motion vector based on the change in position of the tile from the reference image relative to the matching tile in the lower-resolution frame of the non-reference image. The result is a low-resolution motion map generated from the low-resolution motion vectors of the plurality of tiles in the reference image. This sub-process may be performed for the lowest-resolution level of the Gaussian pyramid.
At each higher-resolution level, the electronic device 101 may divide a search area in the non-reference image into a plurality of second search tiles, where the search area corresponds to the matching tile found at the previous level. A second matching tile corresponding to a tile of the plurality of tiles is located from the second search tiles. The electronic device 101 determines a motion vector based on the change in position of the tile from the reference image relative to the second matching tile in the non-reference image. The motion vector map is generated based on the motion vectors of the plurality of tiles in the reference image.
The electronic device 101 may also determine outlier motion vectors in the motion vector map. A global affine matrix is calculated using the motion vectors in the motion vector map, and a difference is determined by comparing each motion vector to the prediction of the global affine matrix. A motion vector whose difference is greater than a threshold value is deemed a large motion vector, and the determined large motion vectors are removed from the motion vector map.
The electronic device 101 may also determine depth-free regions whose motion vectors are to be removed from the motion vector map. The non-reference image may be divided into a plurality of non-reference tiles. The gradient magnitude accumulation in each non-reference tile may be compared to a predefined threshold, and a depth-free region may be determined based on the gradient magnitude accumulation exceeding the predefined threshold. The electronic device 101 may then remove the motion vectors corresponding to the depth-free regions from the motion vector map.
To preserve the image structure, the electronic device 101 may impose a quadratic constraint on the grid vertices of the image structure corresponding to the non-reference image, where the quadratic constraint is defined by E = Ep + λ1Eg + λ2Es, with Ep being a local alignment term, Eg a global constraint term, and Es a similarity term.
In operation 820, the electronic device 101 may generate an output frame using the motion vector map together with the reference image and the non-reference image. After the image alignment process 215, the image blending operation generates a blend map using the alignment output. The electronic device 101 then performs a post-processing operation using the reference image, the non-reference image, the motion vector map, and the blend map to generate an output image.
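Putting the pieces together, the sketch below shows one way the final step could use the tile motion vector map: the per-tile vectors are expanded to a dense field, the non-reference frame is warped onto the reference grid, and a placeholder 50/50 blend stands in for this disclosure's blend-map generation and post-processing, which are not reproduced here.

import cv2
import numpy as np

def warp_non_reference(non_ref, tile_mv):
    # Expand the per-tile motion vector map to a dense per-pixel field and
    # warp the non-reference frame onto the reference image grid.
    h, w = non_ref.shape[:2]
    dense = cv2.resize(tile_mv, (w, h), interpolation=cv2.INTER_LINEAR)
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_x = xs + dense[..., 0]   # sample the non-reference frame at x + dx
    map_y = ys + dense[..., 1]   # and y + dy for each reference pixel
    return cv2.remap(non_ref, map_x, map_y, cv2.INTER_LINEAR)

def fuse(ref, non_ref, tile_mv):
    aligned = warp_non_reference(non_ref, tile_mv)
    return cv2.addWeighted(ref, 0.5, aligned, 0.5, 0.0)   # placeholder blend only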
Although FIG. 8 illustrates one example of a method 800 for efficient regularized image alignment for multi-frame fusion, various changes may be made to FIG. 8. For example, while shown as a series of steps, various steps in FIG. 8 could overlap, occur in parallel, occur in a different order, or occur any number of times.
While the present disclosure has been described with reference to various exemplary embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.

Claims (15)

1. A method, comprising:
receiving a reference image and a non-reference image;
dividing the reference image into a plurality of slices;
determining, using an electronic device, a motion vector map based on a coarse-to-fine motion vector estimation; and
generating an output frame using the motion vector map together with the reference image and the non-reference image.
2. The method of claim 1, wherein determining the motion vector map comprises:
dividing a lower resolution frame of a Gaussian pyramid into a plurality of search tiles, the Gaussian pyramid corresponding to the non-reference image;
for each of the plurality of slices in the reference image:
locating matching tiles from the plurality of search tiles that correspond to tiles from the plurality of tiles; and
determining a low resolution motion vector based on a difference between the location of the slice in the reference image and the location of the matching tile in a lower resolution frame of the non-reference image; and
generating a low resolution motion map based on the low resolution motion vectors of the plurality of slices in the reference image.
3. The method of claim 2, wherein determining the motion vector map further comprises:
for each of the plurality of slices in the reference image:
dividing a search area in the non-reference image into a plurality of second search tiles, wherein the search area corresponds to the matching tile,
locating a second matching tile corresponding to the tile of the plurality of tiles from the plurality of second search tiles, and
determining a motion vector based on a difference between the location of the slice in the reference image and the location of the second matching tile in the non-reference image; and
generating the motion vector map based on the motion vectors of the plurality of slices in the reference image.
4. The method of claim 1, further comprising:
computing a global geometric transformation using motion vectors in the motion vector map;
determining a difference by comparing each of the motion vectors to the global geometric transform;
determining a large motion vector when the difference is greater than a threshold; and
removing the determined large motion vector from the motion vector map.
5. The method of claim 4, further comprising:
dividing the non-reference image into a plurality of non-reference slices;
comparing the gradient magnitude accumulation within the non-reference slice to a predefined threshold;
detecting a depth-free region based on the gradient magnitude accumulation exceeding the predefined threshold; and
removing a motion vector corresponding to the depth-free region from the motion vector map.
6. The method of claim 1, further comprising:
applying a quadratic constraint on grid vertices of the image structure corresponding to the non-reference image.
7. The method of claim 6, wherein the quadratic constraint is defined by:
E = Ep + λ1Eg + λ2Es,
wherein Ep is a local alignment term, Eg is a global constraint term, and Es is a similarity term.
8. An electronic device, comprising:
at least one sensor; and
at least one processor configured to:
receiving a reference image and a non-reference image;
dividing the reference image into a plurality of slices;
determining a motion vector map with local alignment for each slice based on a coarse-to-fine motion vector estimation; and
generating an output frame using the motion vector map together with the reference image and a non-reference image.
9. The electronic device of claim 8, wherein determining the motion vector map comprises:
dividing a lower resolution frame of a Gaussian pyramid into a plurality of search tiles, the Gaussian pyramid corresponding to the non-reference image;
for each of the plurality of slices in the reference image:
locating matching tiles from the plurality of search tiles that correspond to tiles from the plurality of tiles; and
determining a low resolution motion vector based on a difference between a location of the slice from the reference image and a location of the matching tile in a lower resolution frame of the non-reference image; and
generating a low resolution motion map based on the low resolution motion vectors of the plurality of slices in the reference image.
10. The electronic device of claim 9, wherein determining the motion vector map further comprises:
for each of the plurality of slices in the reference image:
dividing a search area in the non-reference image into a plurality of second search tiles, wherein the search area corresponds to the matching tile,
locating a second matching tile corresponding to the tile of the plurality of tiles from the plurality of second search tiles, and
determining a motion vector based on a difference between the location of the slice in the reference image and the location of the second matching tile in the non-reference image; and
generating the motion vector map based on the motion vectors of the plurality of slices in the reference image.
11. The electronic device of claim 8, wherein the at least one processor is further configured to:
computing a global geometric transformation using motion vectors in the motion vector map;
determining a difference by comparing each of the motion vectors to the global geometric transform;
determining a large motion vector when the difference is greater than a threshold; and
removing the determined large motion vector from the motion vector map.
12. The electronic device of claim 11, wherein the at least one processor is further configured to:
dividing the non-reference image into a plurality of non-reference slices;
comparing the gradient magnitude accumulation within the non-reference slice to a predefined threshold;
detecting a depth-free region based on the gradient magnitude accumulation exceeding the predefined threshold; and
removing a motion vector corresponding to the depth-free region from the motion vector map.
13. The electronic device of claim 8, wherein the at least one processor is further configured to:
applying a quadratic constraint on grid vertices of the image structure corresponding to the non-reference image.
14. The electronic device of claim 13, wherein the quadratic constraint is defined by:
E = Ep + λ1Eg + λ2Es,
wherein Ep is a local alignment term, Eg is a global constraint term, and Es is a similarity term.
15. A machine-readable medium comprising instructions that when executed cause at least one processor of an electronic device to perform operations corresponding to one of the methods of claims 1-7.
CN202080054064.3A 2019-08-06 2020-07-30 Apparatus and method for efficient regularized image alignment for multi-frame fusion Pending CN114503541A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US201962883306P 2019-08-06 2019-08-06
US62/883,306 2019-08-06
US16/727,751 US11151731B2 (en) 2019-08-06 2019-12-26 Apparatus and method for efficient regularized image alignment for multi-frame fusion
US16/727,751 2019-12-26
KR10-2020-0094683 2020-07-29
KR1020200094683A KR20210018084A (en) 2019-08-06 2020-07-29 Efficient Regularized Image Alignment for Multi-Frame Fusion
PCT/KR2020/010083 WO2021025375A1 (en) 2019-08-06 2020-07-30 Apparatus and method for efficient regularized image alignment for multi-frame fusion

Publications (1)

Publication Number Publication Date
CN114503541A (en) 2022-05-13

Family

ID=74503208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080054064.3A Pending CN114503541A (en) 2019-08-06 2020-07-30 Apparatus and method for efficient regularized image alignment for multi-frame fusion

Country Status (3)

Country Link
EP (1) EP3956863A4 (en)
CN (1) CN114503541A (en)
WO (1) WO2021025375A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344821B (en) * 2021-06-29 2022-10-21 展讯通信(上海)有限公司 Image noise reduction method, device, terminal and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3277418B2 (en) * 1993-09-09 2002-04-22 ソニー株式会社 Apparatus and method for detecting motion vector
US6173087B1 (en) * 1996-11-13 2001-01-09 Sarnoff Corporation Multi-view image registration with application to mosaicing and lens distortion correction
US8111300B2 (en) * 2009-04-22 2012-02-07 Qualcomm Incorporated System and method to selectively combine video frame image data

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6809758B1 (en) * 1999-12-29 2004-10-26 Eastman Kodak Company Automated stabilization method for digital image sequences
US20070242900A1 (en) * 2006-04-13 2007-10-18 Mei Chen Combining multiple exposure images to increase dynamic range
CN102246203A (en) * 2008-12-19 2011-11-16 高通股份有限公司 System and method to selectively combine images
CN102693538A (en) * 2011-02-25 2012-09-26 微软公司 Global alignment for high-dynamic range image generation
US20160035104A1 (en) * 2013-03-18 2016-02-04 Fotonation Limited A method and apparatus for motion estimation
CN109997351A (en) * 2016-12-22 2019-07-09 华为技术有限公司 Method and apparatus for generating high dynamic range images

Also Published As

Publication number Publication date
EP3956863A1 (en) 2022-02-23
WO2021025375A1 (en) 2021-02-11
EP3956863A4 (en) 2022-06-08

Similar Documents

Publication Publication Date Title
US11107205B2 (en) Techniques for convolutional neural network-based multi-exposure fusion of multiple image frames and for deblurring multiple image frames
US11151731B2 (en) Apparatus and method for efficient regularized image alignment for multi-frame fusion
US10944914B1 (en) System and method for generating multi-exposure frames from single input
US11503221B2 (en) System and method for motion warping using multi-exposure frames
WO2018176925A1 (en) Hdr image generation method and apparatus
US11095829B2 (en) Apparatus and method for high dynamic range (HDR) image creation of dynamic scenes using graph cut-based labeling
US10911691B1 (en) System and method for dynamic selection of reference image frame
KR20200023651A (en) Preview photo blurring method and apparatus and storage medium
US11430094B2 (en) Guided multi-exposure image fusion
US10554890B1 (en) Apparatus and method for generating low-light images with improved bokeh using mobile electronic device
US11615510B2 (en) Kernel-aware super resolution
US11503266B2 (en) Super-resolution depth map generation for multi-camera or other environments
US11107191B2 (en) Apparatus and method for detail enhancement in super-resolution imaging using mobile electronic device
US20220207655A1 (en) System and method for synthetic depth-of-field effect rendering for videos
US11200653B2 (en) Local histogram matching with global regularization and motion exclusion for multi-exposure image fusion
US11276154B2 (en) Multi-frame depth-based multi-camera relighting of images
US11418766B2 (en) Apparatus and method for chroma processing for multi-frame fusion
US11094041B2 (en) Generation of bokeh images using adaptive focus range and layered scattering
CN114503541A (en) Apparatus and method for efficient regularized image alignment for multi-frame fusion
US20230034109A1 (en) Apparatus and method for interband denoising and sharpening of images
CN112351212B (en) Local histogram matching with global regularization and motion exclusion for multi-exposure image fusion
US20240257324A1 (en) Machine learning segmentation-based tone mapping in high noise and high dynamic range environments or other environments
US20240087190A1 (en) System and method for synthetic data generation using dead leaves images
US20230035482A1 (en) Apparatus and method for combined intraband and interband multi-frame demosaicing
US20240185431A1 (en) System and method for ai segmentation-based registration for multi-frame processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination