CN114708173A - Image fusion method, computer program product, storage medium, and electronic device


Info

Publication number: CN114708173A
Application number: CN202210163334.5A
Authority: CN (China)
Prior art keywords: fusion, image, channel, frames, images
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 蒋霆, 李鑫鹏, 韩明燕, 林文杰, 蒋承知, 刘震, 刘帅成
Current Assignee: Beijing Kuangshi Technology Co Ltd; Beijing Megvii Technology Co Ltd
Original Assignee: Beijing Kuangshi Technology Co Ltd; Beijing Megvii Technology Co Ltd
Application filed by Beijing Kuangshi Technology Co Ltd and Beijing Megvii Technology Co Ltd
Priority to CN202210163334.5A
Publication of CN114708173A

Classifications

    • G06T5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction (G Physics; G06 Computing; G06T Image data processing or generation, in general; G06T5/00 Image enhancement or restoration)
    • G06T2207/20208: High dynamic range [HDR] image processing (G06T2207/00 Indexing scheme for image analysis or image enhancement; G06T2207/20 Special algorithmic details; G06T2207/20172 Image enhancement details)
    • G06T2207/20221: Image fusion; Image merging (G06T2207/20 Special algorithmic details; G06T2207/20212 Image combination)

Abstract

The application relates to the technical field of image processing and provides an image fusion method, a computer program product, a storage medium, and an electronic device. The image fusion method comprises: acquiring n frames of images to be fused whose exposure degrees are not completely the same; and fusing the channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and determining a fusion result image according to the at least one frame of channel fusion image. For the channel images belonging to the same target channel in the n frames of images to be fused, the channel fusion image is calculated through the following steps: determining n frames of online fusion masks corresponding to the n frames of channel images by using n lookup tables corresponding to the n frames of channel images; and performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image. Because the method calculates the online fusion masks directly by looking up one-dimensional tables, it can complete image fusion efficiently and has good real-time performance.

Description

Image fusion method, computer program product, storage medium, and electronic device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image fusion method, a computer program product, a storage medium, and an electronic device.
Background
When a high-brightness area lit by a strong light source coexists in a shooting scene with areas of relatively low brightness, such as shadows and backlit regions, the image collected by a camera often shows bright areas washed out to white by overexposure and dark areas crushed to black by underexposure, which seriously degrades image quality. A camera can only capture a limited span between the brightest and darkest areas of the same scene, and this span is commonly referred to as its "dynamic range".
High Dynamic Range (HDR) imaging is a technology for restoring as much scene detail as possible in an image when the captured scene contains strong brightness contrast. A typical application of this technology is to synthesize, with a specific fusion algorithm, multiple frames collected from the same scene at different exposure degrees into one HDR image, so that the HDR image shows good detail in both the highlights and the shadows.
However, existing fusion algorithms are relatively complex, so image fusion with them is too slow; they are essentially unusable in scenarios with high real-time requirements, which greatly limits the popularization of HDR technology.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image fusion method, a computer program product, a storage medium, and an electronic device, so as to address the above technical problems.
In order to achieve the above purpose, the present application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides an image fusion method, including: acquiring n frames of images to be fused whose exposure degrees are not completely the same, wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel; fusing the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and determining a fusion result image according to the at least one frame of channel fusion image; and, for the n frames of channel images belonging to the same target channel in the n frames of images to be fused, calculating the corresponding channel fusion image through the following steps: determining n frames of online fusion masks corresponding to the n frames of channel images by using n lookup tables corresponding to the n frames of channel images, wherein the pixel value at each position in each frame of channel image is mapped to the weight at the same position in the online fusion mask corresponding to that channel image according to the mapping relationship between pixel values and weights recorded in the lookup table corresponding to that channel image; and performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image.
Because the method maps the channel images of the target channel into online fusion masks directly by table lookup, the fusion weights (the online fusion masks) can be computed in a very short time, so image fusion can be completed efficiently.
In addition, the lookup table used in the method records only the mapping relationship between pixel values and weights, and each pixel value here is a pixel value in a single-channel image, i.e., a single number rather than a tuple of numbers; the lookup table is therefore a one-dimensional lookup table with a very simple structure and extremely fast lookup.
Moreover, the method uses n lookup tables, which fully accounts for the different mappings between pixel values and weights that may exist under different exposure degrees, so the resulting online fusion masks are more accurate, which helps improve the HDR effect of the fusion result image.
In an implementation manner of the first aspect, determining the n frames of online fusion masks corresponding to the n frames of channel images by using the n lookup tables corresponding to the n frames of channel images includes: down-sampling the n frames of channel images to obtain n frames of low-resolution channel images; determining n frames of low-resolution online fusion masks corresponding to the n frames of low-resolution channel images by using the n lookup tables, wherein the pixel value at each position in each frame of low-resolution channel image is mapped to the weight at the same position in the low-resolution online fusion mask corresponding to that image according to the mapping relationship between pixel values and weights recorded in the corresponding lookup table; and up-sampling the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
In this implementation, the channel images of the target channel are first down-sampled to reduce their resolution, low-resolution online fusion masks are then obtained by table lookup, and the online fusion masks at the original resolution are finally obtained by up-sampling. Because the table lookup is performed on low-resolution images, mask computation is significantly faster, which improves the efficiency of image fusion.
In an implementation manner of the first aspect, the upsampling the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as that of the n frames of channel images includes: and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
When computing the original-resolution online fusion masks from the low-resolution ones, this implementation not only up-samples but also applies edge-preserving smoothing filtering (for example, guided filtering), which helps suppress the mask quality degradation caused by up-sampling and improves the precision of the fusion masks.
In an implementation manner of the first aspect, before the fusing of the n frames of channel images belonging to the same channel in the n frames of images to be fused, the method further includes: traversing each pixel value x in the pixel value range and determining the mapping relationship between pixel values and weights recorded in the n lookup tables by performing the following steps: acquiring n identical frames of monochrome images, each of which contains only one channel and in which every pixel has the value x; extracting a basic feature map of the n frames of monochrome images with the backbone network of a pre-trained neural network model and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of offline fusion masks corresponding to the n frames of monochrome images with the n branch networks of the neural network model, wherein each branch network computes, from one frame of sub-feature map, the offline fusion mask corresponding to one frame of monochrome image, and each frame of offline fusion mask contains the weights for fusing its corresponding monochrome frame; and determining, according to the weights in each frame of offline fusion mask, the weight corresponding to the pixel value x in the lookup table associated with that offline fusion mask, thereby determining the n weights corresponding to the pixel value x in the n lookup tables.
If the neural network model performed an identity mapping, a monochrome image fed into the model would be output unchanged. In practice the pre-trained neural network model is not an identity mapping, so the pixel values of the output image (the offline fusion mask) change relative to those of the original monochrome image, and this change reflects the mapping between pixel values and weights that the model has learned. The required lookup tables can therefore be constructed simply by storing this mapping, and the constructed lookup tables also stand in, to a certain extent, for the mapping modeled by the neural network.
Furthermore, if the neural network model were used directly to compute the online fusion masks, the computation would be highly accurate and the quality of the fusion result image (the HDR image) would be better, for two reasons. First, the neural network model is obtained through training rather than being a preset rule fixed by experience, so the computed weights are more robust, which helps avoid fusion result images that look washed out or lack detail when the exposure span of the images to be fused is large. Second, the neural network model is divided into a backbone network and branch networks: the backbone is shared by the n frames of channel images, which facilitates inter-frame information exchange, while each branch network computes the online fusion mask corresponding to one frame of channel image, so its parameters can be optimized for that frame. Both aspects raise the accuracy of the online fusion masks, and more accurate masks naturally yield a better fusion effect.
However, the structure of the neural network model is relatively complex, and image fusion with the model has poor real-time performance. Computing the online fusion masks with the lookup tables gives excellent real-time performance, and because the lookup tables constructed from the neural network model can substitute for the model to a certain extent, the masks remain sufficiently accurate; in other words, both the efficiency and the quality of image fusion are taken into account.
In an implementation manner of the first aspect, determining, according to the weights in each frame of offline fusion mask, the weight corresponding to the pixel value x in the lookup table associated with that offline fusion mask includes: taking the average of the weights in each frame of offline fusion mask as the weight corresponding to the pixel value x in the associated lookup table; or taking the weight that occurs most frequently in each frame of offline fusion mask as the weight corresponding to the pixel value x in the associated lookup table.
In this implementation, the mapping relationship between pixel values and weights in the lookup tables can be determined from the offline fusion masks in more than one way, which makes the method flexible.
In an implementation manner of the first aspect, after the determining n weights corresponding to the pixel value x in the n lookup tables, the method further includes: normalizing the n weights such that the sum of the n weights is 1.
In this implementation, the n weights corresponding to the same pixel value are normalized before being stored in the lookup tables, which unifies the value range of the weights for different pixel values and also makes it convenient to perform the subsequent image fusion directly with the looked-up weights.
In one implementation of the first aspect, the n frames of offline fusion masks have a lower resolution than the n frames of monochrome images.
In the above implementation, the lookup tables can still be constructed even if the offline fusion masks are low-resolution images (the resolution of an offline fusion mask depends on the design of the neural network model; some models include a down-sampling module and therefore output low-resolution offline fusion masks).
In one implementation of the first aspect, the neural network model is trained so that when n frames of single-channel images whose exposure degrees are not completely the same are input into the backbone network of the model, n frames of fusion masks corresponding to the n frames of single-channel images are output from the n branch networks of the model; each branch network outputs one corresponding frame of fusion mask, and each frame of fusion mask contains the weights for fusing its corresponding single-channel frame.
In this implementation, the neural network model is trained to predict fusion masks, so the lookup tables constructed from it can also compute fusion masks online fairly accurately; that is, the lookup tables take over the model's role in image fusion.
In an implementation manner of the first aspect, performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image includes: constructing a corresponding Gaussian pyramid from each frame of online fusion mask, giving n Gaussian pyramids in total, and constructing a corresponding Laplacian pyramid from each frame of channel image, giving n Laplacian pyramids in total, wherein the Gaussian pyramids and the Laplacian pyramids have the same number of layers and pyramid images on the same layer of the two kinds of pyramids have the same resolution; performing weighted fusion on the n Laplacian pyramids by using the n Gaussian pyramids to obtain a fused Laplacian pyramid, wherein each layer of pyramid image in the fused Laplacian pyramid is obtained by weighting and fusing the n frames of pyramid images on that layer of the n Laplacian pyramids with the n frames of pyramid images on that layer of the n Gaussian pyramids; and performing image reconstruction with the fused Laplacian pyramid to obtain the channel fusion image.
In this implementation, the channel images themselves are not fused linearly; instead, the Laplacian pyramids built from the channel images are fused linearly layer by layer. In a Laplacian pyramid, every layer, including the small top-level image, can be regarded as a feature map, and these feature maps represent image detail at different frequencies, because the detail lost at each down-sampling step during pyramid construction is different. Weighted fusion of the Laplacian pyramid images on the same layer is therefore equivalent to independent image fusion in the subspace of each frequency, which lowers the difficulty of fusion and yields a better result than fusing the original channel images directly, thereby improving the quality of the channel fusion image.
In an implementation manner of the first aspect, if each frame of image to be fused includes multiple channels and not all of them are target channels, the n frames of channel images belonging to the same non-target channel in the n frames of images to be fused are weighted and fused using the fusion masks calculated when the channel images of a target channel were fused.
In this implementation, the channel images of a non-target channel are fused directly with the fusion masks used for the channel images of a target channel, which simplifies the fusion process and improves fusion efficiency. For example, the target channel may be a relatively important channel of the image to be fused (such as the Y channel of a YUV image) and the non-target channels relatively minor channels (such as the U and V channels of a YUV image); in that case, even though the fusion masks used for the non-target channels are not calculated from their own channel images, they do not greatly affect the image quality of the fusion result image.
In a second aspect, an embodiment of the present application provides an image fusion apparatus, including: an image acquisition component configured to acquire n frames of images to be fused whose exposure degrees are not completely the same, wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel; and an image fusion component configured to fuse the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and to determine a fusion result image according to the at least one frame of channel fusion image, wherein, for the n frames of channel images belonging to the same target channel in the n frames of images to be fused, the corresponding channel fusion image is calculated through the following steps: determining n frames of online fusion masks corresponding to the n frames of channel images by using n lookup tables corresponding to the n frames of channel images, wherein the pixel value at each position in each frame of channel image is mapped to the weight at the same position in the online fusion mask corresponding to that channel image according to the mapping relationship between pixel values and weights recorded in the lookup table corresponding to that channel image; and performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image.
In a third aspect, an embodiment of the present application provides a computer program product, which includes computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, wherein the memory stores computer program instructions, and the computer program instructions, when read and executed by the processor, perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 illustrates the steps of an image fusion method provided by an embodiment of the present application;
FIG. 2 illustrates sub-steps that may be included in step S120 of FIG. 1;
FIG. 3 illustrates a specific workflow of the image fusion method provided by an embodiment of the present application;
FIG. 4 shows another specific workflow of the image fusion method provided by an embodiment of the present application;
FIG. 5 shows the construction processes of a Gaussian pyramid and a Laplacian pyramid, and the Laplacian reconstruction process;
FIG. 6 shows the construction flow of a lookup table in the image fusion method provided by an embodiment of the present application;
FIG. 7 illustrates the training flow of a neural network in the image fusion method provided by an embodiment of the present application;
FIG. 8 shows the functional components included in an image fusion apparatus provided by an embodiment of the present application;
FIG. 9 shows the structure of an electronic device provided by an embodiment of the present application.
Detailed Description
In recent years, research on artificial-intelligence-based technologies such as computer vision, deep learning, machine learning, image processing, and image recognition has made significant progress. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. It is a comprehensive discipline that involves a wide range of technologies, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of artificial intelligence, concerns how machines perceive the world; computer vision technologies generally include face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping, computational photography, and robot navigation and positioning. With the research and progress of artificial intelligence technology, it has been applied in many fields, such as security, city management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone imaging, cloud services, smart homes, wearable devices, unmanned driving, autonomous driving, smart medical care, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile Internet, live webcasts, beauty applications, medical beauty, and intelligent temperature measurement. The image fusion method in the embodiments of the present application likewise makes use of artificial intelligence technologies.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Fig. 1 illustrates the steps of an image fusion method provided by an embodiment of the present application. The method may be, but is not limited to being, performed by the electronic device shown in fig. 9; for the structure of the electronic device, reference may be made to the description of fig. 9 below. Referring to fig. 1, the image fusion method includes:
Step S110: acquire n frames of images to be fused whose exposure degrees are not completely the same.
Here n is an integer greater than 1, and the images to be fused are images that need to be fused with one another; the goal of the fusion is to obtain an image with an HDR effect. The exposure degree of an image to be fused can be quantified by some measure, for example an exposure compensation value (EV). Note that step S110 only requires that the exposure degrees of the n frames of images to be fused not be completely identical; it does not require them to be completely different from one another. For example, when n is 3, the 3 frames EV-1, EV0, EV+1 are a qualifying set of images to be fused; when n is 4, the 4 frames EV-1, EV0, EV0, EV+1 also qualify, and in the latter set the EV0 image appears twice, indicating that the EV0 image is important and should carry a higher proportion in the fusion result.
The present application does not limit how the n frames of images to be fused are obtained. For example, images captured in real time may be received from a camera of the electronic device (e.g., the camera's EV value is set to -1, 0, and 1 in turn to capture 3 frames continuously); or images to be fused that were stored in advance may be read directly from the memory of the electronic device, and so on. It should also be understood that the n frames of images to be fused depict the same scene; otherwise fusing them would be meaningless.
Step S120: fuse the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and determine a fusion result image according to the at least one frame of channel fusion image.
Each frame of image to be fused comprises at least one channel, and the number of the channels contained in each frame of image to be fused is the same. For example, the n frames of images to be fused may all be grayscale images, that is, all include 1 channel; for another example, the n frames of images to be fused may all be YUV images, that is, each includes 3 channels; for another example, the n frames of images to be fused may all be RGB images, i.e., each including 3 channels, and so on.
If each frame of image to be fused contains only one channel, the channel image of that channel is the image to be fused itself; in step S120, fusing the n frames of channel images is then simply fusing the n frames of images to be fused, and the resulting channel fusion image is the fusion result image.
If each frame of image to be fused contains two or more channels, image fusion is performed channel by channel to obtain as many channel fusion images as there are channels, and the channel fusion images are then stitched together according to the channels they belong to, yielding the fusion result image. For example, if the n frames of images to be fused are YUV images, the n channel images of the Y channel are fused into a Y-channel fusion image, the n channel images of the U channel into a U-channel fusion image, and the n channel images of the V channel into a V-channel fusion image; finally, the 3 frames of channel fusion images are stitched together as the new Y, U, and V channel images, giving the fusion result image, which is obviously still a YUV image.
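As an illustration only, the channel-by-channel split, fuse, and stitch flow described above might look like the following sketch; the images are assumed to be NumPy arrays of shape (H, W, C), and fuse_channel is a hypothetical callable standing in for the per-channel fusion of steps S121 to S122 (or S123):
```python
import numpy as np

def fuse_multichannel_images(images, fuse_channel):
    """Fuse n multi-channel frames channel by channel (a sketch, not the patent's code).
    images: list of n arrays of shape (H, W, C) showing the same scene at different
    exposure degrees; fuse_channel: hypothetical per-channel fusion callable."""
    n_channels = images[0].shape[2]
    fused_channels = []
    for c in range(n_channels):
        channel_images = [img[:, :, c] for img in images]   # n single-channel images
        fused_channels.append(fuse_channel(channel_images, channel_index=c))
    # Stitch the channel fusion images back together as the fusion result image.
    return np.stack(fused_channels, axis=2)
```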
There may be one or more target channels, and if the image to be fused has channels other than the target channels, those remaining channels may be called non-target channels; which channels are target channels can be decided in advance. For example, if the image to be fused is a grayscale image, its only channel is necessarily also the target channel; if the image to be fused is a YUV image, the Y channel may be chosen as the target channel, with U and V as non-target channels, or Y, U, and V may all be chosen as target channels, in which case there is no non-target channel.
The difference between target and non-target channels is that, for the n frames of channel images of each target channel, steps S121 to S122 in fig. 2 are performed to fuse them into one corresponding frame of channel fusion image, whereas for the n frames of channel images of each non-target channel (if any), step S123 in fig. 2 may be performed to fuse them into one corresponding frame of channel fusion image. Steps S121 to S123 are all sub-steps of step S120.
Steps S121 to S122 are the main fusion scheme proposed in the present application; they significantly improve the efficiency of image fusion, and some implementations of these steps also improve the fusion effect (mainly the HDR effect). Therefore, as an optional strategy for maximizing fusion efficiency, all channels of the image to be fused may be treated as target channels. Alternatively, if other factors are considered, only one or a few channels of the image to be fused may be taken as target channels (e.g., the Y channel of a YUV image) and the remaining channels as non-target channels (e.g., the U and V channels of a YUV image). For example, as described later, some implementations require training a neural network model in order to construct the lookup tables, and training one model per channel consumes considerable computational resources, so when computational resources are limited only a part of the channels of the image to be fused may be used as target channels.
The steps in fig. 2 are explained in detail below, and will be primarily described in conjunction with the flow shown in fig. 3.
Step S121: determine the n frames of online fusion masks corresponding to the n frames of channel images by using the n lookup tables corresponding to the n frames of channel images.
An online fusion mask can be regarded as a weight image: each pixel value in the online fusion mask represents the weight used when the corresponding frame of channel image is fused (hereinafter, for ease of distinction, the pixel values in an online fusion mask are collectively called weights), and the online fusion mask has the same resolution as the channel image. How the online fusion masks are used to fuse the channel images of the target channel is explained in step S122, and the meaning of "online" in "online fusion mask" is explained later.
The pixel value at each position in each frame of channel image is mapped, according to the mapping relationship between pixel values and weights recorded in the lookup table corresponding to that channel image, to the weight at the same position in the online fusion mask corresponding to that channel image. Thus, from one frame of channel image and its corresponding lookup table, one corresponding frame of online fusion mask can be computed, and n frames of online fusion masks are computed in total.
For example, in fig. 3 the pixel values in the channel images range over [0, 255], so each lookup table contains 256 entries indexed 0 to 255, one per possible pixel value, and the content of each entry is a number representing a weight. For the uppermost lookup table in fig. 3, entry 0 holds the weight 0.6, entry 1 holds the weight 0.5, and so on; each entry therefore records one mapping between a pixel value and a weight, such as 0 → 0.6 or 1 → 0.5. For the uppermost channel image of fig. 3, assuming the pixel value at coordinate (0,0) is 1 and the pixel value at coordinate (1,0) is 0, the corresponding lookup table gives a weight of 0.5 at position (0,0) and a weight of 0.6 at position (1,0) in the corresponding online fusion mask. The situation is similar for the remaining channel images in fig. 3.
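As a minimal sketch of the lookup described above (not the patent's code), the table can be stored as a one-dimensional array indexed by pixel value, and the per-pixel mapping then becomes a single vectorized indexing operation; the table contents below merely echo the two entries quoted from fig. 3:
```python
import numpy as np

# One 1-D lookup table per frame: index = pixel value (0..255), entry = weight.
# Only the two entries quoted above are filled in; a real table is built offline.
lut_frame0 = np.zeros(256, dtype=np.float32)
lut_frame0[0], lut_frame0[1] = 0.6, 0.5

def online_fusion_mask(channel_image, lut):
    """Map every pixel value of a single-channel image to a weight by table lookup."""
    # Fancy indexing performs the per-pixel lookup in one step, so the resulting
    # online fusion mask has the same resolution as the channel image.
    return lut[channel_image]          # channel_image: uint8 array of shape (H, W)
```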
Note that each frame of channel image corresponds to a different lookup table: because the exposure degree of each frame of channel image differs, it would not be appropriate to use the same mapping relationship when computing their online fusion masks.
How the n lookup tables used in step S121 are constructed is described in detail later. Since the pixel values of the n frames of channel images lie in the same range (they come from the same target channel of the images to be fused), any pixel value x in that range has one corresponding weight in each lookup table, i.e., n weights in total. For example, in fig. 3, n is 3; the weights corresponding to pixel value 0 in the 3 lookup tables are 0.6, 0.3, and 0.1, the weights corresponding to pixel value 1 are 0.5, 0.3, and 0.2, and so on.
Optionally, when constructing the n lookup tables, the n weights corresponding to the same pixel value may be normalized, i.e., made to sum to 1 with each weight lying in the interval [0, 1]. For example, the n lookup tables in fig. 3 have been normalized: for pixel value 0, 0.6 + 0.3 + 0.1 = 1; for pixel value 1, 0.5 + 0.3 + 0.2 = 1; and so on. Normalization unifies the value range of the weights and makes it convenient to perform the subsequent image fusion directly with the looked-up weights. When step S122 is described later, for simplicity, the weights in the online fusion masks are not assumed to be normalized.
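A possible normalization step, written as a sketch under the assumption that the n tables are stored as arrays of length 256 and that no pixel value has an all-zero weight set:
```python
import numpy as np

def normalize_luts(luts):
    """Normalize n lookup tables so that, for every pixel value x,
    luts[0][x] + ... + luts[n-1][x] == 1."""
    stacked = np.stack(luts, axis=0)                      # shape (n, 256)
    return stacked / stacked.sum(axis=0, keepdims=True)   # column-wise normalization
```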
In the above implementation of step S121, the corresponding online fusion masks are obtained by looking up the pixel values of the channel images directly. In another implementation, the channel images may first be down-sampled and the lookup then performed on the down-sampled images; that is, a down-sampling and up-sampling mechanism may be introduced into step S121, which proceeds as follows:
Step A: down-sample the n frames of channel images to obtain n frames of low-resolution channel images, as shown in fig. 4.
Note that "low resolution" here is relative, not absolute: it means a resolution lower than the original resolution of the channel images, and the term should be understood in the same way below.
Step B: determine the n frames of low-resolution online fusion masks corresponding to the n frames of low-resolution channel images by using the n lookup tables corresponding to the n frames of low-resolution channel images, as shown in fig. 4.
A low-resolution online fusion mask is similar to an original-resolution one and can likewise be regarded as a weight image, but because its resolution is lower, its weights no longer correspond one-to-one with the pixel values of the channel image. It is therefore inconvenient to fuse the channel images directly with the low-resolution online fusion masks, and the masks need to be sampled back up to the original resolution.
The pixel value at each position in each frame of low-resolution channel image is mapped, according to the mapping relationship between pixel values and weights recorded in the corresponding lookup table, to the weight at the same position in the low-resolution online fusion mask corresponding to that image. Thus, from one frame of low-resolution channel image and its corresponding lookup table, one corresponding frame of low-resolution online fusion mask can be computed, and n frames of low-resolution online fusion masks are computed in total.
The lookup tables used here are the same as those described above; the only differences are that the pixel values of the low-resolution channel images are used for the lookup and that the looked-up weights form the low-resolution online fusion masks, so the details are not repeated.
Step C: up-sample the n frames of low-resolution online fusion masks to obtain n frames of online fusion masks with the same resolution as the n frames of channel images, as shown in fig. 4.
If the down-sampling, lookup, and up-sampling operations are viewed as performed inside a "black box", then in terms of the end result, steps A to C still map the pixel values of each frame of channel image, via its corresponding lookup table, to the weights at the same positions in the corresponding online fusion mask.
This implementation performs the table lookup on low-resolution images, which speeds up mask computation and hence image fusion; the gain is especially noticeable when the resolution of the images to be fused is high and the down-sampling factor is large.
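Steps A to C might be sketched as follows, with plain bilinear up-sampling for step C (the edge-preserving variant is discussed next); the down-sampling factor is an illustrative assumption:
```python
import cv2
import numpy as np

def online_fusion_mask_via_lowres(channel_image, lut, scale=4):
    """Step A: down-sample; step B: table lookup at low resolution;
    step C: up-sample the mask back to the original resolution (a sketch)."""
    h, w = channel_image.shape
    low = cv2.resize(channel_image, (w // scale, h // scale),
                     interpolation=cv2.INTER_AREA)          # low-resolution channel image
    mask_low = lut[low]                                     # low-resolution online fusion mask
    return cv2.resize(mask_low, (w, h),
                      interpolation=cv2.INTER_LINEAR)       # online fusion mask
```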
Furthermore, the inventors found that the up-sampling operation degrades mask quality to a certain extent. In an alternative, therefore, edge-preserving smoothing filtering may also be applied when up-sampling the n frames of low-resolution online fusion masks, so as to improve the quality of the resulting n frames of online fusion masks. In fig. 4, the filtering operation is indicated at the position of the up-sampling operation.
Edge-preserving smoothing refers to a class of image filtering algorithms, such as bilateral filtering and guided filtering, that smooth image content while preserving edge information as much as possible. The filtering may be performed before the up-sampling: for example, the low-resolution online fusion mask may first be guided-filtered (with the low-resolution channel image as the guide map), and the filtered result then up-sampled to obtain the online fusion mask.
Alternatively, filtering and up-sampling may be performed together (here "together" means that the up-sampling and filtering processes are intertwined with no clear boundary between them, not that they run in parallel). For example, a Guided Filtering for joint Upsampling (GFU) module may be adopted. The GFU module takes 3 inputs, namely a low-resolution channel image (denoted x_l), a low-resolution online fusion mask (denoted m_l), and the original channel image (denoted x_h), and outputs an online fusion mask (denoted m_h). Inside the GFU module, two low-resolution mapping matrices A and B are computed from x_l and m_l; A and B are then up-sampled to obtain the two mapping matrices A and B at the original resolution; finally, with x_h as the guide map, the output is obtained by the element-wise formula m_h = A * x_h + B. Because the guide map in the GFU module is the higher-resolution x_h rather than x_l, the loss of edge information caused by up-sampling is greatly reduced.
Each frame of low-resolution online fusion mask can be processed by the GFU module to obtain one frame of online fusion mask. The GFU module can also be designed for batch processing: only one GFU module is needed, the n frames of low-resolution online fusion masks are fed into it at once, and it outputs the n frames of online fusion masks at once. Of course, even without the GFU module, the up-sampling and smoothing of the n frames of low-resolution online fusion masks can be batched.
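The paragraphs above only name the GFU module; one way it might be realized, following the standard guided-filter formulation, is sketched below. The window radius and the regularization term eps are illustrative assumptions, not values from the patent:
```python
import cv2
import numpy as np

def gfu(x_l, m_l, x_h, radius=2, eps=1e-4):
    """Guided Filtering for joint Upsampling (a sketch).
    x_l: low-resolution channel image, m_l: low-resolution online fusion mask,
    x_h: original-resolution channel image used as the guide map (all float32)."""
    box = (2 * radius + 1, 2 * radius + 1)
    mean = lambda img: cv2.blur(img, box)                  # local mean via box filter
    mean_x, mean_m = mean(x_l), mean(m_l)
    var_x = mean(x_l * x_l) - mean_x * mean_x
    cov_xm = mean(x_l * m_l) - mean_x * mean_m
    a = cov_xm / (var_x + eps)                             # low-resolution mapping matrix A
    b = mean_m - a * mean_x                                # low-resolution mapping matrix B
    a, b = mean(a), mean(b)                                # smooth the coefficients
    h, w = x_h.shape
    A = cv2.resize(a, (w, h), interpolation=cv2.INTER_LINEAR)   # up-sampled A
    B = cv2.resize(b, (w, h), interpolation=cv2.INTER_LINEAR)   # up-sampled B
    return A * x_h + B                                     # m_h = A * x_h + B
```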
It should be understood that the down-sampling mechanism and the up-sampling mechanism in the above-mentioned flow are used together, and if the down-sampling is not performed, the up-sampling is naturally not necessary.
Step S122: perform weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image.
First, taking the linear fusion manner as an example, the linear fusion can be expressed by the following formula:
Y* = Σ_{i=1}^{n} M_i ⊙ Y_i
where Y* denotes the channel fusion image, Y_i denotes the i-th frame of channel image, M_i denotes the i-th frame of online fusion mask (i.e., the online fusion mask corresponding to Y_i), and the symbol ⊙ denotes multiplication of the pixel values at corresponding positions; as noted above, an online fusion mask and its channel image have the same resolution, so this element-wise operation is well defined. Linear fusion is thus a process of taking the n frames of online fusion masks as weight images and computing a weighted sum of the n frames of channel images to obtain the channel fusion image; if the weights in the online fusion masks are normalized, the pixels of the channel fusion image are guaranteed to have the same value range as the pixels of the channel images, e.g., [0, 255].
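A minimal sketch of this linear fusion, assuming the n masks have already been computed and normalized:
```python
import numpy as np

def linear_fuse(channel_images, masks):
    """Weighted sum of n channel images with n online fusion masks (a sketch)."""
    fused = np.zeros_like(channel_images[0], dtype=np.float32)
    for y_i, m_i in zip(channel_images, masks):
        fused += m_i.astype(np.float32) * y_i.astype(np.float32)   # M_i ⊙ Y_i
    return np.clip(fused, 0, 255).astype(np.uint8)
```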
Of course, linear fusion is not the only weighted fusion method; pyramid fusion, for example, can also be used. Its steps are briefly described as follows:
Step a: construct a corresponding Gaussian pyramid from each frame of online fusion mask, giving n Gaussian pyramids in total, and construct a corresponding Laplacian pyramid from each frame of channel image, giving n Laplacian pyramids in total.
An image pyramid is a structure formed by stacking a group of images with different resolutions; the Gaussian pyramid and the Laplacian pyramid are two different types of image pyramid, and each layer of an image pyramid may be called a pyramid image. Fig. 5 shows the construction processes of a Gaussian pyramid and a Laplacian pyramid, as well as the Laplacian reconstruction process.
Referring to the left column of fig. 5, the white rectangle with the largest size at the upper left corner represents the original image, i.e., the layer-1 image of the Gaussian pyramid; Gaussian filtering and down-sampling the layer-1 image of the Gaussian pyramid gives the slightly smaller white rectangle, i.e., the layer-2 image of the Gaussian pyramid; Gaussian filtering and down-sampling the layer-2 image of the Gaussian pyramid gives the smallest white rectangle, i.e., the layer-3 (top-layer) image of the Gaussian pyramid. A 3-layer Gaussian pyramid is thus constructed.
Referring to the middle column of fig. 5, the layer-3 image of the Gaussian pyramid is taken directly as the layer-3 image of the Laplacian pyramid. Up-sampling (by the same factor as the down-sampling on the left) and Gaussian filtering the layer-3 image of the Laplacian pyramid yields a result image with the same resolution as the layer-2 image of the Gaussian pyramid, although in general it cannot fully restore that image; subtracting this result image from the layer-2 image of the Gaussian pyramid gives a residual image, which is the layer-2 image of the Laplacian pyramid, shown as the smaller dark rectangle. Likewise, up-sampling (by the same factor as the down-sampling on the left) and Gaussian filtering the layer-2 image of the Gaussian pyramid yields a result image with the same resolution as the layer-1 image of the Gaussian pyramid, although in general it cannot fully restore that image; subtracting this result image from the layer-1 image of the Gaussian pyramid gives a residual image, which is the layer-1 image of the Laplacian pyramid, shown as the larger dark rectangle. A 3-layer Laplacian pyramid is thus constructed.
It can be seen that, for the same original image, the Gaussian pyramid is constructed first and the Laplacian pyramid is then constructed from it; except for the top-layer pyramid image of the Laplacian pyramid, the pyramid images of the other layers are residual images, i.e., the detail information lost by the pyramid image on the same layer of the Gaussian pyramid during down-sampling.
Referring to the right column of fig. 5, reversing the construction of the Laplacian pyramid, the layer-3 image of the Laplacian pyramid is up-sampled and Gaussian filtered, and the result image is added to the layer-2 image of the Laplacian pyramid, which losslessly restores the layer-2 image of the Gaussian pyramid; the restored layer-2 image of the Gaussian pyramid is then up-sampled and Gaussian filtered, and the result image is added to the layer-1 image of the Laplacian pyramid, which losslessly restores the layer-1 image of the Gaussian pyramid, i.e., the original image. The original image can thus be reconstructed from the Laplacian pyramid.
Although the pyramid construction and image reconstruction above are illustrated with a 3-layer pyramid, the same holds for pyramids with 2 layers or with more than 3 layers.
Note that in step a the Gaussian pyramids are constructed from the online fusion masks (i.e., an online fusion mask plays the role of the original image) and the Laplacian pyramids are constructed from the channel images (i.e., a channel image plays the role of the original image). The two construction processes are therefore independent and can be performed sequentially or in parallel; this differs somewhat from the pyramid construction in fig. 5, where both the Gaussian pyramid and the Laplacian pyramid are built from the same original image.
In step a, all Gaussian pyramids have the same number of layers and their pyramid images on the same layer have the same resolution; all Laplacian pyramids likewise have the same number of layers and the same per-layer resolutions; moreover, any Gaussian pyramid and any Laplacian pyramid also have the same number of layers, and their pyramid images on the same layer also have the same resolution.
Following the notation above, if M_i denotes the i-th frame of online fusion mask, the n Gaussian pyramids can be written as {G_i^j}, where G_i^j denotes the layer-j pyramid image of the i-th Gaussian pyramid (built from M_i). If Y_i denotes the i-th frame of channel image, the n Laplacian pyramids can be written as {P_i^j}, where P_i^j denotes the layer-j pyramid image of the i-th Laplacian pyramid (built from Y_i). Here i and j take all values in their respective ranges: i runs over all integers from 1 to n and, on the assumption that the Gaussian and Laplacian pyramids have L layers (L being an integer greater than 1), j runs over all integers from 1 to L.
Step b: perform weighted fusion on the n Laplacian pyramids by using the n Gaussian pyramids to obtain the fused Laplacian pyramid.
Each layer of the fused Laplacian pyramid is obtained by weighting and fusing the n frames of pyramid images on that layer of the n Laplacian pyramids with the n frames of pyramid images on the same layer of the n Gaussian pyramids. For example, when linear fusion is employed, the fusion process can be expressed by the following formula:
Y^j = Σ_{i=1}^{n} G_i^j ⊙ P_i^j
where Y^j denotes the pyramid image at layer j of the fused Laplacian pyramid; the entire fused Laplacian pyramid can then be written as {Y^j}, with j taking all integers from 1 to L.
Step c: perform image reconstruction with the fused Laplacian pyramid to obtain the channel fusion image.
The method of reconstructing an image from a Laplacian pyramid was explained in the description of fig. 5; the result of the reconstruction is the channel fusion image to be computed in step S122.
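Steps a to c might be sketched as follows using OpenCV's pyrDown/pyrUp; the number of layers (levels=3, matching the fig. 5 example) is an illustrative assumption:
```python
import cv2
import numpy as np

def build_gaussian(img, levels):
    """Gaussian pyramid: repeated Gaussian filtering + down-sampling."""
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def build_laplacian(img, levels):
    """Laplacian pyramid: per-layer residuals, top Gaussian layer kept as-is."""
    gauss = build_gaussian(img, levels)
    lap = []
    for j in range(levels - 1):
        up = cv2.pyrUp(gauss[j + 1], dstsize=(gauss[j].shape[1], gauss[j].shape[0]))
        lap.append(gauss[j] - up)                          # residual image of layer j
    lap.append(gauss[-1])                                  # top layer
    return lap

def pyramid_fuse(channel_images, masks, levels=3):
    """Steps a-c: build pyramids, fuse layer by layer, reconstruct (a sketch)."""
    gauss_pyrs = [build_gaussian(m, levels) for m in masks]          # {G_i^j}
    lap_pyrs = [build_laplacian(y, levels) for y in channel_images]  # {P_i^j}
    fused = [sum(g[j] * p[j] for g, p in zip(gauss_pyrs, lap_pyrs))  # Y^j
             for j in range(levels)]
    out = fused[-1]
    for j in range(levels - 2, -1, -1):                    # Laplacian reconstruction
        out = cv2.pyrUp(out, dstsize=(fused[j].shape[1], fused[j].shape[0])) + fused[j]
    return np.clip(out, 0, 255).astype(np.uint8)           # channel fusion image
```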
By contrast, pyramid fusion does not apply the online fusion masks to the channel images directly through linear fusion; instead, it uses the Gaussian pyramids of the online fusion masks to perform layer-wise linear fusion on the Laplacian pyramids of the channel images. From the construction of the Laplacian pyramid, its per-layer pyramid images can be regarded as feature maps extracted from the channel image, representing image detail (residuals) at different frequencies in the channel image. Weighted fusion of the pyramid images on the same layer is therefore equivalent to independent image fusion in the subspace of each frequency, which lowers the difficulty of fusion and yields a better result than fusing the original channel images directly, thereby improving the quality of the channel fusion image. Of course, pyramid fusion has a higher algorithmic complexity than linear fusion, so the fusion mode can be chosen according to the requirements of a specific implementation.
Step S123: fuse the n frames of channel images using other methods to obtain a channel fusion image.
For the channel images belonging to the same non-target channel, other methods may be adopted for fusion; the "other methods" in step S123 are methods other than steps S121 to S122, such as any existing image fusion method for obtaining an HDR image, or image fusion with the neural network model described later.
Furthermore, the n frames of channel images belonging to the same non-target channel may also be weighted and fused directly with the fusion masks calculated when the channel images of a target channel were fused. For example, if the images to be fused are YUV images, the target channel is the Y channel, and the n frames of online fusion masks of the Y channel have been calculated by performing steps S121 to S122, then the n frames of channel images of the U channel or the V channel can be fused directly with those n frames of online fusion masks; the fusion itself follows step S122, e.g., linear fusion or pyramid fusion may be chosen. As another example, if there are two target channels, n frames of online fusion masks are calculated from the channel images of each of them, giving 2n frames of online fusion masks in total; the two masks corresponding to each frame are then averaged to obtain n frames of fusion masks, which are used to fuse the channel images of the non-target channels, as sketched below.
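A small sketch of this reuse, assuming linear fusion and one group of n masks per target channel (with a single target channel the averaging is a no-op):
```python
import numpy as np

def fuse_non_target_channel(channel_images, mask_groups):
    """Fuse a non-target channel (e.g. U or V) by reusing the online fusion masks of
    the target channel(s): average the masks frame by frame, then fuse linearly."""
    n = len(channel_images)
    masks = [np.mean([group[i] for group in mask_groups], axis=0) for i in range(n)]
    fused = sum(m.astype(np.float32) * y.astype(np.float32)
                for m, y in zip(masks, channel_images))
    return np.clip(fused, 0, 255).astype(np.uint8)
```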
The fusion mode simplifies the image fusion process of the non-target channel (avoids table look-up operation), and improves the fusion efficiency. For example, the target channel may select a relatively important channel (such as the Y channel of the YUV image) in the image to be fused, and the non-target channel may select a relatively minor channel (such as the U, V channel of the YUV image) in the image to be fused, so that the fusion mask used by the non-target channel does not have a great influence on the image quality of the fusion result image even if the fusion mask is not calculated according to the own channel image.
The image fusion method in fig. 1 and 2 is briefly summarized below: the method maps the channel image of the target channel into the online fusion mask directly in a table look-up mode, so that the calculation of the fusion weight (the online fusion mask) can be completed in a very short time, and the image fusion can be efficiently completed.
In addition, only the mapping relation between the pixel value and the weight is recorded in the lookup table used in the method, and the pixel value here is the pixel value in the single-channel image, i.e. a single numerical value rather than a plurality of numerical values, so that the lookup table used in the method is a one-dimensional lookup table (e.g. the array in fig. 3 or fig. 4, which is a one-dimensional data structure), the structure is very simple, and the lookup efficiency is very high.
Moreover, the method uses n lookup tables, one per frame, which accounts for the fact that the mapping between pixel values and weights may differ across exposure degrees; the resulting online fusion masks are therefore more accurate, which helps improve the quality of the fusion result image.
Finally, in some implementations, down-sampling and up-sampling mechanisms can be introduced to reduce the number of table lookups and further improve image fusion efficiency.
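To make the table lookup concrete, the following NumPy sketch (with illustrative function names, assuming an 8-bit pixel range) shows that each one-dimensional lookup table is simply an array of length 256: indexing it with a channel image yields the online fusion mask in a single vectorized operation, and linear fusion is then a weighted sum.

```python
import numpy as np

def online_masks_from_luts(channel_frames, luts):
    # channel_frames: n uint8 single-channel images of shape (H, W)
    # luts: n one-dimensional float32 arrays of length 256, luts[i][v] = weight for pixel value v
    return [lut[frame] for lut, frame in zip(luts, channel_frames)]   # fancy indexing = table lookup

def linear_fuse(channel_frames, masks):
    # masks are assumed normalized so that, at every pixel, the n weights sum to 1
    frames = [f.astype(np.float32) for f in channel_frames]
    fused = sum(m * f for m, f in zip(masks, frames))
    return np.clip(fused, 0, 255).astype(np.uint8)
```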
Next, on the basis of the above embodiment, the construction of the lookup tables is described. Since the value range of the pixel values in the channel image is generally known (e.g., [0, 255]), the key to constructing a lookup table is deciding which weight each pixel value should map to; only if the weights are reasonable can the lookup tables produce high-quality online fusion masks, and hence high-quality channel fusion images and fusion result images. The lookup tables may be built before step S120 is performed.
In one implementation, each pixel value x in the pixel value range is traversed (e.g., x takes every integer in [0, 255]), and for each x the following steps 1 to 4 are performed to determine the mapping between pixel values and weights recorded in the n lookup tables; once the mapping is determined, the lookup tables are naturally constructed.
Step 1: acquire n identical frames of a monochrome image.
Here, a monochrome image is an image that contains only one channel and whose pixel values are all x. For example, in the traversal round where x is 0, the monochrome image is a single-channel image whose pixel values are all 0; in the round where x is 1, it is a single-channel image whose pixel values are all 1; and so on. The resolution of the monochrome image may be the same as that of the images to be fused in step S110.
There are various ways to obtain a monochrome image: for example, it may be generated in real time when step 1 is performed; for another example, the monochrome image may be generated before step 1 is executed, and the monochrome image may be directly read out from the memory of the electronic device when step 1 is executed.
Step 2: extract a basic feature map of the n frames of monochrome images using the backbone network of a pre-trained neural network model, and split the basic feature map into n frames of sub-feature maps.
Step 3: calculate n frames of offline fusion masks corresponding to the n frames of monochrome images using the n branch networks of the neural network model.
Steps 2 and 3 are described together. The neural network model used here is trained in advance (before step 2 is executed); the training process is described later. If there are multiple target channels, one neural network model may be trained for each target channel in order to build that channel's n lookup tables, or, to save computational resources, several target channels may share one neural network model.
The specific type of neural network is not limited; it may be, for example, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), an Artificial Neural Network (ANN), or a combination of these. The neural network model includes at least two parts, a backbone network and n branch networks, with the tail of the backbone network connected to the head of each branch network; the backbone network and the branch networks are themselves neural networks with trainable parameters. Fig. 6 shows the overall structure of the neural network model.
The backbone network performs feature extraction on the n frames of monochrome images to obtain the corresponding basic feature map, which is then split into n frames of sub-feature maps; each frame of sub-feature map serves as the input of one branch network.
Referring to fig. 6, the white rectangles on the left represent the n frames of monochrome images (n = 3). The n frames can be input to the backbone network as one whole image of shape n × 1 × W × H, where n is the number of frames, 1 is the number of channels in each monochrome image, and W and H are the width and height of the monochrome image. The shape of the basic feature map (not shown in fig. 6) may be n × c × w × h, where n is the number of frames, c is the number of channels per frame (c is not necessarily 1), and w and h are the width and height of the basic feature map; w < W and h < H if the backbone network contains a down-sampling module, otherwise w = W and h = H. A splitting module (not shown in fig. 6) in the backbone network splits the basic feature map along the frame dimension, so each of the n frames of sub-feature maps (not shown in fig. 6) has shape 1 × c × w × h. Each frame of sub-feature map can be considered to correspond to one frame of monochrome image, although "corresponding" here does not mean that each sub-feature map is computed from only one frame of monochrome image.
Each branch network calculates one frame of offline fusion mask from its input frame of sub-feature map, so the n branch networks produce n frames of offline fusion masks in total. As described above, each sub-feature map corresponds to one monochrome image, so each frame of offline fusion mask calculated from it also corresponds to one frame of monochrome image.
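The following PyTorch sketch shows one possible backbone-plus-branches layout of the kind depicted in fig. 6; the layer types, channel counts, and the way inter-frame information is mixed are assumptions made for illustration, not the actual network of this application.

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Shared backbone followed by n per-frame branch networks (all sizes are illustrative)."""
    def __init__(self, n_frames=3, c=16):
        super().__init__()
        self.n, self.c = n_frames, c
        self.backbone = nn.Sequential(                 # mixes the n frames in shared layers
            nn.Conv2d(n_frames, n_frames * c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(n_frames * c, n_frames * c, 3, padding=1), nn.ReLU(),
        )
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid(),   # one single-channel mask per branch
            ) for _ in range(n_frames)
        ])

    def forward(self, x):                              # x: (n, 1, H, W), the n frames stacked as one input
        n, _, h, w = x.shape
        feats = self.backbone(x.view(1, n, h, w))      # basic feature map, here (1, n*c, h, w)
        feats = feats.view(n, self.c, h, w)            # reshape so it can be split along the frame dimension
        subs = torch.split(feats, 1, dim=0)            # n sub-feature maps, each (1, c, h, w)
        return [b(s) for b, s in zip(self.branches, subs)]   # n fusion masks, each (1, 1, h, w)
```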
The offline fusion mask can be regarded as a weight image: each pixel value in the mask represents a weight for fusing the corresponding frame of monochrome image (although, as will become clear below, the monochrome images are never actually fused). The resolution of the offline fusion mask may be the same as that of the monochrome image, or lower (e.g., when the backbone network contains a down-sampling module). For example, if the shape of the monochrome image is 1 × 1 × W × H, the shape of the offline fusion mask may be 1 × 1 × W × H or 1 × 1 × w × h (w < W and h < H). The meaning of "offline" in "offline fusion mask" is explained later.
The reason why the pixel values in the offline fusion mask can represent fusion weights is that the neural network model is trained precisely to predict such weights.
For example, the neural network model may be trained to: when n frames of single-channel images with incompletely identical exposure degrees are input into the backbone network in the model, output from the n branch networks in the model n frames of fusion masks corresponding to the n frames of single-channel images. Each branch network outputs one corresponding frame of fusion mask, and each frame of fusion mask contains the weights for fusing the corresponding frame of single-channel image.
"trained to" in the above example may be understood as a trained neural network model having inputs and outputs as described in the example. The n frames of single-channel images may be any images containing only one channel, for example, the n frames of channel images belonging to the same target channel in the n frames of images to be fused may be referred to as "n frames of single-channel images", and the n frames of fusion masks are weighted images for fusing (may adopt linear fusion, pyramid fusion, etc.) the n frames of single-channel images, and are functionally similar to the aforementioned n frames of online fusion masks, but in a different calculation manner.
Since the neural network model has the above property after training, when the n monochrome images are input into it as n single-channel images, the pixel values of the n frames of offline fusion masks output by the model necessarily represent weights for fusing the monochrome images.
The specific training process of the neural network model is described later.
Step 4: determine, according to the weights in each frame of offline fusion mask, the weight corresponding to the pixel value x in the lookup table corresponding to that frame of offline fusion mask, thereby determining the n weights corresponding to the pixel value x in the n lookup tables.
There are various ways to determine the weights, two of which are listed below:
Mode 1: determine the average of the weights in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to that frame of offline fusion mask.
For example, if a frame of offline fusion mask contains W × H weights in total, the average of these W × H weights can be used as the weight corresponding to the pixel value x in the lookup table corresponding to that frame.
Mode 2: determine the most frequently occurring weight in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to that frame of offline fusion mask.
For example, if a frame of offline fusion mask contains W × H weights in total and 2/3 of them are 0.6 (the weights may have been rounded in advance, e.g., to one decimal place), 0.6 can be used as the weight corresponding to the pixel value x in the lookup table corresponding to that frame.
It should be understood that, besides mode 1 and mode 2, there are other ways to determine the weight corresponding to the pixel value x in the lookup table; they are not enumerated here. Determining the weight corresponding to the pixel value x also fixes the correspondence between x and that weight: in fig. 6, for example, the weight corresponding to pixel value 0 is determined to be 0.6 from the topmost frame of offline fusion mask, so 0.6 only needs to be stored in the table entry with index 0 of the topmost lookup table, and that entry naturally expresses the mapping between 0 and 0.6.
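The two modes can be written as a small NumPy helper; the function name, the rounding precision, and the use of np.unique to find the most frequent weight are illustrative assumptions.

```python
import numpy as np

def weight_from_mask(mask, method="mean", decimals=1):
    # mask: one frame of offline fusion mask (a 2-D array of float weights)
    if method == "mean":                               # mode 1: average of all weights
        return float(mask.mean())
    # mode 2: most frequent weight, after rounding (e.g. to one decimal place)
    rounded = np.round(mask, decimals).ravel()
    values, counts = np.unique(rounded, return_counts=True)
    return float(values[np.argmax(counts)])
```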
Optionally, in step 4, smoothing filtering (e.g., the aforementioned edge-preserving smoothing filtering) may first be applied to the n frames of offline fusion masks to remove noise that may be present, and the weight corresponding to the pixel value x is then determined from the filtered offline fusion masks.
Optionally, after the n weights corresponding to the pixel value x are obtained in step 4, they may be normalized so that their sum is 1 and each weight lies in the interval [0, 1]. For example, the n lookup tables in fig. 6 have been normalized: for pixel value 0, 0.6 + 0.3 + 0.1 = 1; for pixel value 1, 0.5 + 0.3 + 0.2 = 1; and so on. Normalization unifies the value range of the weights, which makes it convenient to fuse images directly with the looked-up weights later on.
It was mentioned earlier that the resolution of the offline fusion mask may be the same as that of the monochrome image (an original-resolution offline fusion mask) or lower (a low-resolution offline fusion mask); either way, step 4 can be executed normally. Moreover, there is no necessary relationship between the resolution of the offline fusion masks used to build the lookup tables and whether a low-resolution online fusion mask is calculated in step S121. For example, the lookup tables may be built from low-resolution offline fusion masks while step S121 looks up the original-resolution channel images directly to obtain original-resolution online fusion masks (fig. 3); conversely, the lookup tables may be built from original-resolution offline fusion masks while step S121 first down-samples the channel images, looks up the low-resolution channel images to obtain low-resolution online fusion masks, and finally up-samples them to the original resolution (fig. 4).
In addition, it is also possible that the offline fusion masks calculated in step 3 are low-resolution and are first up-sampled to the original resolution before step 4 is performed. Edge-preserving smoothing filtering may likewise be applied during this up-sampling, similar to the operation mentioned when introducing step S121, and is not repeated here.
According to steps 1 to 4, each traversal of a pixel value x determines the weights corresponding to x in the n lookup tables; after all values of x have been traversed, all weights in the n lookup tables are determined, and at that point the n lookup tables are constructed.
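Putting steps 1 to 4 together, a sketch of the whole construction loop might look as follows; it assumes a pre-trained model like the MaskPredictor sketch above, an 8-bit pixel range, inputs normalized to [0, 1], mode 1 (the mean) for weight extraction, and the optional normalization of the n weights.

```python
import numpy as np
import torch

@torch.no_grad()
def build_luts(model, n_frames, h=256, w=256, n_values=256):
    model.eval()                                                  # the model is already trained
    luts = np.zeros((n_frames, n_values), dtype=np.float32)
    for x in range(n_values):
        # step 1: n identical monochrome frames with pixel value x (normalized to [0, 1])
        mono = torch.full((n_frames, 1, h, w), x / 255.0)
        masks = model(mono)                                       # steps 2-3: n offline fusion masks
        weights = np.array([float(m.mean()) for m in masks])      # step 4, mode 1: mean weight per mask
        luts[:, x] = weights / max(float(weights.sum()), 1e-6)    # optional normalization to sum 1
    return [luts[i] for i in range(n_frames)]
```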
The significance of constructing a look-up table using monochrome images is briefly analyzed as follows:
If the neural network model performed an identity mapping, a monochrome image fed into the model would come out unchanged. In practice, the pre-trained model does not perform an identity mapping, so the output image (the offline fusion mask) differs from the original monochrome image in its pixel values, and this change reflects the mapping between pixel values and weights that the model has learned. Storing this mapping is enough to construct the required lookup tables, and the constructed tables substitute, to a certain extent, for the mapping modeled by the neural network (each lookup table can be regarded as replacing the function realized by the backbone network plus one branch network).
Further, if the neural network model is directly adopted to calculate the online fusion mask in step S121, the calculation accuracy may be higher, and the quality of the fusion result image may also be better. The reason for this is that:
First, the neural network model is obtained by training rather than by preset, experience-based rules, so the calculated weights are more robust; this alleviates problems such as a gray, detail-poor fusion result image when the exposure span of the images to be fused is large. If exposure is expressed by EV value, a large exposure span means that, after the n frames of images to be fused are sorted by ascending or descending EV value, the EV difference between adjacent frames is large, or the EV difference between the first and last frames is large.
Second, the neural network model is divided into a backbone network and branch networks. The backbone network is shared by the n frames of channel images, which facilitates inter-frame information exchange and improves the accuracy of the online fusion masks; each branch network computes the online fusion mask for one frame of channel image, so its parameters can be optimized specifically for that frame, which also improves accuracy. The more accurate the online fusion masks, the better the fusion effect.
However, the neural network model has a relatively complex structure, so fusing images with the model directly gives poor real-time performance. Calculating the online fusion masks with the lookup tables instead gives excellent real-time performance, and since the lookup tables built from the model can substitute for it to a certain extent, the masks remain sufficiently accurate; in other words, both the efficiency and the quality of image fusion are taken into account.
Using monochrome images to construct the lookup tables has a further advantage: for each pixel value in the pixel value range, n identical monochrome frames are passed through the neural network model, so if the monochrome image has resolution W × H, each table entry is effectively derived from the mapping results of W × H identical pixel values, which makes the weights in the lookup tables highly reliable.
It should be understood that the lookup tables could also be constructed from sets of other single-channel images (each set containing n frames of non-monochrome images), provided that the pixel values in these images cover every pixel value in the pixel value range.
Further, in some implementations, the neural network model described above has an attention mechanism. When a person views an image, different parts of the image receive different degrees of attention: important parts get more attention and unimportant parts get less. Likewise, for a specific image processing task (e.g., classification, segmentation, fusion, recognition), features from different parts of an image differ in importance, and an attention mechanism is a technical means of expressing this difference: features in different parts of the image are given different weights, a larger weight indicating higher importance and a smaller weight indicating lower importance.
In the neural network model, an attention mechanism can be realized with a two-branch-plus-multiplier structure: one branch feeds the feature map to the multiplier, the other branch computes weights for the features in different parts of the feature map and feeds these weights to the multiplier, and the multiplier performs the weighting. Different ways of partitioning the feature map produce different types of attention mechanisms.
In the solution of the present application, the feature map has 4 dimensions. Taking the basic feature map mentioned above as an example and assuming its shape is n × c × w × h, n is the frame dimension, c is the channel dimension, w is the abscissa dimension, and h is the ordinate dimension. The dimensions w and h can also be regarded together as a single spatial dimension, in which case the feature map has 3 dimensions in total. The following three attention mechanisms are therefore possible in the solution of the present application:
channel attention, i.e., the attention mechanism of channel dimensions, refers to giving corresponding weights to data in different channels of the feature map;
frame attention, namely an attention mechanism of frame dimensions, means that corresponding weights are given to data in different frames of a feature map;
spatial attention, i.e., the attention mechanism of the spatial dimension, refers to giving corresponding weights to data at different spatial positions of the feature map.
In the neural network model, an attention mechanism may be introduced into the backbone network, into the branch networks, or into both. The backbone network may use at least one of the channel attention, frame attention, and spatial attention mechanisms, while a branch network may use at least one of the channel attention and spatial attention mechanisms. Frame attention need not be considered for a branch network, because the input sub-feature map of each branch network may contain only one frame (shape 1 × c × w × h), and by the definition of frame attention it is meaningless to weight data across different frames in that case.
Adding these attention mechanisms to the neural network improves the accuracy of the calculated offline fusion masks, so the weights in the lookup tables built from those masks are more reasonable.
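As an illustration of the two-branch-plus-multiplier structure, the following PyTorch sketch implements a generic attention block that can act on the frame dimension or the channel dimension of an n × c × h × w feature map; the global average pooling and the single linear layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DimAttention(nn.Module):
    """Two branches and a multiplier: one branch passes the feature map through,
    the other predicts one weight per slice of the chosen dimension."""
    def __init__(self, num_slices, dim):
        super().__init__()
        self.dim = dim                                   # 0 = frame dimension, 1 = channel dimension
        self.fc = nn.Sequential(nn.Linear(num_slices, num_slices), nn.Sigmoid())

    def forward(self, feats):                            # feats: (n, c, h, w)
        other = tuple(d for d in range(feats.dim()) if d != self.dim)
        pooled = feats.mean(dim=other)                   # global average pool: one value per slice
        weights = self.fc(pooled)                        # one weight per frame (or per channel)
        shape = [1] * feats.dim()
        shape[self.dim] = -1
        return feats * weights.view(*shape)              # the multiplier weights the features

# frame attention on a (3, 16, h, w) basic feature map: DimAttention(num_slices=3, dim=0)
# channel attention on the same feature map:            DimAttention(num_slices=16, dim=1)
```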
Next, on the basis of the above embodiment, how the above neural network model is trained will be briefly described. The model can adopt unsupervised training and also can adopt supervised training.
Taking supervised training as an example (see fig. 7), n training images with incompletely identical exposure degrees (all single-channel images) form one training sample. After the sample is input into the neural network model, the model computes n training fusion masks: each branch network computes the training fusion mask for one frame of training image, and each mask contains the weights for fusing that image. The training fusion mask can be understood by analogy with the online and offline fusion masks; the main difference is that it is generated in the training stage. Next, the n training images are weighted-fused with the n training fusion masks (linear fusion, pyramid fusion, etc. may be used) to obtain a training fusion image, which is analogous to the fusion result image but obtained during training. Then a prediction loss is computed from the training fusion image and the standard result image corresponding to the training sample; the form of the loss function is not limited. The standard result image, i.e., the ground truth, has a good HDR effect and can be prepared in advance, e.g., manually with the help of other image fusion algorithms. Finally, the parameters of the neural network model are updated according to the prediction loss until the model converges, e.g., with the back-propagation algorithm.
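A sketch of one supervised training step is given below; the L1 loss, the plain linear fusion of the training images, and the optimizer interface are assumptions made for brevity (the application leaves the loss form and the fusion mode open), and the model is assumed to return a list of n masks like the MaskPredictor sketch above.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, frames, target):
    # frames: (n, 1, H, W) single-channel training images with differing exposures
    # target: (1, 1, H, W) standard result image (ground truth) for this training sample
    optimizer.zero_grad()
    masks = model(frames)                                  # n training fusion masks, each (1, 1, H, W)
    weights = torch.cat(masks, dim=0)                      # (n, 1, H, W)
    weights = weights / weights.sum(dim=0, keepdim=True).clamp_min(1e-6)
    fused = (weights * frames).sum(dim=0, keepdim=True)    # training fusion image (linear fusion)
    loss = F.l1_loss(fused, target)                        # prediction loss vs. the standard result image
    loss.backward()                                        # back-propagation through branches and backbone
    optimizer.step()
    return loss.item()
```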
For unsupervised training, although no standard result image is available, the model parameters can still be updated by computing blind evaluation metrics such as MEF-SSIM; details are omitted here.
The trained neural network model may have the property that: when n frames of single-channel images with incompletely identical exposure degrees are input into a main network in the model, n frames of fusion masks corresponding to the n frames of single-channel images are output from n branch networks in the model.
Finally, on the basis of the above embodiments, the three stages that the solution of the present application may involve are introduced, namely the training stage, the offline stage, and the online stage:
a training stage: and training the neural network model by using the training images in the training set.
An off-line stage: and constructing a lookup table by using the monochrome image and the trained neural network model.
An online stage: and realizing the fusion of the images to be fused by utilizing the constructed lookup table.
The three stages are performed in order; once the lookup tables have been constructed in the offline stage, the neural network model no longer needs to be used in subsequent image fusion. Of course, if some images with poor fusion results are found during fusion, they can be used as training images to retrain the neural network model and rebuild the lookup tables. For the specific implementation of each stage, refer to the foregoing description; the qualifiers "online", "offline", and "training" added before "fusion mask" merely distinguish the masks generated in different stages.
Fig. 8 illustrates functional components included in the image fusion apparatus 200 according to an embodiment of the present disclosure. Referring to fig. 8, the image fusion apparatus 200 includes:
the image acquisition component 210 is used for acquiring n frames of images to be fused with incompletely same exposure degrees; wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel;
the image fusion component 220 is configured to fuse n channel images belonging to the same channel in the n to-be-fused images to obtain at least one channel fusion image, and determine a fusion result image according to the at least one channel fusion image; for n frames of channel images belonging to the same target channel in the n frames of images to be fused, calculating a channel fusion image corresponding to the n frames of channel images by the following steps: determining n frames of online fusion masks corresponding to the n frames of channel images by using n lookup tables corresponding to the n frames of channel images; the method comprises the steps that a pixel value in each frame of channel image is mapped to the weight of an online fusion mask corresponding to the frame of channel image at the same position according to the mapping relation between the pixel value recorded in a lookup table corresponding to the frame of channel image and the weight; and performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion images.
In one implementation of the image fusion apparatus 200, the determining, by the image fusion component 220, the n online fusion masks corresponding to the n channel images by using the n lookup tables corresponding to the n channel images includes: down-sampling the n frames of channel images to obtain n frames of low-resolution channel images; determining n frames of low-resolution online fusion masks corresponding to the n frames of low-resolution channel images by using n lookup tables corresponding to the n frames of low-resolution channel images; mapping the pixel value in each frame of low-resolution channel image into the weight of the low-resolution online fusion mask corresponding to the frame of low-resolution channel image at the same position according to the mapping relation between the pixel value recorded in the lookup table corresponding to the frame of low-resolution channel image and the weight; and performing up-sampling on the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
In an implementation manner of the image fusion apparatus 200, the up-sampling the n frames of low-resolution online fusion masks by the image fusion component 220 to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images includes: and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
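As an illustration of this down-sample, look-up, and up-sample path, the following OpenCV/NumPy sketch uses bilinear resizing and a bilateral filter as the edge-preserving smoothing step; the specific filter, scale factor, and function names are assumptions rather than the exact choices of this application.

```python
import cv2
import numpy as np

def online_masks_lowres(channel_frames, luts, scale=4):
    masks = []
    for frame, lut in zip(channel_frames, luts):             # frame: uint8 (H, W); lut: float32 array of length 256
        h, w = frame.shape
        small = cv2.resize(frame, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
        small_mask = lut[small].astype(np.float32)            # table lookup on the low-resolution image
        small_mask = cv2.bilateralFilter(small_mask, 5, 0.1, 5.0)   # edge-preserving smoothing
        masks.append(cv2.resize(small_mask, (w, h), interpolation=cv2.INTER_LINEAR))  # back to full resolution
    return masks
```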
In one implementation of the image fusion apparatus 200, the apparatus further comprises: the lookup table constructing component is configured to, before the image fusion component 220 fuses n channel images belonging to the same channel in the n images to be fused, perform the following steps: traversing each pixel value x in the pixel value range, and determining the mapping relation between the pixel values and the weights recorded in the n lookup tables by executing the following steps: acquiring the same n frames of monochrome images, wherein the monochrome images only comprise one channel and the pixel values of all pixels in the images are x; extracting a basic feature map of the n frames of monochrome images by using a trunk network in a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of offline fusion masks corresponding to the n frames of monochromatic images by using n branch networks in the neural network model; each branch network is used for calculating a frame of offline fusion mask corresponding to a frame of monochrome image according to a frame of sub-feature image, and each frame of offline fusion mask comprises a weight for fusing the corresponding frame of monochrome image; and determining the weight corresponding to the pixel value x in the lookup table corresponding to each frame of offline fusion mask according to the weight in each frame of offline fusion mask, and determining n weights corresponding to the pixel value x in the n lookup tables.
In an implementation manner of the image fusion apparatus 200, the determining, by the lookup table constructing component, a weight corresponding to the pixel value x in the lookup table corresponding to each frame of offline fusion mask according to the weight in the frame of offline fusion mask includes: determining the average value of the weights in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to the frame of offline fusion mask; or, determining the weight with the highest frequency of occurrence in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to the frame of offline fusion mask.
In one implementation of the image fusion apparatus 200, the lookup table construction component is further configured to: after n weights corresponding to the pixel value x in the n lookup tables are determined, the n weights are normalized so that the sum of the n weights is 1.
In one implementation of the image fusion apparatus 200, the resolution of the n-frame offline fusion mask is lower than the n-frame monochromatic image.
In one implementation of the image fusion apparatus 200, the neural network model is trained to: when n frames of single-channel images with incompletely identical exposure degrees are input into the main network in the model, n frames of fusion masks corresponding to the n frames of single-channel images are output from the n branch networks in the model; each branch network outputs a frame of corresponding fusion mask, and each frame of fusion mask comprises a corresponding weight for fusing a single-channel image.
In an implementation manner of the image fusion apparatus 200, the image fusion component 220 performs weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image, including: constructing a corresponding Gaussian pyramid according to each frame of online fusion mask to obtain n Gaussian pyramids in total, and constructing a corresponding Laplacian pyramid according to each frame of channel image to obtain n Laplacian pyramids in total; the layers of the Gaussian pyramid and the Laplacian pyramid are the same, and the resolution of the pyramid images in the same layer in the two pyramids is also the same; performing weighted fusion on the n laplacian pyramids by using the n gaussian pyramids to obtain fused laplacian pyramids; each layer of pyramid image in the fused laplacian pyramid is obtained by weighting and fusing n frames of pyramid images in the layer in the n laplacian pyramids by using the n frames of pyramid images in the layer in the n laplacian pyramids; and performing image reconstruction by using the fused Laplacian pyramid to obtain the channel fusion image.
In one implementation of the image fusion apparatus 200, if each frame of the image to be fused includes a plurality of channels, and the plurality of channels are not all target channels, the image fusion component 220 performs weighted fusion on n frames of channel images belonging to the same non-target channel in the n frames of images to be fused by using a fusion mask calculated when the channel images of the target channels are fused.
The implementation principle and technical effects of the image fusion apparatus 200 of the embodiment of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiment does not mention a detail, reference may be made to the corresponding content of the method embodiments.
Fig. 9 shows a structure of an electronic device 300 provided in an embodiment of the present application. Referring to fig. 9, the electronic device 300 includes: a processor 310, a memory 320, and a communication interface 330, which are interconnected and in communication with each other via a communication bus 340 and/or other form of connection mechanism (not shown).
The processor 310 includes one or more (only one is shown), which may be an integrated circuit chip having signal processing capability. The Processor 310 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Network Processor (NP), or other conventional processors; the Processor may also be a dedicated Processor, including a Graphics Processing Unit (GPU), a Neural-Network Processing Unit (NPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, and a discrete hardware component. Also, when there are multiple processors 310, some of them may be general-purpose processors, and another part may be special-purpose processors.
The Memory 320 includes one or more (Only one is shown in the figure), which may be, but not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 310, as well as possibly other components, may access, read, and/or write data to the memory 320. In particular, one or more computer program instructions may be stored in the memory 320, and may be read and executed by the processor 310 to implement the image fusion method provided by the embodiment of the present application.
Communication interface 330 includes one or more (only one shown) that may be used to communicate directly or indirectly with other devices for the purpose of data interaction. Communication interface 330 may include an interface to communicate wired and/or wireless.
It will be appreciated that the configuration shown in fig. 9 is merely illustrative, and the electronic device 300 may include more or fewer components than shown in fig. 9, or have a different configuration. For example, the electronic device 300 may further include a camera for capturing images or video; either a captured image or a frame of the captured video may be used as an image to be fused in step S110. As another example, the electronic device 300 may omit the communication interface 330 if it does not need to communicate with other devices.
The components shown in fig. 9 may be implemented in hardware, software, or a combination thereof. The electronic device 300 may be a physical device such as a cell phone, a video camera, a PC, a laptop, a tablet, a server, a robot, etc., or may be a virtual device such as a virtual machine, a container, etc. The electronic device 300 is not limited to a single device, and may be a combination of a plurality of devices or a cluster including a large number of devices.
The embodiment of the present application further provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor, the image fusion method provided by the embodiment of the present application is executed. The computer readable storage medium may be embodied as the memory 320 in the electronic device 300 in fig. 9, for example.
The embodiment of the present application further provides a computer program product, where the computer program product includes computer program instructions, and when the computer program instructions are read and executed by a processor, the image fusion method provided by the embodiment of the present application is executed.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. An image fusion method, comprising:
acquiring n frames of images to be fused with incompletely same exposure degrees; wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel;
fusing n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fused image, and determining a fused result image according to the at least one frame of channel fused image; for n frames of channel images belonging to the same target channel in the n frames of images to be fused, calculating a channel fusion image corresponding to the n frames of channel images by the following steps:
determining n frames of online fusion masks corresponding to the n frames of channel images by using n lookup tables corresponding to the n frames of channel images; the method comprises the steps that a pixel value in each frame of channel image is mapped to the weight of an online fusion mask corresponding to the frame of channel image at the same position according to the mapping relation between the pixel value recorded in a lookup table corresponding to the frame of channel image and the weight;
and performing weighted fusion on the n frames of channel images by using the n frames of online fusion masks to obtain the channel fusion image.
2. The image fusion method according to claim 1, wherein the determining the n-frame online fusion mask corresponding to the n-frame channel image by using the n lookup tables corresponding to the n-frame channel image comprises:
down-sampling the n frames of channel images to obtain n frames of low-resolution channel images;
determining n frames of low-resolution online fusion masks corresponding to the n frames of low-resolution channel images by using n lookup tables corresponding to the n frames of low-resolution channel images; mapping the pixel value in each frame of low-resolution channel image into the weight of the low-resolution online fusion mask corresponding to the frame of low-resolution channel image at the same position according to the mapping relation between the pixel value recorded in the lookup table corresponding to the frame of low-resolution channel image and the weight;
and performing up-sampling on the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
3. The image fusion method according to claim 2, wherein the up-sampling the n frames of low-resolution online fusion mask to obtain the n frames of online fusion mask with the same resolution as the n frames of channel images comprises:
and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution online fusion masks to obtain the n frames of online fusion masks with the same resolution as the n frames of channel images.
4. The image fusion method according to any one of claims 1-3, wherein before the fusing of the n channel images belonging to the same channel in the n images to be fused, the method further comprises:
traversing each pixel value x in the pixel value range, and determining the mapping relation between the pixel values and the weights recorded in the n lookup tables by executing the following steps:
acquiring the same n frames of monochrome images, wherein the monochrome images only comprise one channel and the pixel values of all pixels in the images are x;
extracting a basic feature map of the n frames of monochrome images by using a backbone network in a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps;
calculating n frames of offline fusion masks corresponding to the n frames of monochromatic images by using n branch networks in the neural network model; each branch network is used for calculating a frame of offline fusion mask corresponding to a frame of monochrome image according to a frame of sub-feature image, and each frame of offline fusion mask comprises a weight for fusing the corresponding frame of monochrome image;
and determining the weight corresponding to the pixel value x in the lookup table corresponding to each frame of offline fusion mask according to the weight in each frame of offline fusion mask, and determining n weights corresponding to the pixel value x in the n lookup tables.
5. The image fusion method of claim 4, wherein determining the weight corresponding to the pixel value x in the lookup table corresponding to each frame of offline fusion mask according to the weight in the frame of offline fusion mask comprises:
determining the average value of the weights in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to the frame of offline fusion mask; or,
and determining the weight with the highest frequency of occurrence in each frame of offline fusion mask as the weight corresponding to the pixel value x in the lookup table corresponding to the frame of offline fusion mask.
6. The image fusion method according to claim 4 or 5, wherein after said determining n weights corresponding to a pixel value x in said n lookup tables, said method further comprises:
normalizing the n weights such that the sum of the n weights is 1.
7. The image fusion method of any of claims 4-6, characterized in that the resolution of the n-frame offline fusion mask is lower than the n-frame monochromatic image.
8. The image fusion method of any one of claims 4-7, wherein the neural network model is trained to: when n frames of single-channel images with incompletely identical exposure degrees are input into the main network in the model, n frames of fusion masks corresponding to the n frames of single-channel images are output from the n branch networks in the model; each branch network outputs a frame of corresponding fusion mask, and each frame of fusion mask comprises a corresponding weight for fusing a frame of single-channel image.
9. The image fusion method according to any one of claims 1 to 8, wherein the performing weighted fusion on the n-frame channel images by using the n-frame online fusion mask to obtain the channel fusion image comprises:
constructing a corresponding Gaussian pyramid according to each frame of online fusion mask to obtain n Gaussian pyramids in total, and constructing a corresponding Laplacian pyramid according to each frame of channel image to obtain n Laplacian pyramids in total; the layers of the Gaussian pyramid and the Laplacian pyramid are the same, and the resolution of pyramid images in the same layer in the two pyramids is also the same;
performing weighted fusion on the n laplacian pyramids by using the n gaussian pyramids to obtain fused laplacian pyramids; each layer of pyramid image in the fused laplacian pyramid is obtained by weighting and fusing n frames of pyramid images in the layer in the n laplacian pyramids by using the n frames of pyramid images in the layer in the n laplacian pyramids;
and performing image reconstruction by using the fused Laplacian pyramid to obtain the channel fusion image.
10. The image fusion method according to any one of claims 1 to 9, wherein if each frame of the images to be fused includes a plurality of channels, and the plurality of channels are not all target channels, n frames of channel images belonging to the same non-target channel among the n frames of images to be fused are subjected to weighted fusion using a fusion mask calculated when the channel images of the target channels are fused.
11. A computer program product comprising computer program instructions which, when read and executed by a processor, perform the method of any one of claims 1 to 10.
12. A computer-readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the method of any one of claims 1-10.
13. An electronic device, comprising: a processor and a memory, the memory having stored therein computer program instructions which, when read and executed by the processor, perform the method of any one of claims 1-10.
CN202210163334.5A 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device Pending CN114708173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210163334.5A CN114708173A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210163334.5A CN114708173A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
CN114708173A true CN114708173A (en) 2022-07-05

Family

ID=82167589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210163334.5A Pending CN114708173A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN114708173A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953303A (en) * 2023-03-14 2023-04-11 山东省计算中心(国家超级计算济南中心) Multi-scale image compressed sensing reconstruction method and system combining channel attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination