CN115330658A - Multi-exposure image fusion method, device, equipment and storage medium - Google Patents

Multi-exposure image fusion method, device, equipment and storage medium

Info

Publication number
CN115330658A
CN115330658A
Authority
CN
China
Prior art keywords
image
images
enhanced
exposure
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211265514.0A
Other languages
Chinese (zh)
Other versions
CN115330658B (en)
Inventor
金�一
谭晓
陈怀安
屠韬
范鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211265514.0A priority Critical patent/CN115330658B/en
Publication of CN115330658A publication Critical patent/CN115330658A/en
Application granted granted Critical
Publication of CN115330658B publication Critical patent/CN115330658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T3/04
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10141 Special mode during image acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30221 Sports video; Sports image

Abstract

The application discloses a multi-exposure image fusion method, apparatus, device, and storage medium. At least two original images to be fused are acquired. Each original image is pre-enhanced by a trained pre-enhancement network, yielding pre-enhanced images whose exposure levels approach a reference exposure level. Taking the pre-enhanced image corresponding to a reference original image as a reference, homography matrix estimation is performed on each remaining pre-enhanced image to obtain its homography matrix. The original image corresponding to each homography matrix is then homography-transformed, producing transformed images that are background-aligned with the reference original image while retaining more image information. Finally, the reference original image and the transformed images are fused, with the fused image approaching the reference exposure level as the goal, so that a fused image that approaches the reference exposure level, is free of artifacts, and contains more image information can be obtained.

Description

Multi-exposure image fusion method, device, equipment and storage medium
Technical Field
The present application relates to the field of image fusion technologies, and in particular, to a method, an apparatus, a device, and a storage medium for multi-exposure image fusion.
Background
The dynamic range of existing digital cameras is limited and is usually smaller than that of the actual scene, so a single-frame image acquired by a digital camera often contains over-exposed or under-exposed areas, especially when acquired in a high-dynamic-range scene. These areas may suffer from high noise and information loss, making it difficult to acquire a clear digitized image and to perform subsequent image processing. However, in some scenarios, such as industrial monitoring systems, the detail information of the over-exposed and under-exposed areas is critical for decision-making.
To obtain the detail information of over-exposed or under-exposed areas, multiple multi-exposure images of the same scene can be fused to obtain a fused image with a proper exposure state for that scene. Conventional multi-exposure fusion algorithms can produce high-quality fused images in static scenes, but they suffer from problems such as severe artifacts when handling multi-exposure image fusion in dynamic scenes with background motion or subject motion, and the quality of the fused image is low.
Disclosure of Invention
In view of the foregoing problems, the present application provides a multi-exposure image fusion method, apparatus, device, and storage medium to accomplish the fusion of multi-exposure images in a dynamic scene and obtain a fused image that has a proper exposure state, is free of artifacts, and retains much of the original image information.
The specific scheme is as follows:
in a first aspect, a multi-exposure image fusion method is provided, including:
acquiring at least two original images to be fused;
performing pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhancement image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhancement sample images as labels for training, and the exposure level of the pre-enhancement sample images is a reference exposure level;
taking the pre-enhanced image corresponding to a reference original image as a reference, performing homography matrix estimation on each of the remaining pre-enhanced images to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule;
for each remaining pre-enhanced image, performing homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and fusing the reference original image and each transformed image to obtain a fused image by taking the fused image approaching the reference exposure level as a target.
Optionally, fusing the reference original image and each of the transformed images, including:
and fusing the reference original image and each transformed image by using a trained fusion network, wherein the fusion network is obtained by training with a plurality of incompletely aligned sample images of different exposure levels as training samples and the corresponding reference images as labels; the number of incompletely aligned sample images is consistent with the number of original images, the reference image is the pre-enhanced sample image corresponding to the reference sample image, and the reference sample image is one of the incompletely aligned sample images determined according to the preset rule.
Optionally, the fusion network includes an encoder, a merging layer, and a decoder connected in sequence;
the number of the encoders is consistent with the number of the original images, and the encoders are used for extracting the characteristics of the images input to the encoders;
the merging layer is used for merging the features extracted by the encoders to obtain merged features;
and the decoder is used for decoding according to the merging characteristics to obtain a fused image.
Optionally, each of the encoders includes a plurality of encoder hidden layers connected in sequence, the decoder includes a plurality of decoder hidden layers connected in sequence, and each encoder hidden layer is connected to a decoder hidden layer having the same number of input channels as the number of output channels thereof.
Optionally, the pre-enhanced sample image corresponding to the sample image of any exposure level is an image obtained by fusing a plurality of images with different exposure levels and completely aligned with the sample image.
Optionally, the pre-enhancement network includes a first pre-enhancement network and a second pre-enhancement network, where the first pre-enhancement network is obtained by training using sample images with exposure levels lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels, and the second pre-enhancement network is obtained by training using sample images with exposure levels not lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels;
the pre-enhancement processing is carried out on each original image based on the trained pre-enhancement network, and the pre-enhancement processing comprises the following steps:
performing pre-enhancement processing on each original image with exposure level lower than the reference exposure level based on the trained first pre-enhancement network;
and performing pre-enhancement processing on each original image with the exposure level not lower than the reference exposure level based on the trained second pre-enhancement network.
Optionally, the preset rule is to use the image with the lower exposure level among the at least two images as the reference image, or to use the image with the higher exposure level among the at least two images as the reference image.
In a second aspect, there is provided a multi-exposure image fusion apparatus, comprising:
an image-to-be-fused acquisition unit, configured to acquire at least two original images to be fused;
the pre-enhancement image generation unit is used for carrying out pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhancement image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhancement sample images as labels for training, and the exposure level of the pre-enhancement sample images is a reference exposure level;
a homography matrix estimation unit, configured to perform homography matrix estimation on each of the remaining pre-enhanced images by taking the pre-enhanced image corresponding to a reference original image as a reference, to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule;
an image homography transformation unit, configured to, for each remaining pre-enhanced image, perform homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and the image fusion unit is used for fusing the reference original image and each transformed image by taking the approach of the fused image to the reference exposure level as a target to obtain a fused image.
In a third aspect, there is provided a multi-exposure image fusion apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program and realizing the steps of the multi-exposure image fusion method.
In a fourth aspect, a storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the individual steps of the multi-exposure image fusion method described above.
By means of the above technical solution, the application provides a multi-exposure image fusion scheme: at least two original images to be fused are acquired; each original image is pre-enhanced by a trained pre-enhancement network to obtain a corresponding pre-enhanced image, the pre-enhancement network being trained with sample images of various exposure levels as training samples and the corresponding pre-enhanced sample images, whose exposure level is the reference exposure level, as labels; taking the pre-enhanced image corresponding to a reference original image as a reference, homography matrix estimation is performed on each of the remaining pre-enhanced images to obtain their homography matrices, the reference original image being one of the original images determined according to a preset rule; for each remaining pre-enhanced image, the corresponding original image is homography-transformed based on that homography matrix to obtain a transformed image; and the reference original image and the transformed images are fused, with the fused image approaching the reference exposure level as the goal, to obtain the fused image. By pre-enhancing the original images with the trained pre-enhancement network, pre-enhanced images whose exposure levels approach the reference exposure level can be obtained. Performing homography transformation on the original images with homography matrices estimated from the pre-enhanced images yields transformed images that are background-aligned with the reference original image while retaining more of the original image information. Without pre-enhancement, the homography transformation would rely on matrices estimated directly from the original images, and when the exposure levels of the original images differ greatly, background-aligned transformed images are difficult to obtain and the multi-exposure fusion effect suffers. By fusing the transformed images with the reference original image, with the fused image approaching the reference exposure level as the goal, a high-quality fused image that approaches the reference exposure level, is free of artifacts, and contains more image information can be obtained.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a multi-exposure image fusion method provided in an embodiment of the present application;
FIG. 2 illustrates a process diagram for deriving a transformed image from an original image;
fig. 3 is a schematic structural diagram of a pre-enhancement network provided in an embodiment of the present application;
FIG. 4 illustrates a set of static multi-exposure images and their pre-enhanced images;
FIG. 5 illustrates training data for a fusion network;
fig. 6 is a schematic structural diagram of a fusion network provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of another fusion network provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a multi-exposure image fusion apparatus provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a multi-exposure image fusion device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a multi-exposure image fusion method, apparatus, device, and storage medium, which can accomplish the fusion of multi-exposure images in a dynamic scene and obtain a fused image with a proper exposure state, no artifacts, and little loss of image information in over-exposed or under-exposed areas.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, referring to fig. 1, the multi-exposure image fusion method provided by the present application may include the following steps:
step S101, at least two original images to be fused are obtained.
It should be noted that the original images to be fused may have different exposure levels, and any two of them may exhibit some background motion and foreground motion; they may be two images of a scene taken from different angles, or two images taken from the same angle.
And S102, performing pre-enhancement processing on each original image based on the trained pre-enhancement network to obtain a corresponding pre-enhancement image.
Specifically, the pre-enhancement network is trained with sample images of various exposure levels as training samples and the corresponding pre-enhanced sample images as labels, the exposure level of a pre-enhanced sample image being the reference exposure level. Processing an original image with the trained pre-enhancement network yields a pre-enhanced image whose exposure level is close to the reference exposure level; homography matrix estimation based on the pre-enhanced images then produces a more accurate homography matrix capable of achieving global alignment, solving the alignment difficulty caused by large differences in the exposure levels of the images to be aligned. For example, the sample images used to train the pre-enhancement network may be a group of completely aligned images taken at various exposure levels by a camera at a fixed position and a fixed camera angle, where completely aligned means there is no foreground or background motion between the images. On this basis, the pre-enhanced sample image corresponding to a sample image may be the image in that group whose exposure level is the reference exposure level; that is, the sample image and its corresponding pre-enhanced sample image are two completely aligned images.
Step S103, taking the pre-enhanced image corresponding to the reference original image as a reference, and estimating a homography matrix of each rest of pre-enhanced images.
Specifically, the step S103 may include: taking the pre-enhanced image corresponding to a reference original image as a reference, performing homography matrix estimation on each of the remaining pre-enhanced images to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule.
It should be noted that the purpose of determining a reference original image and a pre-enhanced image corresponding to the reference original image is to determine a reference for background alignment and a direction of homography transformation, and further determine a scene condition and a subject pose in a final fused image.
And step S104, performing homography transformation on the original images except the reference original image based on the homography matrix to obtain a transformed image.
Specifically, the step S104 may include: for each remaining pre-enhanced image, performing homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image.
It should be noted that, after the original image is subjected to the pre-enhancement processing, a part of image information is lost, and if the fused object is a pre-enhanced image or a pre-enhanced image after the homography transformation, the image information is lost, and a high-quality fused image cannot be obtained. That is to say, a more accurate homography matrix can be estimated based on the pre-enhanced image, and the pre-enhanced image is obtained by pre-enhancing the original image through the trained pre-enhanced network, so the homography matrix can also be applied to the original image, and the homography transformation is performed on the original image based on the homography matrix, so that a transformed image with aligned background and containing more original image information can be obtained.
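As an illustration of steps S103-S104, the following Python sketch estimates a homography between two pre-enhanced images and then applies it to the corresponding original image. The patent does not prescribe a particular estimator; SIFT feature matching with a RANSAC-fitted homography, as used here via OpenCV, is one common choice and is an assumption of this sketch.

    import cv2
    import numpy as np

    def estimate_homography(pre_ref, pre_other):
        # Estimate the homography mapping pre_other onto pre_ref.
        # SIFT + ratio test + RANSAC is an assumed, conventional estimator.
        sift = cv2.SIFT_create()
        gray_ref = cv2.cvtColor(pre_ref, cv2.COLOR_BGR2GRAY)
        gray_other = cv2.cvtColor(pre_other, cv2.COLOR_BGR2GRAY)
        kp_ref, des_ref = sift.detectAndCompute(gray_ref, None)
        kp_other, des_other = sift.detectAndCompute(gray_other, None)
        matches = cv2.BFMatcher().knnMatch(des_other, des_ref, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        src = np.float32([kp_other[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_ref[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # needs >= 4 matches
        return H

    def warp_original(original_other, H, ref_shape):
        # Key point of step S104: the homography estimated on the
        # PRE-ENHANCED pair is applied to the ORIGINAL image, so the
        # transformed image keeps the original image information.
        h, w = ref_shape[:2]
        return cv2.warpPerspective(original_other, H, (w, h))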
And step S105, fusing the reference original image and each transformed image to obtain a fused image by taking the approach of the fused image to the reference exposure level as a target.
It should be noted that the fused image obtained by fusing the reference original image and each of the transformed images is the same as the reference original image in terms of scene, subject pose, and the like, but compared to the reference original image, the fused image uses all the image information of the original image to be fused, and includes a larger dynamic range, and the fused image cannot be obtained by changing only the exposure level of the reference original image.
For example, consider the original images to be fused shown in fig. 2, namely an under-exposed image and an over-exposed image. Without pre-enhancement, homography matrix 1 is estimated directly from the under-exposed and over-exposed images, and the resulting homography transformation cannot achieve background alignment, as shown by aligned over-exposed image 1 in fig. 2. Instead, the under-exposed and over-exposed images are pre-enhanced by the trained pre-enhancement network to obtain the pre-enhanced under-exposed image and the pre-enhanced over-exposed image, and homography matrix 2 is estimated from the two pre-enhanced images; applying homography matrix 2 to the over-exposed image yields a transformed image whose background is aligned with the original under-exposed image, as shown by aligned over-exposed image 2 in fig. 2.
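Tying steps S101-S105 together, a minimal end-to-end sketch follows. The names pre_enhance and fuse stand in for the trained pre-enhancement and fusion networks, and estimate_homography / warp_original are the helpers sketched above; all of these names are illustrative, not APIs defined by the patent.

    def fuse_multi_exposure(originals, pre_enhance, fuse, ref_index=0):
        # S102: pre-enhance every original toward the reference exposure level
        pre_enhanced = [pre_enhance(img) for img in originals]
        ref_original = originals[ref_index]
        ref_pre = pre_enhanced[ref_index]
        transformed = []
        for i, (orig, pre) in enumerate(zip(originals, pre_enhanced)):
            if i == ref_index:
                continue
            H = estimate_homography(ref_pre, pre)                            # S103
            transformed.append(warp_original(orig, H, ref_original.shape))   # S104
        # S105: fuse the reference original with the transformed images
        return fuse(ref_original, transformed)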
The application provides a multi-exposure image fusion scheme capable of fusing multiple multi-exposure images. Pre-enhancing the original images with the trained pre-enhancement network yields pre-enhanced images whose exposure levels approach the reference exposure level. Performing homography transformation on an original image with a homography matrix estimated from the pre-enhanced images yields a transformed image that is background-aligned with the reference original image and retains more of the original image information; without pre-enhancement, the homography transformation would rely on a matrix estimated directly from the original images, and when the exposure levels of the original images differ greatly, a background-aligned transformed image is difficult to obtain and the multi-exposure fusion effect suffers. Fusing each transformed image with the reference original image, with the fused image approaching the reference exposure level as the goal, produces a high-quality fused image that approaches the reference exposure level, is free of artifacts, and contains more image information.
Fig. 3 is a schematic structural diagram of a pre-enhancement network provided in an embodiment of the present application. With reference to fig. 3, the pre-enhancement network may include a first convolutional layer, three residual blocks, and a second convolutional layer connected in sequence. Specifically, the first convolutional layer may have a 3 × 3 convolution kernel, a stride of 1, a ReLU activation function, and output a feature map with 32 channels; each residual block may consist of two convolutional layers and one ReLU activation layer, with each convolutional layer having a 3 × 3 kernel, a stride of 1, and outputting a feature map with 32 channels; and the second convolutional layer has a 3 × 3 kernel, a stride of 1, a Sigmoid activation function, and outputs a feature map with 3 channels. The trained pre-enhancement network can convert input images of various exposure levels into pre-enhanced images approaching the reference exposure level. Further, to reduce the amount of computation in the pre-enhancement processing, the residual blocks may contain no Batch Normalization (BN) layer.
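The description above pins down the layer shapes, so the network can be sketched directly. The following PyTorch module is a minimal rendering of fig. 3 (3 × 3 convolutions with stride 1, 32 feature channels, three BN-free residual blocks, ReLU inside, Sigmoid at the output); a padding of 1, which preserves resolution, is our assumption.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # two 3x3 stride-1 convolutions with one ReLU; no BN layer
        def __init__(self, channels=32):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return x + self.conv2(self.relu(self.conv1(x)))

    class PreEnhanceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=1, padding=1), nn.ReLU(inplace=True))
            self.body = nn.Sequential(*[ResidualBlock(32) for _ in range(3)])
            self.tail = nn.Sequential(
                nn.Conv2d(32, 3, 3, stride=1, padding=1), nn.Sigmoid())

        def forward(self, x):  # x: (N, 3, H, W) tensor in [0, 1]
            return self.tail(self.body(self.head(x)))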
When constructing the training data set for the pre-enhancement network, a camera may be fixed on a tripod to take a group of images with a still foreground but different exposures. For example, a camera with a resolution of 6960 × 4640 may be used, 7 frames may be captured in succession, and the exposure level EV may be set to one of seven values, {±3, ±2, ±1, 0} or {±2, ±4/3, ±2/3, 0}.
In some embodiments provided herein, the pre-enhanced sample image corresponding to a sample image at any exposure level may be an image obtained by fusing several images that have different exposure levels and are completely aligned with the sample image.
It should be noted that within one image group S = {I_1, I_2, ..., I_n}, the images have no foreground or background motion and are completely aligned, so image fusion can be performed with a conventional multi-exposure image fusion algorithm to obtain a fused image I_R whose exposure level is the reference exposure level. In addition, when image fusion is performed with a conventional algorithm, the images may be down-sampled to a resolution of 3000 × 2000 to reduce the possibility of misalignment and similar issues. Illustratively, any image I_i (i = 1, 2, ..., n) in the group S may be taken as a training sample, with the fused image I_R as its label, yielding one training pair in the training data set of the pre-enhancement network.
On this basis, considering that existing multi-exposure fusion algorithms synthesize different images with varying quality, the fusion results of the existing algorithms on the same group of images can be compared, and the best fused image used as the label in the training data set of the pre-enhancement network. Specifically, the existing multi-exposure image fusion algorithms may include Mertens09, DSIFT, SPD-MEF, Kou17, MEFOpt, DEM, FMMEF, IFCNN, PMGI, MEFNet, MESPD, U2Fusion, and DPEMEF. Several volunteers compare the fused images produced by the above 13 algorithms pairwise, each volunteer picks the subjectively best fused image, and the label is the image chosen by the largest number of volunteers. Illustratively, FIG. 4 shows a group of static multi-exposure images and the corresponding pre-enhanced images.
To train the pre-enhancement network, images in the training data set can be randomly cropped to a resolution of 512 × 512 to serve as sample images, and the label corresponding to each sample image is the pre-enhanced sample image cropped to the same resolution in the same way. It should be noted that when performing pre-enhancement with the trained network, the resolution of the input image needs to be adjusted to 16n × 16n, where n is a positive integer; it does not need to be 512 × 512, because the convolution operation is translation invariant and the pre-enhancement network processes the whole image in the same way.
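A small helper can enforce the 16n × 16n constraint at inference time. Reflection padding, used below, is our assumption; the patent only requires that the adjusted resolution be a multiple of 16.

    import torch.nn.functional as F

    def run_at_multiple_of_16(net, img):
        # img: (N, 3, H, W) tensor; pad up to the next multiple of 16,
        # run the network, and crop the output back to H x W.
        _, _, h, w = img.shape
        pad_h = (16 - h % 16) % 16
        pad_w = (16 - w % 16) % 16
        padded = F.pad(img, (0, pad_w, 0, pad_h), mode="reflect")
        return net(padded)[:, :, :h, :w]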
When training the pre-enhancement network, initial learning parameters, learning-rate decay, and the number of training rounds are set, a weight initialization scheme is chosen, and the loss function is set to the L2 loss

L2 = Σ_p ( I_out(p) − I_label(p) )²,

that is, the sum of squared differences between the output image and the label, which is to be minimized. The batch size is set to 4 and the initial learning rate to 0.0001; 2000 rounds of training are performed in total, and after 1500 rounds the learning rate is reduced to 0.00001. The network parameters may be optimized with an Adam optimizer.
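A training-loop sketch with these hyperparameters (batch size 4, Adam, initial learning rate 1e-4 dropped to 1e-5 after 1500 of 2000 rounds, unreduced L2 loss) might look as follows; the dataset object yielding (sample, label) crop pairs is a placeholder.

    import torch
    from torch.utils.data import DataLoader

    def train_pre_enhance(net, dataset, device="cuda"):
        net = net.to(device).train()
        loader = DataLoader(dataset, batch_size=4, shuffle=True)
        opt = torch.optim.Adam(net.parameters(), lr=1e-4)
        for epoch in range(2000):
            if epoch == 1500:
                for group in opt.param_groups:
                    group["lr"] = 1e-5
            for sample, label in loader:  # 512x512 crops and their labels
                sample, label = sample.to(device), label.to(device)
                loss = ((net(sample) - label) ** 2).sum()  # L2 loss from the text
                opt.zero_grad()
                loss.backward()
                opt.step()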
In some embodiments provided herein, the pre-enhancement network may include a first pre-enhancement network and a second pre-enhancement network, where the first pre-enhancement network is trained with sample images whose exposure levels are lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels, and the second pre-enhancement network is trained with sample images whose exposure levels are not lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels.
On the basis of the above, the pre-enhancing processing on each original image based on the trained pre-enhancing network may include:
performing pre-enhancement processing on each original image with the exposure level lower than the reference exposure level based on the trained first pre-enhancement network;
and performing pre-enhancement processing on each original image with the exposure level not lower than the reference exposure level based on the trained second pre-enhancement network.
It should be noted that pre-enhancing original images of different exposure levels with the first and second pre-enhancement networks respectively makes the pre-enhancement more targeted, thereby improving its effect and producing higher-quality pre-enhanced images.
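Routing between the two networks is then a simple comparison against the reference exposure level. How the exposure level is known (here, a caller-supplied EV value per image) is an assumption of this sketch.

    def pre_enhance_all(originals, exposure_levels, net_low, net_high, ref_ev=0.0):
        # net_low handles EV below the reference; net_high handles the rest
        outputs = []
        for img, ev in zip(originals, exposure_levels):
            net = net_low if ev < ref_ev else net_high
            outputs.append(net(img))
        return outputs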
In some embodiments provided herein, said fusing said reference original image and each of said transformed images may include:
and fusing the reference original image and each transformed image by using a trained fusion network, wherein the fusion network is obtained by training by using a plurality of incompletely aligned sample images with different exposure levels as training samples and using corresponding reference images as labels, the number of the incompletely aligned sample images is consistent with that of the original images, the reference images are pre-enhanced sample images corresponding to the reference sample images, and the reference sample image is one of the incompletely aligned sample images determined according to the preset rule.
Specifically, the sample images constituting the training data of the fusion network may come from at least two static multi-exposure image groups that belong to the same scene but exhibit some background and foreground motion relative to each other; images from different static multi-exposure image groups are not completely aligned, and each group provides at least one sample image. Illustratively, when constructing the training data set of the fusion network, one static multi-exposure image group can be acquired first; the camera is then moved slightly to produce background motion and the subject is moved to produce foreground motion, after which another static multi-exposure image group is acquired, giving two groups that belong to the same scene but have some background and foreground motion. The groups can be written as S_1 = {I_11, I_12, ..., I_1n} and S_2 = {I_21, I_22, ..., I_2n}, where the pre-enhanced image corresponding to group S_1 is I_1R and the pre-enhanced image corresponding to group S_2 is I_2R. On this basis, a preset number of sample images are taken as training samples, with pre-enhanced image I_1R or I_2R as the label, to form one piece of training data for the fusion network; the sample images come from groups S_1 and S_2 respectively, with each group providing at least one. If the reference sample image belongs to group S_1, the label of the training data is I_1R; otherwise it is I_2R.
Illustratively, FIG. 5 shows a schematic diagram of forming the training data of the fusion network from images drawn from the static multi-exposure image groups S_1 and S_2. If the fusion network takes the low-exposure image as the reference original image, I_11 and I_2n may serve as the training samples of training data 1 with I_1R as its label, and I_1n and I_21 as the training samples of training data 2 with I_2R as its label.
To train the fusion network, images in the training data set may be randomly cropped to a resolution of 512 × 512 to serve as sample images, and the label corresponding to each sample image is the reference image cropped to the same resolution in the same way. When performing image fusion with the trained fusion network, the resolution of the input images needs to be adjusted to 16n × 16n, where n is a positive integer; it does not need to be 512 × 512, because the convolution operation is translation invariant and the fusion network processes the whole image in the same way.
When training the fusion network, initial learning parameters, learning-rate decay, and the number of training rounds are set, a weight initialization scheme is chosen, and the L2 loss is used as the loss function:

L2 = Σ_p ( I_out(p) − I_label(p) )²,

that is, the sum of squared differences between the output image and the label, which is to be minimized. The batch size is set to 4 and the initial learning rate to 0.0001; 10000 rounds of training are performed in total, and after 8000 rounds the learning rate is reduced to 0.00001. The network parameters may be optimized with an Adam optimizer.
In one possible implementation, the fusion network may include an encoder, a merging layer, and a decoder connected in sequence.
On this basis, the number of encoders is the same as the number of original images, and each encoder extracts the features of the image input to it. Specifically, an encoder may be composed of convolutional layers and residual blocks.
And the merging layer is used for merging the features extracted by the encoders to obtain merged features. Specifically, the merging layer may be composed of one convolution layer and three residual blocks.
And the decoder is used for decoding according to the merging characteristics to obtain a fused image. Specifically, the decoder may be composed of a deconvolution layer and a residual block.
For example, assume that two original images to be fused are obtained in step S101, one of them is determined as the reference original image according to the preset rule, and the transformed image of the other is obtained through steps S102-S104. To fuse the two images, the fusion network includes two encoders of identical structure. Fig. 6 shows one possible fusion network; with reference to fig. 6, it comprises encoder 1, encoder 2, a merging layer, and a decoder, where encoder 1 extracts the features of the reference original image and encoder 2 extracts the features of the transformed image. Each encoder comprises convolutional layer 01 and convolutional layer + residual blocks 11, 12, 13, and 14 connected in sequence; the merging layer comprises convolutional layer 02 and residual blocks 31-33 connected in sequence; and the decoder comprises deconvolutional layer + residual blocks 21, 22, 23, and 24, convolutional layer 03, residual block 34, and convolutional layer 04 connected in sequence. The convolution kernels of convolutional layers 01-03 are all 3 × 3 with stride 1 and ReLU activation functions; the convolution kernels of the convolutional layers in convolutional layer + residual blocks 11-14 are all 3 × 3 with stride 2 and ReLU activation functions; the convolution kernel of convolutional layer 04 is 3 × 3 with stride 1 and a Sigmoid activation function; the deconvolution kernels of the deconvolutional layers in deconvolutional layer + residual blocks 21-24 are all 4 × 4 with stride 2 and ReLU activation functions; and the convolution kernels of the residual blocks in the fusion network are all 3 × 3 with stride 2 and contain no BN layer.
As shown in fig. 6, the fusion network uses the encoder-merging layer-decoder structure as its backbone. On this basis, to improve the level of feature recovery and the training convergence speed, skip connections between the encoder and the decoder may be established.
In a possible implementation manner, each of the encoders includes a plurality of encoder hidden layers connected in sequence, the decoder includes a plurality of decoder hidden layers connected in sequence, and each of the encoder hidden layers is connected to a decoder hidden layer having the same number of input channels as the number of output channels.
Fig. 7 shows a schematic structural diagram of another possible fusion network, in which the skip connections between encoder hidden layers and decoder hidden layers are drawn as dashed arrows. The encoder of this fusion network may include encoder hidden layers 11 to 15, and the decoder may include six decoder hidden layers 21 to 26; the number of output channels of each encoder and decoder hidden layer is shown in fig. 7. Illustratively, encoder hidden layer 11, with 32 output channels, is connected to decoder hidden layer 25, with 32 input channels, so that the feature maps output by decoder hidden layer 24 and encoder hidden layer 11 can be concatenated along the channel dimension as the feature map input to decoder hidden layer 25.
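The following PyTorch sketch condenses figs. 6-7 into a runnable skeleton: one encoder per input image, a merging layer over the concatenated deepest features, and a decoder whose hidden layers receive skip connections from the same-scale encoder features. Depths and channel counts are simplified relative to the figures and should be read as assumptions.

    import torch
    import torch.nn as nn

    class FusionNet(nn.Module):
        # two encoders -> merging layer -> decoder with skip connections
        def __init__(self):
            super().__init__()
            def make_encoder():
                return nn.ModuleList([
                    nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(inplace=True)),    # full res
                    nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True)),   # 1/2 res
                    nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True)),  # 1/4 res
                ])
            self.enc_ref, self.enc_trans = make_encoder(), make_encoder()
            self.merge = nn.Sequential(nn.Conv2d(256, 128, 3, 1, 1), nn.ReLU(inplace=True))
            self.up1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True))
            self.up2 = nn.Sequential(nn.ConvTranspose2d(64 + 64 + 64, 32, 4, 2, 1), nn.ReLU(inplace=True))
            self.out = nn.Sequential(nn.Conv2d(32 + 32 + 32, 3, 3, 1, 1), nn.Sigmoid())

        @staticmethod
        def encode(encoder, x):
            feats = []
            for stage in encoder:
                x = stage(x)
                feats.append(x)
            return feats  # features at full, 1/2, and 1/4 resolution

        def forward(self, ref, trans):
            f_ref = self.encode(self.enc_ref, ref)
            f_trans = self.encode(self.enc_trans, trans)
            x = self.merge(torch.cat([f_ref[2], f_trans[2]], dim=1))      # merging layer
            x = self.up1(x)                                               # 1/4 -> 1/2
            x = self.up2(torch.cat([x, f_ref[1], f_trans[1]], dim=1))     # skip at 1/2
            return self.out(torch.cat([x, f_ref[0], f_trans[0]], dim=1))  # skip at full res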
In some embodiments of the present application, the preset rule is to use the image with the lower exposure level among the at least two images as the reference image, or to use the image with the higher exposure level among the at least two images as the reference image.
On this basis, the fusion network may include a low-exposure-based fusion network and a high-exposure-based fusion network. It should be noted that the fusion network learns autonomously, according to the reference image, which features to use and which to discard, so the two fusion networks may output fused images of different quality. Also note that existing cameras use automatic exposure when capturing images, to avoid as far as possible capturing images with large over-exposed regions; therefore, in a group of original images to be fused, the degree of under-exposure of the low-exposure original image is often greater than the degree of over-exposure of the high-exposure original image. A low-exposure-based fusion network may amplify the noise in the under-exposed regions of the low-exposure original image and fail to match the high-exposure original image, so a multi-exposure image fusion method that includes a high-exposure-based fusion network performs better on such images and outputs fused images of higher quality. Conversely, when fusing a group of original images in which the degree of under-exposure of the low-exposure original image is much less than the degree of over-exposure of the high-exposure original image, a multi-exposure image fusion method that includes a low-exposure-based fusion network can obtain a fused image of higher quality.
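A trivial sketch of the preset rule: pick the reference by exposure. Ordering by mean intensity is an illustrative assumption; any available EV metadata would serve equally well.

    import numpy as np

    def pick_reference_index(originals, prefer="high"):
        # prefer="high": high-exposure-based fusion (often better, per the
        # discussion above); prefer="low": low-exposure-based fusion
        brightness = [float(np.mean(img)) for img in originals]
        return int(np.argmax(brightness)) if prefer == "high" else int(np.argmin(brightness))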
The multi-exposure image fusion device provided in the embodiment of the present application is described below, and the multi-exposure image fusion device described below and the multi-exposure image fusion method described above may be referred to in correspondence with each other.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a multi-exposure image fusion apparatus disclosed in the embodiment of the present application.
As shown in fig. 8, the apparatus may include:
the image fusion processing device comprises an image to be fused acquisition unit 101, a fusion processing unit and a fusion processing unit, wherein the image to be fused acquisition unit is used for acquiring at least two original images to be fused;
a pre-enhanced image generation unit 102, configured to perform pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhanced image, wherein the pre-enhancement network is trained with sample images of various exposure levels as training samples and the corresponding pre-enhanced sample images as labels, and the exposure level of a pre-enhanced sample image is the reference exposure level;
a homography matrix estimation unit 103, configured to perform homography matrix estimation on each of the remaining pre-enhanced images by taking the pre-enhanced image corresponding to a reference original image as a reference, to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule;
an image homography transformation unit 104, configured to, for each remaining pre-enhanced image, perform homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and an image fusion unit 105, configured to fuse the reference original image and each of the transformed images, with the fused image approaching the reference exposure level as the goal, to obtain a fused image.
In some embodiments provided herein, the process of fusing the reference original image and each of the transformed images by the image fusion unit 105 may include:
the image fusion unit 105 fuses the reference original image and each of the transformed images by using a trained fusion network, wherein the fusion network is obtained by training with a plurality of incompletely aligned sample images of different exposure levels as training samples and the corresponding reference images as labels; the number of incompletely aligned sample images is consistent with the number of original images, the reference image is the pre-enhanced sample image corresponding to the reference sample image, and the reference sample image is one of the incompletely aligned sample images determined according to the preset rule.
In one possible implementation, the fusion network may include an encoder, a merging layer, and a decoder connected in sequence;
the number of the encoders is consistent with the number of the original images, and the encoders are used for extracting the characteristics of the images input to the encoders;
the merging layer is used for merging the features extracted by the encoders to obtain merged features;
and the decoder is used for decoding according to the merging characteristics to obtain a fused image.
In a possible implementation manner, each of the encoders includes a plurality of encoder hidden layers connected in sequence, the decoder includes a plurality of decoder hidden layers connected in sequence, and each encoder hidden layer is connected to a decoder hidden layer having the same number of input channels as the number of output channels thereof.
In some embodiments provided herein, the pre-enhanced sample image corresponding to a sample image at any exposure level may be an image obtained by fusing several images that have different exposure levels and are completely aligned with the sample image.
In some embodiments provided by the present application, the pre-enhancement network may include a first pre-enhancement network and a second pre-enhancement network, where the first pre-enhancement network is trained with sample images whose exposure levels are lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels, and the second pre-enhancement network is trained with sample images whose exposure levels are not lower than the reference exposure level as training samples and the corresponding pre-enhanced sample images as labels.
On the basis of the above, the process of performing the pre-enhancement processing on each original image by the pre-enhanced image generating unit 102 based on the trained pre-enhancement network may include:
the pre-enhanced image generation unit 102 performs pre-enhancement processing on each original image with exposure level lower than the reference exposure level based on the trained first pre-enhanced network;
the pre-enhanced image generation unit 102 performs pre-enhancement processing on each original image with an exposure level not lower than the reference exposure level based on the trained second pre-enhancement network.
In some embodiments provided by the present application, the preset rule may be to use the image with the lower exposure level among the at least two images as the reference image, or to use the image with the higher exposure level among the at least two images as the reference image.
The multi-exposure image fusion apparatus provided by the embodiment of the application performs homography transformation on the original images, before fusing them, based on homography matrices estimated from the pre-enhanced images, thereby achieving background alignment among the images to be fused; in particular, a homography matrix estimated from the pre-enhanced images remains accurate even when the exposure levels of the original images differ greatly.
The multi-exposure image fusion device provided by the embodiment of the application can be applied to multi-exposure image fusion equipment, such as a terminal: mobile phones, computers, etc. Alternatively, fig. 9 shows a block diagram of a hardware structure of the multi-exposure image fusion apparatus, and referring to fig. 9, the hardware structure of the multi-exposure image fusion apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory and may further include a non-volatile memory or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring at least two original images to be fused;
performing pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhancement image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhancement sample images as labels for training, and the exposure level of the pre-enhancement sample images is a reference exposure level;
taking the pre-enhanced image corresponding to a reference original image as a reference, performing homography matrix estimation on each of the remaining pre-enhanced images to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule;
for each remaining pre-enhanced image, performing homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and fusing the reference original image and each transformed image by taking the approach of the fused image to the reference exposure level as a target to obtain a fused image.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring at least two original images to be fused;
performing pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhancement image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhancement sample images as labels for training, and the exposure level of the pre-enhancement sample images is a reference exposure level;
taking the pre-enhanced image corresponding to a reference original image as a reference, performing homography matrix estimation on each of the remaining pre-enhanced images to obtain a homography matrix for each of them, wherein the reference original image is one of the original images determined according to a preset rule;
for each remaining pre-enhanced image, performing homography transformation on the original image corresponding to that pre-enhanced image based on its homography matrix, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and fusing the reference original image and each transformed image to obtain a fused image by taking the fused image approaching the reference exposure level as a target.
Alternatively, the detailed function and the extended function of the program may be as described above.
To evaluate the multi-exposure image fusion scheme provided by the embodiment of the application, the same original images to be fused are fused with existing algorithms and with the scheme of the application, and each fusion result is evaluated quantitatively; the existing algorithms may include Hu13, Oh15, SPD-MEF, FMMEF, MESPD, MEFNet, Sen12, and DSIFT. Illustratively, 52 groups of double-exposure images with reference images can be fused using the present application and the existing algorithms, and each fusion result evaluated quantitatively. Table 1 shows the average quantitative evaluation results of each image fusion algorithm, where peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) evaluate the visual quality of the fused image; multi-exposure fusion structural similarity (MEF-SSIM), fusion quality (QW), and visual information fidelity (VIF) are indices evaluating the similarity of the fused image to the static multi-exposure images; and the multi-exposure fusion structural similarity for dynamic scenes (MEF-SSIMd) evaluates the fusion quality in dynamic scenes. Larger values of these indices indicate a better fusion effect.
TABLE 1 (rendered as an image in the original document; values not reproduced here)
By using the scheme of the application and the existing algorithms Hu13, SPD-MEF, FMMEF, MESPD, MEFNet and DSIFT, image fusion is performed on multi-exposure images without reference images, for example, the original images to be fused may be any two of a multi-exposure image sequence of 20 dynamic scenes, and since the multi-exposure image sequence lacks a static multi-exposure image group of the same scene and does not have a corresponding reference image, the fusion quality in the dynamic scene is evaluated by using the multi-exposure fusion structural similarity (MEF-SSIMd) of the dynamic scenes, as shown in table 2.
TABLE 2 (rendered as an image in the original document; values not reproduced here)
The application scheme and the existing algorithms Hu13, oh15, SPD-MEF, FMMEF, MESPD, MEFNet, sen12 and DSIFT are used for testing on a High Dynamic Range (HDR) data set, and the test results are shown in Table 3.
TABLE 3 (rendered as an image in the original document; values not reproduced here)
In the multi-exposure image fusion scheme of the application, the fusion network is trained on a dynamic data set: the sample images serving as training samples are not completely aligned and contain background and foreground motion, while a high-quality reference image, free of unnatural illumination and color distortion, is used as the label. As a result, the fused image produced by the trained fusion network has a natural illumination effect and no color distortion, retains more original image information, and is free of ghosting artifacts.
Network training is performed separately on an existing static data set and on the dynamic data set provided by the application, and testing is performed on the existing static data set SICE, the dynamic data set DMEF provided by the application, and the existing MEFOpt test set; the test results are shown in Table 4.
TABLE 4
[Table 4 is reproduced as an image in the original publication; its data are not recoverable here.]
Trained on the dynamic data set DMEF, the multi-exposure image fusion scheme provided by the application obtains better image fusion results, performs well across different test data sets, and therefore exhibits a degree of generalization.
In addition, taking a low-exposure image and a high-exposure image as inputs, fusion results with and without the alignment operation are compared, as shown in Table 5: without the pre-enhancement processing, homography matrix estimation and homography transformation, it is difficult to obtain a high-quality fused image.
TABLE 5
[Table 5 is reproduced as an image in the original publication; its data are not recoverable here.]
Taking low-exposure and high-exposure images as references, fusion results with 2 and with 3 original images to be fused are compared, as shown in Table 6. Although increasing the number of images to be fused improves fusion quality to a certain extent, it also increases the number of network parameters, the running time and the memory consumption; fusing two original images already satisfies most image fusion requirements.
TABLE 6
[Table 6 is reproduced as an image in the original publication; its data are not recoverable here.]
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A multi-exposure image fusion method, comprising:
acquiring at least two original images to be fused;
performing pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhancement image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhancement sample images as labels for training, and the exposure level of the pre-enhancement sample images is a reference exposure level;
taking the pre-enhanced image corresponding to a reference original image as a reference, performing homography matrix estimation on each of the remaining pre-enhanced images to obtain the homography matrix of each remaining pre-enhanced image, wherein the reference original image is one of the original images determined according to a preset rule;
for each of the remaining pre-enhanced images, performing homography transformation on the corresponding original image based on the homography matrix of that pre-enhanced image, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and fusing the reference original image and each transformed image, with the fused image approaching the reference exposure level as the target, to obtain the fused image.
2. The method according to claim 1, wherein fusing the reference original image and each of the transformed images comprises:
and fusing the reference original image and each transformed image by using a trained fusion network, wherein the fusion network is obtained by taking a plurality of incompletely aligned sample images with different exposure levels as training samples and taking a corresponding reference image as a label for training, the number of the incompletely aligned sample images is consistent with the number of the original images, the reference image is the pre-enhanced sample image corresponding to a reference sample image, and the reference sample image is one of the incompletely aligned sample images determined according to the preset rule.
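Illustrative sketch (not part of the claims): a minimal PyTorch training loop for such a fusion network, assuming a data loader that yields pairs of (list of incompletely aligned multi-exposure sample tensors, reference tensor) and an L1 reconstruction loss; the loss, optimizer and all names here are illustrative assumptions, not details taken from the application.

import torch

def train_fusion_network(fusion_net, loader, epochs=100, lr=1e-4):
    optimizer = torch.optim.Adam(fusion_net.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        for samples, reference in loader:
            # `samples` are not fully aligned; `reference` is the pre-enhanced
            # sample image used as the label.
            fused = fusion_net(samples)
            loss = loss_fn(fused, reference)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()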
3. The method of claim 2, wherein the fusion network comprises encoders, a merging layer, and a decoder connected in sequence;
the number of the encoders is consistent with the number of the original images, and each encoder is used for extracting features of the image input to it;
the merging layer is used for merging the features extracted by the encoders to obtain merged features;
and the decoder is used for decoding the merged features to obtain the fused image.
4. The method of claim 3, wherein each of the encoders comprises a plurality of encoder hidden layers connected in sequence, and the decoder comprises a plurality of decoder hidden layers connected in sequence, each encoder hidden layer being connected to a decoder hidden layer having a number of input channels equal to the number of output channels of the encoder hidden layer.
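Illustrative sketch (not part of the claims): a minimal PyTorch rendering of the architecture in claims 3 and 4, with one encoder per input image, a concatenation-based merging layer, and skip connections from each encoder hidden layer to the decoder hidden layer of matching depth; the layer widths and depths are illustrative assumptions, not values from the application.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
            for c_in, c_out in zip(channels[:-1], channels[1:])
        ])

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)  # hidden-layer outputs kept for skip connections
        return feats

class FusionNet(nn.Module):
    def __init__(self, n_inputs=2):
        super().__init__()
        self.encoders = nn.ModuleList([Encoder() for _ in range(n_inputs)])
        # Merging layer: concatenate the deepest features of all encoders.
        self.merge = nn.Sequential(
            nn.Conv2d(128 * n_inputs, 128, 1), nn.ReLU(inplace=True))
        # Each decoder hidden layer takes the previous decoder output plus the
        # skip features from the encoder depth whose channel count matches.
        self.dec = nn.ModuleList([
            nn.Sequential(nn.Conv2d(128 + 128 * n_inputs, 64, 3, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(64 + 64 * n_inputs, 32, 3, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(32 + 32 * n_inputs, 3, 3, padding=1), nn.Sigmoid()),
        ])

    def forward(self, images):
        feats = [enc(img) for enc, img in zip(self.encoders, images)]
        x = self.merge(torch.cat([f[-1] for f in feats], dim=1))
        for depth, layer in zip((-1, -2, -3), self.dec):
            skips = torch.cat([f[depth] for f in feats], dim=1)
            x = layer(torch.cat([x, skips], dim=1))
        return x

Because this sketch uses no spatial downsampling, the skip features and decoder activations share the same resolution and can be concatenated directly.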
5. The method according to claim 1, wherein the pre-enhanced sample image corresponding to a sample image at any exposure level is an image obtained by fusing a plurality of images which have different exposure levels and are completely aligned with the sample image.
6. The method according to claim 1, wherein the pre-enhancement network comprises a first pre-enhancement network and a second pre-enhancement network, the first pre-enhancement network is obtained by taking sample images with exposure levels lower than the reference exposure level as training samples and taking the corresponding pre-enhanced sample images as labels for training, and the second pre-enhancement network is obtained by taking sample images with exposure levels not lower than the reference exposure level as training samples and taking the corresponding pre-enhanced sample images as labels for training;
the pre-enhancement processing is carried out on each original image based on the trained pre-enhancement network, and the pre-enhancement processing comprises the following steps:
performing pre-enhancement processing on each original image with the exposure level lower than the reference exposure level based on the trained first pre-enhancement network;
and performing pre-enhancement processing on each original image with the exposure level not lower than the reference exposure level based on the trained second pre-enhancement network.
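Illustrative sketch (not part of the claims): a hedged Python rendering of this two-branch dispatch, where the mean-intensity exposure estimate and the names first_net and second_net are illustrative assumptions only.

import numpy as np

def pre_enhance(image, first_net, second_net, reference_level=0.5):
    # Crude exposure proxy: mean intensity normalized to [0, 1].
    exposure = image.astype(np.float32).mean() / 255.0
    # Under-exposed images go to the first network, the rest to the second.
    net = first_net if exposure < reference_level else second_net
    return net(image)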
7. The method according to any one of claims 1 to 6, wherein the preset rule is that, of the at least two original images, the image with the lowest exposure level is used as the reference original image, or the image with the highest exposure level is used as the reference original image.
8. A multi-exposure image fusion apparatus, comprising:
an image-to-be-fused acquisition unit, used for acquiring at least two original images to be fused;
a pre-enhanced image generation unit, used for performing pre-enhancement processing on each original image based on a trained pre-enhancement network to obtain a corresponding pre-enhanced image, wherein the pre-enhancement network is obtained by taking sample images with various exposure levels as training samples and taking the corresponding pre-enhanced sample images as labels for training, and the exposure level of the pre-enhanced sample images is a reference exposure level;
a homography matrix estimation unit, used for performing homography matrix estimation on each of the remaining pre-enhanced images by taking the pre-enhanced image corresponding to a reference original image as a reference, to obtain the homography matrix of each remaining pre-enhanced image, wherein the reference original image is one of the original images determined according to a preset rule;
an image homography transformation unit, used for performing, for each of the remaining pre-enhanced images, homography transformation on the corresponding original image based on the homography matrix of that pre-enhanced image, to obtain a transformed image corresponding to each remaining pre-enhanced image;
and an image fusion unit, used for fusing the reference original image and each transformed image, with the fused image approaching the reference exposure level as the target, to obtain the fused image.
9. A multi-exposure image fusion apparatus characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the multi-exposure image fusion method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multi-exposure image fusion method according to any one of claims 1 to 7.
CN202211265514.0A 2022-10-17 2022-10-17 Multi-exposure image fusion method, device, equipment and storage medium Active CN115330658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211265514.0A CN115330658B (en) 2022-10-17 2022-10-17 Multi-exposure image fusion method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115330658A 2022-11-11
CN115330658B CN115330658B (en) 2023-03-10

Family

ID=83915297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211265514.0A Active CN115330658B (en) 2022-10-17 2022-10-17 Multi-exposure image fusion method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115330658B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350601A1 (en) * 2015-06-01 2016-12-01 Brightway Vision Ltd. Image enhancements for vehicle imaging systems
CN109948557A (en) * 2019-03-22 2019-06-28 中国人民解放军国防科技大学 Smoke detection method with multi-network model fusion
CN110602467A (en) * 2019-09-09 2019-12-20 Oppo广东移动通信有限公司 Image noise reduction method and device, storage medium and electronic equipment
CN112308775A (en) * 2020-09-23 2021-02-02 中国石油大学(华东) Underwater image splicing method and device
CN112581392A (en) * 2020-12-15 2021-03-30 中山大学 Image exposure correction method, system and storage medium based on bidirectional illumination estimation and fusion restoration
CN112950497A (en) * 2021-02-22 2021-06-11 上海商汤智能科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113284079A (en) * 2021-05-27 2021-08-20 山东第一医科大学(山东省医学科学院) Multi-modal medical image fusion method
CN113822830A (en) * 2021-08-30 2021-12-21 天津大学 Multi-exposure image fusion method based on depth perception enhancement


Also Published As

Publication number Publication date
CN115330658B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
Lv et al. Attention guided low-light image enhancement with a large scale low-light simulation dataset
CN110324664B (en) Video frame supplementing method based on neural network and training method of model thereof
KR20170031033A (en) Methods, systems and apparatus for over-exposure correction
CN108335279A (en) Image co-registration and HDR imagings
CN106603941B (en) HDR image conversion method and system with self-adaptive computation complexity
CN110225260B (en) Three-dimensional high dynamic range imaging method based on generation countermeasure network
CN113034384A (en) Video processing method, video processing device, electronic equipment and storage medium
CN114820405A (en) Image fusion method, device, equipment and computer readable storage medium
CN112767295A (en) Image processing method, image processing apparatus, storage medium, and electronic device
Gryaditskaya et al. Motion aware exposure bracketing for HDR video
CN111242860A (en) Super night scene image generation method and device, electronic equipment and storage medium
Zhang et al. Deep motion blur removal using noisy/blurry image pairs
CN113962859A (en) Panorama generation method, device, equipment and medium
CN111784693A (en) Image quality evaluation method and device, electronic equipment and storage medium
Liang et al. A decoupled learning scheme for real-world burst denoising from raw images
Liu et al. Joint hdr denoising and fusion: A real-world mobile hdr image dataset
CN111724448A (en) Image super-resolution reconstruction method and device and terminal equipment
Zhang et al. Self-supervised image restoration with blurry and noisy pairs
CN115297257B (en) Method, device and equipment for acquiring multiple paths of video streams
CN112419161B (en) Image processing method and device, storage medium and electronic equipment
CN115330658B (en) Multi-exposure image fusion method, device, equipment and storage medium
CN115867934A (en) Rank invariant high dynamic range imaging
CN110351489B (en) Method and device for generating HDR image and mobile terminal
Guan et al. NODE: Extreme low light raw image denoising using a noise decomposition network
CN116437222A (en) Image processing method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant