CN114708172A - Image fusion method, computer program product, storage medium, and electronic device


Info

Publication number
CN114708172A
CN114708172A (application CN202210163313.3A)
Authority
CN
China
Prior art keywords
channel
fusion
frames
image
feature map
Prior art date
Legal status
Pending
Application number
CN202210163313.3A
Other languages
Chinese (zh)
Inventor
蒋霆
李鑫鹏
刘帅成
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202210163313.3A
Publication of CN114708172A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging


Abstract

The application relates to the technical field of image processing, and provides an image fusion method, a computer program product, a storage medium, and an electronic device. The image fusion method comprises: acquiring n frames of images to be fused whose exposure degrees are not all the same; fusing the channel images belonging to the same channel, and determining a fusion result image according to at least one frame of channel fusion image obtained by the fusion. For channel images belonging to the same target channel in the n frames of images to be fused, a channel fusion image is calculated by: extracting a basic feature map of the n frames of channel images with a backbone network of a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of fusion masks corresponding to the n frames of channel images with n branch networks of the neural network model; and performing weighted fusion on the n frames of channel images using the n frames of fusion masks. The method helps improve the quality of the fusion result image and avoid problems such as a washed-out (grayish) picture and loss of detail.

Description

Image fusion method, computer program product, storage medium, and electronic device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image fusion method, a computer program product, a storage medium, and an electronic device.
Background
When a shooting scene contains both high-brightness areas lit by a strong light source and areas of relatively low brightness such as shadows and backlit regions, images captured by a camera often show bright areas blown out to white by overexposure and dark areas crushed to black by underexposure, which severely degrades image quality. A camera can capture only a limited span between the brightest and darkest areas of the same scene, and this span is commonly referred to as its "dynamic range".
High-Dynamic-Range (HDR) imaging is a technology for restoring as much scene detail as possible when the captured scene has strong brightness contrast. A typical application of this technology is: using a specific fusion algorithm, multiple frames of images with different exposure degrees collected for the same scene are synthesized into one HDR image by weighted fusion, so that the HDR image retains good detail in both highlights and shadows.
However, the weights selected by existing fusion algorithms often lack robustness, so that when the exposure span of the images to be fused is large, the synthesized image tends to suffer from problems such as a washed-out (grayish) picture and loss of detail, and a true HDR effect cannot be achieved.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image fusion method, a computer program product, a storage medium, and an electronic device, so as to address the technical problems described above.
In order to achieve the above purpose, the present application provides the following technical solutions:
In a first aspect, the image fusion method in the present application includes: acquiring n frames of images to be fused whose exposure degrees are not all the same, wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel; fusing the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and determining a fusion result image according to the at least one frame of channel fusion image. For the n frames of channel images belonging to the same target channel in the n frames of images to be fused, the corresponding channel fusion image is calculated by the following steps: extracting a basic feature map of the n frames of channel images with a backbone network of a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of fusion masks corresponding to the n frames of channel images with n branch networks of the neural network model, wherein each branch network calculates one frame of fusion mask corresponding to one frame of channel image from one frame of sub-feature map, and each frame of fusion mask contains the weights for fusing the corresponding frame of channel image; and performing weighted fusion on the n frames of channel images with the n frames of fusion masks to obtain the channel fusion image.
In this method, when the channel images of the target channel are fused, the weights used for fusion (the fusion masks) are calculated by a neural network model. Because the neural network model is learned from training data rather than being a preset rule determined by experience, the calculated weights are more robust, which improves the quality of the resulting fusion result image (HDR image) when the exposure span of the images to be fused is large and avoids problems such as a washed-out picture and loss of detail.
In addition, the neural network model used in the method is divided into a backbone network and branch networks. The backbone network is shared by the n frames of channel images, which facilitates inter-frame information exchange and improves the precision of the fusion masks; each branch network calculates the fusion mask corresponding to one frame of channel image, which allows its parameters to be optimized for that frame and further improves mask precision. The more precise the calculated fusion masks, the better the fusion effect.
In one implementation of the first aspect, the backbone network is a network with an attention mechanism that includes at least one of channel attention, frame attention, and spatial attention; and/or the branch network is a network with an attention mechanism that includes at least one of channel attention and spatial attention. Channel attention means giving corresponding weights to data in different channels of a feature map, frame attention means giving corresponding weights to data in different frames of a feature map, and spatial attention means giving corresponding weights to data at different spatial positions of a feature map.
In this implementation, adding different attention mechanisms to the neural network improves the accuracy of the calculated fusion masks.
In one implementation manner of the first aspect, the backbone network includes a multi-attention module, and the multi-attention module includes a channel attention unit and a frame attention unit that are connected in sequence; the channel attention unit is used for calculating the weight of the input feature map of the multi-attention module in the channel dimension, and multiplying the calculated weight by data in different channels of the input feature map of the multi-attention module to obtain an output feature map of the channel attention unit; the frame attention unit is used for transposing the channel dimension and the frame dimension in the output feature map of the channel attention unit, calculating the weight of the transposed feature map in the channel dimension, multiplying the calculated weight with data in different channels of the transposed feature map to obtain a weighted feature map, and transposing the channel dimension and the frame dimension in the weighted feature map to obtain the output feature map of the multi-attention module.
Because the multi-attention module in this implementation includes both channel attention and frame attention, it helps strengthen the exchange and sharing of information between channels and between frames, reinforcing inter-frame relations and improving the precision of the calculated fusion masks.
In an implementation manner of the first aspect, the branch network includes a multi-receptive-field module, and the multi-receptive-field module includes m feature extraction branches and a feature fusion unit, wherein m is an integer greater than 1. Each feature extraction branch performs feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, the receptive fields of the feature extraction branches differ from one another, and the feature fusion unit fuses the m output feature maps of the feature extraction branches to obtain the output feature map of the multi-receptive-field module.
Because the multi-receptive-field module in this implementation includes several feature extraction branches with different receptive fields, it helps improve the consistency of spatial information and prevents fusion layering (i.e., discontinuities or faults in the image) in the channel fusion image.
In one implementation of the first aspect, at least m-1 of the m feature extraction branches include dilated (hole) convolution layers. When m is greater than 2, the dilation rates of the dilated convolution layers in these at least m-1 feature extraction branches are different; when exactly m-1 feature extraction branches include a dilated convolution layer, the remaining branch includes only an ordinary convolution layer or is a direct (skip) connection branch.
In this implementation, different receptive fields are realized with dilated convolutions of different dilation rates, and a feature extraction branch containing only an ordinary convolution layer (ordinary convolution can be regarded as dilated convolution with a dilation rate of 1) or a direct connection branch (used to form a residual structure with the other branches) can also be added.
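As an illustrative sketch of this implementation (the channel count, kernel sizes, and dilation rates below are assumptions chosen for illustration, not values fixed by the application), a minimal PyTorch version of such a multi-receptive-field module might look like:

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldModule(nn.Module):
    """Sketch: m = 3 parallel feature extraction branches with different
    receptive fields, followed by a feature fusion unit (a 1x1 convolution)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        # Branch with an ordinary 3x3 convolution (dilation rate 1).
        self.branch1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        # Branches with dilated ("hole") convolutions of different dilation rates.
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=4, dilation=4)
        # Feature fusion unit: concatenate branch outputs, then mix with a 1x1 conv.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.branch1(x), self.branch2(x), self.branch3(x)]
        return self.fuse(torch.cat(feats, dim=1))

# Example: a feature map of shape (frames, c, h, w) keeps its size and channels.
x = torch.randn(3, 16, 64, 64)
y = MultiReceptiveFieldModule(16)(x)
```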
In an implementation manner of the first aspect, each of the at least m-1 feature extraction branches is configured to calculate weights for the output feature map of its own dilated convolution layer in the spatial dimension, and to multiply the calculated weights with the data at different spatial positions of that output feature map to obtain its own output feature map.
That is, a spatial attention mechanism can also be combined into the multi-receptive-field module to further improve the accuracy of the calculated fusion masks.
In an implementation manner of the first aspect, the branch network includes a multi-receptive-field module, and the multi-receptive-field module includes m feature extraction units connected in sequence, wherein m is an integer greater than 1. The first of the m feature extraction units performs feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, each subsequent feature extraction unit performs feature extraction on the output feature map of the previous feature extraction unit to obtain its own output feature map, the output feature map of the last feature extraction unit is the output feature map of the multi-receptive-field module, and the receptive fields corresponding to the feature extraction units are different.
In the previous implementation, multiple receptive fields are realized by several feature extraction branches arranged in parallel. In this implementation, those branches are instead arranged in series (and renamed feature extraction units), which realizes multiple receptive fields just as well.
In an implementation manner of the first aspect, the backbone network is further configured to downsample the n-frame channel images; the calculating of the n frames of fusion masks corresponding to the n frames of channel images by using the n branch networks in the neural network model includes: calculating n frames of low-resolution fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model; and performing up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
In the implementation manner, the channel image of the target channel is firstly downsampled to reduce the resolution, and then is subjected to subsequent processing until the fusion mask with the low resolution is obtained, and then the fusion mask with the original resolution is obtained through upsampling. Therefore, most of operations in the whole mask calculation process can be performed on the image with low resolution, so that the mask calculation efficiency is obviously improved, and the image fusion efficiency is further improved.
In an implementation manner of the first aspect, the upsampling the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as that of the n frames of channel images includes: and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
When the fusion mask with the original resolution is calculated by using the fusion mask with the low resolution, the above implementation manner not only performs upsampling, but also performs edge-preserving smooth filtering (for example, guided filtering), so that the reduction of the mask quality caused by the upsampling is favorably suppressed, and the precision of the fusion mask is improved.
In an implementation manner of the first aspect, if each frame of image to be fused includes multiple channels, and the multiple channels are not all target channels, n frames of channel images belonging to the same non-target channel in the n frames of images to be fused are weighted and fused by using a fusion mask calculated when the channel images of the target channels are fused.
In the implementation mode, for the channel images of the non-target channel, the fusion mask used by the channel images of the target channel can be directly used during fusion, so that the fusion process is simplified, and the fusion efficiency is improved. For example, the target channel may select a relatively important channel (such as the Y channel of the YUV image) in the image to be fused, and the non-target channel may select a relatively minor channel (such as the U, V channel of the YUV image) in the image to be fused, so that the fusion mask used by the non-target channel does not have a great influence on the image quality of the fusion result image even if the fusion mask is not calculated according to the own channel image.
In a second aspect, the present application provides an image fusion apparatus, including: an image acquisition component configured to acquire n frames of images to be fused whose exposure degrees are not all the same, wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel; and an image fusion component configured to fuse the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and to determine a fusion result image according to the at least one frame of channel fusion image. For the n frames of channel images belonging to the same target channel in the n frames of images to be fused, the corresponding channel fusion image is calculated by the following steps: extracting a basic feature map of the n frames of channel images with a backbone network of a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of fusion masks corresponding to the n frames of channel images with n branch networks of the neural network model, wherein each branch network calculates one frame of fusion mask corresponding to one frame of channel image from one frame of sub-feature map, and each frame of fusion mask contains the weights for fusing the corresponding frame of channel image; and performing weighted fusion on the n frames of channel images with the n frames of fusion masks to obtain the channel fusion image.
In a third aspect, an embodiment of the present application provides a computer program product, which includes computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided in the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory in which computer program instructions are stored, and a processor, where the computer program instructions are read and executed by the processor to perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 illustrates steps of an image fusion method provided by an embodiment of the present application;
FIG. 2 illustrates sub-steps that may be included in step S120 of FIG. 1;
fig. 3 illustrates a specific workflow of an image fusion method provided by an embodiment of the present application;
FIG. 4 illustrates one configuration of a multi-attention module provided by an embodiment of the present application;
FIG. 5 illustrates a structure for computing weights in a multi-attention module;
FIG. 6 shows one configuration of a multi-field module provided by an embodiment of the present application;
FIG. 7 shows a structure for computing weights in a multi-receptive field module;
FIG. 8 shows functional components included in an image fusion apparatus provided in an embodiment of the present application;
fig. 9 shows a structure of an electronic device provided in an embodiment of the present application.
Detailed Description
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has been actively developed. Artificial Intelligence (AI) is an emerging scientific technology for studying and developing theories, methods, techniques and application systems for simulating and extending human Intelligence. The artificial intelligence subject is a comprehensive subject and relates to various technical categories such as chips, big data, cloud computing, internet of things, distributed storage, deep learning, machine learning and neural networks. Computer vision is used as an important branch of artificial intelligence, particularly, a machine is used for identifying the world, and the computer vision technology generally comprises technologies such as face identification, living body detection, fingerprint identification and anti-counterfeiting verification, biological feature identification, face detection, pedestrian detection, target detection, pedestrian identification, image processing, image identification, image semantic understanding, image retrieval, character identification, video processing, video content identification, behavior identification, three-dimensional reconstruction, virtual reality, augmented reality, synchronous positioning and map construction, computational photography, robot navigation and positioning and the like. With the research and development of artificial intelligence technology, the technology is applied to many fields, such as security protection, city management, traffic management, building management, park management, face passage, face attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone images, cloud services, smart homes, wearable equipment, unmanned driving, automatic driving, intelligent medical treatment, face payment, face unlocking, fingerprint unlocking, human evidence verification, smart screens, smart televisions, cameras, mobile internet, network, beauty, makeup, medical beauty, intelligent temperature measurement and the like. The image fusion method in the embodiment of the application also utilizes the technologies of aspects such as artificial intelligence and the like.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Fig. 1 illustrates steps of an image fusion method provided in an embodiment of the present application, which may be, but is not limited to, performed by the electronic device illustrated in fig. 9, and reference may be made to the following description about fig. 9 regarding the structure of the electronic device. Referring to fig. 1, the image fusion method includes:
step S110: and acquiring n frames of images to be fused with incompletely same exposure degrees.
Wherein n is an integer greater than 1, the images to be fused refer to images that need to be fused with each other, and the fusion is aimed at obtaining images with HDR effect. The exposure degree of the image to be fused can be quantitatively expressed by a certain means, for example, an exposure compensation value (EV) or the like. Note that, in step S110, it is only required that the exposure degrees of the n frames of images to be fused are not exactly the same, but it is not required that the exposure degrees of the n frames of images to be fused are completely different: for example, when n is 3, 3 frames of images of EV-1, EV0, EV +1 are a satisfactory set of images to be fused; when n is 4, the 4-frame images of EV-1, EV0, EV0 and EV +1 are also a satisfactory set of images to be fused, wherein the image of EV0 appears 2 times, which indicates that the image of EV0 is important and should have a higher percentage in the fusion result.
The present application does not limit how to obtain n frames of images to be fused that satisfy the above requirements: for example, it may be that an image captured in real time is received from a camera of the electronic device as an image to be fused (for example, EV values of the camera are set to-1, 0, and 1, respectively, to continuously capture 3 frames of images); for another example, the image to be fused that is stored in advance may be directly read from the memory of the electronic device, and so on. It should be further understood that the n frames of images to be fused are directed to the same scene in content, otherwise it is not meaningful to fuse them.
Step S120: fuse the n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fusion image, and determine a fusion result image according to the at least one frame of channel fusion image.
Each frame of image to be fused comprises at least one channel, and the number of the channels contained in each frame of image to be fused is the same. For example, the n frames of images to be fused may all be grayscale images, that is, all include 1 channel; for another example, the n frames of images to be fused may all be YUV images, that is, all include 3 channels; for another example, the n frames of images to be fused may all be RGB images, i.e., each contain 3 channels, and so on.
If each frame of image to be fused contains only one channel, the channel image of that channel is the image to be fused itself; in step S120, fusing the n frames of channel images is therefore the same as fusing the n frames of images to be fused, and the resulting channel fusion image is the fusion result image.
If each frame of image to be fused comprises two or more channels, the channels are divided for image fusion to obtain channel fusion images with the same number as the channels, and the channel fusion images are spliced together according to the channels to which the channel fusion images belong to obtain a fusion result image. For example, if n frames of images to be fused are YUV images, n frames of channel images of the Y channel should be fused to obtain a channel fusion image of the Y channel, n frames of channel images of the U channel should be fused to obtain a channel fusion image of the U channel, n frames of channel images of the V channel should be fused to obtain a channel fusion image of the V channel, and finally, 3 frames of channel fusion images are spliced together as new Y, U, V channel images, respectively, so that a fusion result image can be obtained, and obviously, the fusion result image is still a YUV image.
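A minimal sketch of this per-channel processing for YUV inputs is shown below; fuse_channel stands in for whichever channel fusion routine is applied (the neural-network-based fusion of steps S121 to S123 for target channels, or another method for non-target channels), and the array layout is an assumption made for illustration.

```python
import numpy as np

def fuse_yuv_frames(frames, fuse_channel):
    """frames: list of n YUV images, each of shape (H, W, 3).
    fuse_channel: callable that fuses a list of n single-channel images
    into one channel fusion image of shape (H, W)."""
    fused_channels = []
    for c in range(3):                              # Y, U, V channels in turn
        channel_images = [f[:, :, c] for f in frames]
        fused_channels.append(fuse_channel(channel_images))
    # Stitch the channel fusion images back together as the new Y, U, V planes.
    return np.stack(fused_channels, axis=-1)
```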
The number of the target channels may be one or more, and if there are remaining channels in the image to be fused besides the target channels, these remaining channels may be called non-target channels, and it may be predetermined which channels are target channels. For example, if the image to be fused is a grayscale image, only one channel thereof is necessarily also the target channel; if the image to be fused is a YUV image, the Y channel may be selected as the target channel, where U, V is a non-target channel, or Y, U, V channels may be all selected as target channels, where no non-target channel exists.
The difference between the target channel and the non-target channel is that for n frames of channel images of each target channel, the steps S121 to S123 in fig. 2 should be performed to fuse into a corresponding one frame of channel fusion image, and for n frames of channel images of each non-target channel (if there is a non-target channel), the step S124 in fig. 2 may be performed to fuse into a corresponding one frame of channel fusion image. Obviously, steps S121 to S124 all belong to the substeps of step S120.
The steps S121 to S123 are the main fusion scheme provided in the present application, and can significantly improve the effect of image fusion (mainly referred to as HDR effect), and some implementation manners of these steps can also significantly improve the efficiency of image fusion. Therefore, as an optional strategy, in order to improve the image fusion effect as much as possible, all channels included in the image to be fused may be used as target channels, or, if efficiency and other factors are considered, only one or a few channels of the image to be fused that most significantly affect the fusion result may be used as target channels (for example, Y channels in YUV images), and the rest channels may be used as non-target channels (for example, U, V channels in YUV images).
The steps in fig. 2 are explained in detail below, and will be primarily described in conjunction with the flow shown in fig. 3.
Step S121: extract the basic feature map of the n frames of channel images using the backbone network of the pre-trained neural network model, and split the basic feature map into n frames of sub-feature maps.
Step S122: calculate the n frames of fusion masks corresponding to the n frames of channel images using the n branch networks of the neural network model.
Steps S121 and S122 are explained in conjunction. The neural network model used in the step is trained in advance (before step S120 is executed), and the training process is described later, if there are a plurality of target channels, one neural network model may be trained for each target channel, or in order to save computational resources, a plurality of target channels may share one neural network model.
The neural network model is not limited to which neural network is specifically adopted, and may be, for example, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), an Artificial Neural Network (ANN), or a combination of these neural networks. The neural network model at least comprises two parts, namely a main network and n branch networks, wherein the tail part of the main network is connected with the head part of each branch network, and the main network and the branch networks are also neural networks respectively and have trainable parameters.
The backbone network is used to extract features from the n frames of channel images of the target channel to obtain a corresponding basic feature map; the basic feature map is then split into n frames of sub-feature maps, and each frame of sub-feature map serves as the input of one branch network.
Referring to fig. 3, the uppermost white rectangle represents n frames of channel images of the target channel (n is 3 in fig. 3), denoted as x _ H, which may be input as a whole when input to the backbone network, and has a shape of n × 1 × W × H, where n is the number of frames, 1 is the number of channels included in each frame, and W and H are the width and height of the channel images, respectively. The rectangular parallelepiped below the multi-attention module (described later) represents the calculated base feature map in the shape of n × c × w × h, where n is the number of frames, c is the number of channels contained per frame (c > 1), and w and h are the width and height of the base feature map, respectively. The n cuboids with the shape of 1 × c × w × h below the splitting module represent n frames of sub-feature maps, and it can be seen that the splitting module splits the frame dimension of the basic feature map, each frame of sub-feature map can be considered to correspond to one frame of channel image, but the "correspondence" here should not be understood that each frame of sub-feature map is calculated only according to one frame of channel image.
Each branch network calculates one corresponding frame of fusion mask from its input frame of sub-feature map, so the n branch networks yield n frames of fusion masks in total. As described above, each sub-feature map corresponds to one channel image, so each frame of fusion mask calculated from a sub-feature map also corresponds to one frame of channel image.
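The data flow just described (a shared backbone producing an n × c × w × h basic feature map that is split along the frame dimension and handed to n branch networks) can be sketched as follows. The backbone and branch layer stacks here are placeholders, since the application does not fix them to any particular architecture, and for simplicity this sketch processes the frames independently in the backbone, whereas the application's backbone also mixes information across frames (e.g. via frame attention).

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Sketch: shared backbone + n branch networks producing n fusion masks."""
    def __init__(self, n_frames: int = 3, feat_channels: int = 16):
        super().__init__()
        # Shared backbone (placeholder): n x 1 x H x W -> basic feature map n x c x h x w.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )
        # One branch network per frame; each maps 1 x c x h x w -> a 1-channel mask.
        self.branches = nn.ModuleList(
            nn.Conv2d(feat_channels, 1, kernel_size=3, padding=1)
            for _ in range(n_frames)
        )

    def forward(self, channel_images: torch.Tensor) -> torch.Tensor:
        # channel_images: n x 1 x H x W (n frames of the target channel)
        base = self.backbone(channel_images)        # basic feature map: n x c x h x w
        sub_maps = torch.split(base, 1, dim=0)      # n sub-feature maps: 1 x c x h x w
        masks = [branch(sub) for branch, sub in zip(self.branches, sub_maps)]
        return torch.cat(masks, dim=0)              # n x 1 x h x w fusion masks
```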
The fusion mask can be regarded as a kind of weight image, that is, each pixel value in the fusion mask represents the weight used by the corresponding frame of channel image when fusion is performed, and the resolution of the fusion mask is the same as that of the channel image. How to perform the fusion of the channel images using the fusion mask is explained again in step S123.
Optionally, to make the fusion masks easier to use, the weights in the fusion masks may be normalized: after normalization, for any pixel position x, the n weights at x across the n frames of fusion masks sum to 1, and each weight lies in the interval [0, 1]. For simplicity, the description of step S123 below does not assume that the weights in the fusion masks have been normalized.
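One common way to obtain such normalized weights (an assumption for illustration, not a requirement of the application) is a per-pixel softmax across the frame dimension:

```python
import torch

def normalize_masks(raw_masks: torch.Tensor) -> torch.Tensor:
    """raw_masks: n x 1 x H x W unnormalized mask logits.
    Returns masks where, at every pixel, the n weights sum to 1 and lie in [0, 1]."""
    return torch.softmax(raw_masks, dim=0)
```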
Referring to fig. 3, n gray rectangles below a filtering/upsampling module (described later) represent n frames of fusion masks, denoted as m _ H, where the shape of each frame of fusion mask is 1 × 1 × W × H, i.e., the shape of one frame of fusion mask is identical to the shape of one frame of channel image.
It should be noted that although step S122 calculates the fusion masks using the branch networks, this does not require the branch networks to output the fusion masks directly: each branch network may output its corresponding fusion mask directly, or the branch network outputs may be processed further to obtain the corresponding fusion masks. For example, in fig. 3, the n smaller gray rectangles below the multi-receptive-field module (described later) are the branch network outputs, while the n frames of fusion masks are located below the filtering/upsampling module, which is no longer part of the branch networks.
Step S123: perform weighted fusion on the n frames of channel images using the n frames of fusion masks to obtain the channel fusion image.
First, taking linear fusion as an example, it can be expressed by the following formula:

Y^* = \sum_{i=1}^{n} M_i \odot Y_i

where Y^* denotes the channel fusion image, Y_i denotes the i-th frame channel image, M_i denotes the i-th frame fusion mask (the fusion mask corresponding to Y_i), and \odot denotes element-wise multiplication of pixel values at corresponding positions. As described above, a fusion mask and a channel image have the same resolution, so the \odot operation is well defined. Linear fusion is thus a process of taking the n frames of fusion masks as weight images and computing a weighted sum of the n frames of channel images to obtain the channel fusion image; if the weights in the fusion masks are normalized, pixels in the channel fusion image are guaranteed to have the same value range as pixels in the channel images, for example [0, 255].
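A minimal sketch of this linear fusion, assuming the masks and channel images are stacked along a leading frame axis and the masks are already normalized:

```python
import numpy as np

def linear_fuse(channel_images: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """channel_images, masks: arrays of shape (n, H, W); masks are assumed
    normalized so that they sum to 1 at every pixel.
    Returns the channel fusion image Y^* = sum_i M_i * Y_i, shape (H, W)."""
    return np.sum(masks * channel_images, axis=0)
```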
Of course, linear fusion is not the only weighted fusion method, and for example, pyramid fusion can also be used, and the steps thereof are briefly described as follows:
step A: and constructing a corresponding Laplacian pyramid according to each frame of channel image to obtain n Laplacian pyramids.
The number of layers of the gaussian pyramid and the laplacian pyramid is the same, and the resolution of the pyramid image in the same layer in the two pyramids is also the same, and regarding the structure mode of the gaussian pyramid and the laplacian pyramid, reference may be made to the prior art, which is not explained in detail here, and in addition, the sequence of constructing the gaussian pyramid and constructing the laplacian pyramid is not limited, or may also be performed simultaneously.
Following the above notation, if M_i denotes the i-th frame fusion mask, the n Gaussian pyramids can be written as \{G_i^j\}, where G_i^j denotes the j-th layer pyramid image of the i-th Gaussian pyramid (built from M_i). If Y_i denotes the i-th frame channel image, the n Laplacian pyramids can be written as \{L_i^j\}, where L_i^j denotes the j-th layer pyramid image of the i-th Laplacian pyramid (built from Y_i). Here i and j take all values in their respective ranges: i runs over the integers 1 to n, and, assuming both the Gaussian and the Laplacian pyramids have L layers (L an integer greater than 1), j runs over the integers 1 to L.
Step B: perform weighted fusion on the n Laplacian pyramids using the n Gaussian pyramids to obtain the fused Laplacian pyramid.
For each layer, the n frames of pyramid images at that layer of the n Laplacian pyramids are weighted and fused using the n frames of pyramid images at the same layer of the n Gaussian pyramids. For example, when linear fusion is employed, the fusion process can be expressed as:

Y^j = \sum_{i=1}^{n} G_i^j \odot L_i^j

where Y^j denotes the j-th layer pyramid image of the fused Laplacian pyramid, so the entire fused Laplacian pyramid can be written as \{Y^j\}, with j taking all integers from 1 to L.
Step C: perform image reconstruction using the fused Laplacian pyramid to obtain the channel fusion image.
As for how to perform image reconstruction using the Laplacian pyramid, reference may be made to the prior art, which is not explained in detail here; the reconstruction result is the channel fusion image to be calculated in step S123.
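Under the same notation, a compact sketch of steps A to C using OpenCV's pyramid primitives might look like the following; the number of pyramid levels and the use of cv2.pyrDown/cv2.pyrUp are illustrative choices rather than details prescribed by the application.

```python
import cv2
import numpy as np

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    gauss = gaussian_pyramid(img, levels)
    lap = []
    for j in range(levels - 1):
        up = cv2.pyrUp(gauss[j + 1], dstsize=(gauss[j].shape[1], gauss[j].shape[0]))
        lap.append(gauss[j] - up)
    lap.append(gauss[-1])                    # keep the coarsest level as-is
    return lap

def pyramid_fuse(channel_images, masks, levels=4):
    """channel_images, masks: lists of n float32 arrays of shape (H, W)."""
    # Step A: Gaussian pyramids of the masks, Laplacian pyramids of the images.
    mask_pyrs = [gaussian_pyramid(m, levels) for m in masks]
    img_pyrs = [laplacian_pyramid(y, levels) for y in channel_images]
    # Step B: weighted fusion layer by layer.
    fused = [sum(g[j] * l[j] for g, l in zip(mask_pyrs, img_pyrs))
             for j in range(levels)]
    # Step C: reconstruct the channel fusion image from the fused pyramid.
    out = fused[-1]
    for j in range(levels - 2, -1, -1):
        out = cv2.pyrUp(out, dstsize=(fused[j].shape[1], fused[j].shape[0])) + fused[j]
    return out
```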
In contrast, pyramid fusion does not directly utilize the fusion mask to perform linear fusion on the channel image itself, but utilizes the gaussian pyramid of the fusion mask to perform layered linear fusion on the laplacian pyramid of the channel image. According to the construction process of the laplacian pyramid, in the laplacian pyramid, the pyramid image of each layer can be regarded as feature maps extracted from the channel images, and the feature maps represent image details of different frequencies in the channel images. Therefore, the weighted fusion of the pyramid images on the same layer is equivalent to the independent image fusion on the subspace corresponding to each frequency, so that the fusion difficulty is low, and the fusion effect is better than that of the fusion directly on the original channel image, thereby being beneficial to improving the quality of the channel fusion image. Of course, the algorithm complexity is higher by adopting pyramid fusion than by adopting linear fusion, so that different fusion modes can be adopted according to requirements in specific implementation.
Referring to fig. 3, fig. 3 shows a linear fusion manner, where n white rectangles are drawn in addition to n gray rectangles at m _ H, the white rectangles representing n frames of an image to be fused, "×" and "+" representing multipliers and adders for linear fusion, and the lowermost white rectangle in fig. 3 represents a channel fusion image of a target channel, which has a shape of 1 × 1 × W × H, the same as that of any one frame of the channel image.
With respect to fig. 3, there are two additional points to be explained:
first, the filtering/upsampling module, the multiplier for linear fusion and the adder are not included in the neural network model because they do not contain trainable model parameters (but may contain some hyper-parameters), but it is of course possible to treat them as part of the neural network model without substantially affecting the scheme.
Secondly, only one basic structure of the neural network model is drawn in fig. 3, but the neural network model may also include many additional structures on the basis of the basic structure, such as pooling layers, convolutional layers, etc., and these additional structures are not drawn to avoid the structure of the model appearing to be too complicated, but may be added to the model as required during implementation. Similar situations exist for fig. 4-7, and are not described one by one.
Step S124: fuse the n frames of channel images using another method to obtain a channel fusion image.
For the channel images belonging to the same non-target channel, other methods may be adopted for fusion. The "other methods" in step S124 are methods other than steps S121 to S123, for example any known image fusion method for obtaining an HDR image.
Furthermore, it is also possible to perform weighted fusion of n-frame channel images belonging to the same non-target channel directly using a fusion mask calculated when the channel images of the target channel are fused. For example, the image to be fused is a YUV image, the target channel is a Y channel, and n frames of fusion masks for the Y channel are calculated by executing steps S121 to S123, so that n frames of channel images for a U channel or a V channel can be directly fused by using the n frames of fusion masks, and the fusion manner can refer to step S123, for example, linear fusion or pyramid fusion can be selected. For example, if there are two target channels, n frames of fusion masks can be calculated by using the channel images of the two target channels, that is, 2n frames of fusion masks are total, then the fusion masks corresponding to each two frames are averaged to obtain n frames of fusion masks, and then the n frames of fusion masks are used to fuse the channel images of the non-target channels.
The fusion mode simplifies the image fusion process of the non-target channel and improves the fusion efficiency. For example, the target channel may select a more important channel (such as the Y channel of the YUV image) in the image to be fused, and the non-target channel may select a relatively less important channel (such as the U, V channel of the YUV image) in the image to be fused, so that the fusion mask used by the non-target channel does not have too great an influence on the image quality of the fusion result image even if the non-target channel is not calculated according to the own channel image.
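For example, with YUV input and the Y channel as the only target channel, reusing the Y-channel masks for the U and V channels might look like this sketch (linear fusion assumed):

```python
import numpy as np

def fuse_with_shared_masks(y_frames, u_frames, v_frames, masks_y):
    """All arguments: arrays of shape (n, H, W). masks_y are the fusion masks
    computed from the Y-channel images and are reused for the U and V channels."""
    fuse = lambda frames: np.sum(masks_y * frames, axis=0)
    return np.stack([fuse(y_frames), fuse(u_frames), fuse(v_frames)], axis=-1)
```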
The image fusion method in fig. 1 and 2 is briefly summarized below: according to the method, when the channel images of the target channel are fused, the neural network model is used for calculating the weight (namely the fusion mask) for fusion, and the neural network model is obtained through training of training data and is not a preset rule determined according to experience, so that the calculated weight has good robustness, the quality of the obtained fusion result image (HDR image) is improved when the exposure degree span of the image to be fused is large, and the phenomena of grey picture, lack of details and the like are avoided.
Here, if the exposure degree is expressed by an EV value, a large exposure span can be understood as follows: after the n frames of images to be fused are sorted by ascending or descending EV value, the EV difference between two adjacent frames is large, or the EV difference between the first and last frames is large. It should be understood that the image fusion method in the embodiments of the present application improves the quality of the fused image even when the exposure span of the images to be fused is not large.
In addition, the neural network model used in the method is divided into a backbone network and branch networks. The backbone network is shared by the n frames of channel images, which facilitates inter-frame information exchange among the images to be fused and improves the precision of the fusion masks; each branch network calculates the fusion mask corresponding to one frame of channel image, which allows its parameters to be optimized for that frame and further improves mask precision. In short, the more precise the calculated fusion masks, the better the image fusion effect.
Next, based on the above embodiment, the downsampling and upsampling mechanisms in the image fusion method are described:
In one implementation, the backbone network further includes a downsampling module, which may be disposed at the start of the backbone network and is used to downsample the n frames of channel images, as shown in fig. 3. Downsampling produces low-resolution channel images; note that "low resolution" here is relative rather than absolute, meaning lower than the original resolution of the channel images, and the term is to be understood in this sense hereinafter. With continued reference to fig. 3, the white rectangle below the downsampling module represents the low-resolution channel images, denoted x_l, whose shape is n × 1 × w × h with w < W and h < H, i.e. whose resolution is lower than that of the original channel images.
The subsequent calculation of the neural network model is performed based on the low-resolution channel images, and the outputs of the n branch networks (i.e., the outputs of the neural network model) are also n frames of low-resolution fusion masks, which correspond to the n frames of channel images one-to-one, and the pixel values therein also represent weights for fusion of the channel images, but have a resolution lower than the fusion mask mentioned in step S122. Referring to fig. 3, the shape of the base feature map is n × c × w × h, the shape of the sub-feature map is 1 × c × w × h, and the shape of the output m _ l of the neural network model is n × 1 × w × h (in the figure, n masks of 1 × 1 × w × h are located below the multi-field module), whose resolution is lower than the original resolution of the channel image without exception.
Since the n frames of low-resolution fusion masks are not convenient for weighted fusion, it is finally necessary to perform upsampling on the n frames of low-resolution fusion masks to obtain n frames of fusion masks (i.e., the fusion masks mentioned in step S122) having the same resolution as the channel images, and then perform the subsequent fusion step S123. In fig. 3, the upsampling operation is performed by a filtering/upsampling module (regarding the role of "filtering", explained later), and it is easily seen that after upsampling, the shape of the mask becomes 1 × 1 × W × H, i.e. the original resolution has been restored. When the up-sampling is carried out, the n frames of low-resolution fusion mask can be taken as a whole to be input into a filtering/up-sampling module, and the module can complete the up-sampling of the n frames of images in batches.
In the implementation manner, the channel image of the target channel is firstly downsampled to reduce the resolution, and then is subjected to subsequent processing until the fusion mask with the low resolution is obtained, and then the fusion mask with the original resolution is obtained through upsampling. Therefore, most of operations in the whole mask calculation process can be performed on the image with low resolution, so that the mask calculation efficiency is obviously improved, and the image fusion efficiency is further improved.
Further, the inventors have found that the upsampling operation may cause a reduction in mask quality to some extent, and therefore, in an alternative, when upsampling the n frames of low-resolution fusion mask, the upsampling operation may also perform edge-preserving smoothing filtering on the n frames of low-resolution fusion mask, so as to improve the quality of the obtained n frames of fusion mask. Therefore, in fig. 3, an up-sampling module with a filtering function is adopted instead of a simple up-sampling module (of course, a simple up-sampling module is also feasible).
Edge-preserving smoothing filtering is a class of image filtering algorithms, such as bilateral filtering and guided filtering, which can preserve edge information in an image while smoothing the image content. The filtering may be performed before upsampling; for example, in fig. 3, m_l may first be guided-filtered (with x_l selected as the guide map), and the filtering result is then upsampled to obtain m_h.
Alternatively, filtering and upsampling may be performed simultaneously ("simultaneously" here means that the upsampling and filtering processes are intertwined with no clear boundary between them, not that they run "in parallel"). For example, the filtering/upsampling module in fig. 3 may be a guided filtering upsampling (GFU) module. The GFU module takes 3 inputs, x_l, m_l, and x_h, and outputs m_h: two low-resolution mapping matrices a and b are first calculated inside the GFU module from x_l and m_l, these are then upsampled to obtain two original-resolution mapping matrices A and B, and finally x_h is used as the guide map and mapped to m_h by the formula m_h = A \odot x_h + B. Because the guide map in the GFU module is the higher-resolution x_h rather than x_l, the loss of edge information caused by upsampling can be greatly reduced.
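A minimal sketch of such guided-filter-style joint upsampling, written from the description above (the box-filter size, epsilon, and the use of cv2.boxFilter/cv2.resize are assumptions, not details given by the application):

```python
import cv2
import numpy as np

def guided_filter_upsample(x_l, m_l, x_h, ksize=8, eps=1e-4):
    """Joint upsampling of a low-resolution mask m_l guided by the low-resolution
    image x_l and the full-resolution image x_h (all float32, single channel).
    Computes low-resolution mapping matrices a, b, upsamples them to A, B,
    and maps the full-resolution guide: m_h = A * x_h + B."""
    box = lambda img: cv2.boxFilter(img, ddepth=-1, ksize=(ksize, ksize))
    mean_x, mean_m = box(x_l), box(m_l)
    cov_xm = box(x_l * m_l) - mean_x * mean_m
    var_x = box(x_l * x_l) - mean_x * mean_x
    a = cov_xm / (var_x + eps)           # low-resolution mapping matrix a
    b = mean_m - a * mean_x              # low-resolution mapping matrix b
    H, W = x_h.shape
    A = cv2.resize(a, (W, H), interpolation=cv2.INTER_LINEAR)
    B = cv2.resize(b, (W, H), interpolation=cv2.INTER_LINEAR)
    return A * x_h + B                   # full-resolution fusion mask m_h
```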
It should be understood that the down-sampling mechanism and the up-sampling mechanism in the image fusion process are used together, and if the down-sampling is not performed, the up-sampling is naturally not necessary. For example, in some implementations different from that shown in fig. 3, the down-sampling module is not included in the main network, and it is not necessary to provide the filtering/up-sampling module after the branch network, and the branch network can directly output the n-frame fusion mask with the same resolution as the channel image.
In the following, on the basis of the above embodiment, the attention mechanism in the image fusion method is described:
in one implementation, the neural network model used in the image fusion method has a mechanism of attention. When a person observes an image, different attention degrees are shown to different parts in the image, namely, important parts have higher attention degree and unimportant parts have lower attention degree. Thus, for a specific image processing task (e.g., classification, segmentation, fusion, recognition), features of different parts in an image differ in importance, and the so-called attention mechanism is to express the difference in importance of features of different parts in an image by some technical means. For example, features in different parts of an image are given different weights, a feature given a greater weight indicating a higher importance, and a feature given a lesser weight indicating a lower importance.
In a neural network model, an attention mechanism can be implemented with a two-branch structure feeding a multiplier: one branch passes the feature map to the multiplier, the other branch calculates weights for the features of different parts of the feature map and passes those weights to the multiplier, and the multiplier performs the weighting of the features. How the different parts of the feature map are divided determines the type of attention mechanism.
In the solution of the present application, the feature map has 4 dimensions. Taking the basic feature map in fig. 3 as an example, its shape is n × c × w × h, where n is the frame dimension, c the channel dimension, w the abscissa dimension, and h the ordinate dimension. The dimensions w and h can also be regarded together as a single spatial dimension, in which case the feature map has 3 dimensions in total. Accordingly, the following three attention mechanisms are possible in the solution of the present application:
channel attention, i.e., the attention mechanism of channel dimensions, refers to giving corresponding weights to data in different channels of the feature map;
frame attention, namely an attention mechanism of frame dimensions, means that corresponding weights are given to data in different frames of a feature map;
spatial attention, i.e., the attention mechanism of the spatial dimension, refers to giving corresponding weights to data at different spatial positions of the feature map.
In the neural network model, the attention mechanism may be introduced into the backbone network, the branch networks, or both. The backbone network may have at least one of a channel attention mechanism, a frame attention mechanism, and a spatial attention mechanism, while the branch networks may have at least one of a channel attention mechanism and a spatial attention mechanism. Referring to fig. 3, because the input sub-feature map of each branch network contains only one frame, and frame attention by definition weights data in different frames, frame attention does not need to be considered for the branch networks.
Adding different attention mechanisms to the neural network improves the accuracy of the calculated fusion masks, but it should be understood that the image fusion method provided by the application can still be realized without adding any attention-related structure to the neural network model. Specific examples of how the attention mechanism can be introduced into the backbone network and/or the branch networks are given below.
In one implementation, the backbone network includes a multi-attention module, which includes a channel attention unit and a frame attention unit connected in sequence, as shown in fig. 4. The location of the multi-attention module in the backbone network is not limited; in fig. 3, for example, it is placed after the down-sampling module.
The input feature map of the channel attention unit is the input feature map of the multi-attention module (the "feature map" here should be understood broadly; for example, if the multi-attention module is disposed at the very start of the backbone network, the images to be fused are also regarded as feature maps). The channel attention unit is configured to calculate the weights of the input feature map of the multi-attention module in the channel dimension and to multiply the calculated weights with the data in different channels of that input feature map, obtaining the output feature map of the channel attention unit. Comparing this with the definition above, the channel attention unit possesses channel attention.
Referring to fig. 4, the input feature map of the channel attention unit has a shape of n × c × w × h (convolution layers are disposed between the down-sampling module and the multi-attention module to increase the number of channels from 1 to c, but they are not shown in fig. 3). One branch of the channel attention unit feeds the feature map directly to the multiplier; the other branch calculates, based on the feature map, a weight of shape n × c × 1 × 1, i.e., one weight per channel of the feature map. The multiplier multiplies the data in each channel of the feature map by the weight corresponding to that channel, so that data in more important channels receive larger weights and data in less important channels receive smaller weights.
Further, the branch of the channel attention unit that calculates the weights may adopt, but is not limited to, the structure in fig. 5, which consists of a global pooling layer, a first fully-connected layer, a ReLU function, a second fully-connected layer, and a Sigmoid function connected in sequence. The input of the global pooling layer is the input of the channel attention unit, and the output of the Sigmoid function is the weight to be calculated by this branch. The right side of fig. 5 shows the shape of the feature map output by each layer/function: the first fully-connected layer changes the number of channels of its input feature map from c to c/r, the second fully-connected layer changes it from c/r back to c, and r is a preset hyper-parameter.
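As an illustration only, the following PyTorch-style sketch shows one possible implementation of a channel attention unit with the fig. 5 weight branch. It is not the code of the present application: the class name ChannelAttention, the use of global average pooling for the "global pooling layer", the default r = 4, and the max(c/r, 1) safeguard are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention unit sketch: global pooling -> FC (c -> c/r) -> ReLU ->
    FC (c/r -> c) -> Sigmoid, then per-channel multiplication (fig. 5 structure)."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        hidden = max(channels // r, 1)            # keep at least one hidden unit
        self.pool = nn.AdaptiveAvgPool2d(1)       # global pooling: (n, c, w, h) -> (n, c, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),          # first fully-connected layer: c -> c/r
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),          # second fully-connected layer: c/r -> c
            nn.Sigmoid(),                         # one weight in (0, 1) per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        weight = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)  # shape n x c x 1 x 1
        return x * weight                         # multiplier: weight the data in each channel
```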
The input feature map of the frame attention unit is the output feature map of the channel attention unit. The frame attention unit is configured to transpose the channel dimension and the frame dimension of its input feature map, calculate the weights of the transposed feature map in the channel dimension, multiply the calculated weights with the data in different channels of the transposed feature map to obtain a weighted feature map, and then transpose the channel dimension and the frame dimension of the weighted feature map again to obtain the output feature map of the frame attention unit, which is also the output feature map of the multi-attention module.
Although the frame attention unit weights data in different channels, this weighting is performed after the transposition. The transposition swaps the original channel dimension and frame dimension of the feature map, so weighting the data in different channels of the transposed feature map is equivalent to weighting the data in different frames of the feature map before transposition. By the definition of frame attention, the frame attention unit therefore possesses frame attention.
Referring to fig. 4, the input feature map of the frame attention unit has a shape of n × c × w × h; after transposition, its shape becomes c × n × w × h. One branch of the frame attention unit feeds the feature map directly to the multiplier; the other branch calculates, based on the feature map, a weight of shape c × n × 1 × 1, i.e., one weight per channel of the transposed feature map (the number of channels is now n). The multiplier multiplies the data in each channel of the feature map (actually the data in the original frames) by the weight corresponding to that channel, so that data in more important channels receive larger weights and data in less important channels receive smaller weights. Finally, the dimensions of the feature map are restored to their original order by transposing again.
It is readily seen that, by transposing, the frame attention unit can reuse the structure of the channel attention unit, which simplifies the network design. The branch of the frame attention unit that calculates the weights may adopt, but is not limited to, the structure in fig. 5, and is not repeated here.
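Continuing the sketch above (and reusing its imports and ChannelAttention class), the frame attention unit and the multi-attention module could then be written as follows. The class names, and the assumption that the frame dimension is the leading dimension of the n × c × w × h feature map, are illustrative only.

```python
class FrameAttention(nn.Module):
    """Frame attention unit sketch: transpose the frame and channel dimensions, reuse the
    channel attention structure, then transpose back (fig. 4)."""

    def __init__(self, num_frames: int, r: int = 4):
        super().__init__()
        # After transposition the "channel" dimension has size n, the number of frames.
        self.attend = ChannelAttention(num_frames, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.transpose(0, 1)        # (n, c, w, h) -> (c, n, w, h)
        y = self.attend(y)           # weights of shape c x n x 1 x 1, one per original frame
        return y.transpose(0, 1)     # restore the original dimension order


class MultiAttention(nn.Module):
    """Multi-attention module sketch: channel attention unit followed by frame attention unit."""

    def __init__(self, num_frames: int, channels: int, r: int = 4):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, r)
        self.frame_attention = FrameAttention(num_frames, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.frame_attention(self.channel_attention(x))
```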
The multi-attention module in this implementation contains both channel attention and frame attention: channel attention emphasizes the differences between features in different channels, and frame attention emphasizes the differences between features in different frames. Placing the multi-attention module in the backbone network shared by the n frames of images to be fused helps to strengthen the exchange and sharing of information across channels and frames, reinforcing the inter-frame relationship and further improving the accuracy of the calculated fusion masks.
It should be understood that, in other implementations, even if channel attention and frame attention are introduced only in the backbone network, the channel attention unit and the frame attention unit do not have to be connected together; they may, for example, be placed separately at different locations in the backbone network.
In one implementation, each branch network includes a multi-receptive-field module, as shown in fig. 3. The position of the multi-receptive-field module in the branch network is not limited; in fig. 3, for example, it is placed at the start of each branch network.
The multi-receptive-field module comprises m feature extraction branches and a feature fusion unit, where m is an integer greater than 1. Each feature extraction branch performs feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, and the feature fusion unit fuses the output feature maps of the m feature extraction branches to obtain the output feature map of the multi-receptive-field module.
Fig. 6 shows one structure of the multi-receptive-field module. In fig. 6, the input of the module is a feature map of shape 1 × c × w × h, and the module includes 5 feature extraction branches (m = 5) and a feature fusion unit. The feature fusion unit includes a splicing structure and an adder: the splicing structure splices the output feature maps of the upper 4 feature extraction branches into a whole, and the adder adds the spliced feature map to the output feature map of the lowest feature extraction branch. The output feature map is the aforementioned low-resolution fusion mask, with a shape of 1 × 1 × w × h.
It should be understood that some structures of the feature fusion unit are omitted from fig. 6: for example, the output feature maps of the upper 4 feature extraction branches may each have a shape of 1 × c × w × h, giving a spliced shape of 1 × 4c × w × h (assuming splicing along the channel dimension), and before being input to the adder a convolution layer may reduce this to 1 × c × w × h (to match the shape of the adder's other input); likewise, the output feature map of the adder originally has a shape of 1 × c × w × h, and a convolution layer may be provided to reduce it to 1 × 1 × w × h before output.
The receptive fields corresponding to the m feature extraction branches in the multi-receptive-field module are different. A "receptive field" can be understood as the area of the original image that corresponds to one pixel in a feature map: the larger the receptive field, the better the features in the feature map describe larger objects in the original image; the smaller the receptive field, the better the features describe smaller objects in the original image. In the present application, the "original image" can be understood as a channel image or an image to be fused, and the receptive field corresponding to a feature extraction branch can be understood as the receptive field of an intermediate feature map of that branch (e.g., the feature map output by the hole convolution layer in fig. 6) or of its output feature map.
Because the multi-receptive-field module contains multiple feature extraction branches with different receptive fields, the features it extracts can describe the objects in the channel image comprehensively across scales. This helps to improve the consistency of spatial information, prevents fusion layering in the channel fusion image, and thereby improves the quality of the fusion result image. The inventors believe that the fusion layering phenomenon arises when the extracted features cannot effectively describe the objects in the channel images, for example when different local regions of the same object are treated as different objects.
How each feature extraction branch realizes different receptive fields is specifically described below:
Mode 1: using hole convolution layers with different hole rates
If m is greater than 2, hole convolution layers are arranged in m-1 of the m feature extraction branches, with the hole rates of these layers different from one another, and the remaining feature extraction branch is a direct connection branch (it passes the input feature map of the multi-receptive-field module through unchanged). In the special case of m = 2, only one feature extraction branch is provided with a hole convolution layer according to the above rule, and its hole rate may be any value other than 1.
For example, in fig. 6, the upper 4 feature extraction branches each contain a hole convolution layer, with hole rates of 2, 4, 8, and 16 (increasing by powers of 2), so these 4 branches have successively increasing receptive fields (all larger than the receptive field of the input feature map of the multi-receptive-field module). The remaining branch is a direct connection branch, whose receptive field equals that of the input feature map of the multi-receptive-field module, giving 5 feature extraction branches with different receptive fields. In addition, because the remaining branch is a direct connection branch, a residual structure is formed inside the multi-receptive-field module, which is conducive to improving the calculation accuracy of the fusion mask.
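For illustration, a parallel multi-receptive-field module along the lines of fig. 6 and mode 1 might be sketched as follows (reusing the imports above). The hole (dilation) rates 2, 4, 8, 16 follow the example just given; the 1 × 1 channel-reduction convolutions correspond to the structures noted as omitted from fig. 6; spatial attention is left out here and sketched separately after fig. 7 is discussed. Class and parameter names are assumptions, not taken from the application.

```python
class MultiReceptiveField(nn.Module):
    """Parallel multi-receptive-field module sketch: four hole-convolution branches with
    hole rates 2, 4, 8, 16 plus a direct connection branch, spliced, reduced and added."""

    def __init__(self, channels: int, rates=(2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates                          # successively larger receptive fields
        ])
        # Reduce the spliced 4c channels back to c so they can be added to the direct branch.
        self.reduce = nn.Conv2d(channels * len(rates), channels, kernel_size=1)
        # Reduce to a single channel to output the low-resolution fusion mask.
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]   # each: (1, c, w, h)
        fused = self.reduce(torch.cat(feats, dim=1))      # splice along channels, reduce to c
        out = fused + x                                   # adder: residual with the direct branch
        return self.to_mask(out)                          # (1, 1, w, h) low-resolution fusion mask
```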
Alternatively, the direct connection branch may be replaced by a branch containing only a common convolution layer (which can be regarded as a special hole convolution layer with a hole rate of 1); optionally, the kernel size of this common convolution may be kept consistent with the hole convolutions in the other feature extraction branches.
As another alternative, hole convolution layers may be arranged in all m feature extraction branches, with their hole rates different from each other; this also realizes multiple receptive fields.
Furthermore, an attention mechanism can be introduced into the multi-receptive-field module to improve the calculation accuracy of the fusion mask. Take the case where m-1 feature extraction branches contain a hole convolution layer and the remaining branch is a direct connection branch, and assume the attention introduced is spatial attention: since the direct connection branch contains no network structure, the spatial attention is added to the other m-1 feature extraction branches. Specifically, after each of these branches performs its hole convolution, it further calculates the weights of the output feature map of its hole convolution layer in the spatial dimension and multiplies the calculated weights with the data at different spatial positions of that output feature map to obtain the branch's own output feature map.
Referring to fig. 6 and taking the uppermost feature extraction branch as an example, the input feature map has a shape of 1 × c × w × h. Assuming the hole convolution does not change the shape, the output feature map of the hole convolution layer is still 1 × c × w × h. The remainder of the feature extraction branch is divided into two sub-branches: one feeds the feature map directly to the multiplier, and the other calculates, based on the feature map, a weight of shape 1 × 1 × w × h, i.e., one weight per spatial position (pixel position) of the feature map. The multiplier multiplies the data at each spatial position of the feature map by the weight corresponding to that position, so that data at more important spatial positions receive larger weights and data at less important positions receive smaller weights.
The sub-branch of the feature extraction branch that calculates the weights may adopt, but is not limited to, the structure in fig. 7, which consists of two pooling layers connected in parallel, followed by a splicing structure, a convolution layer, and a Sigmoid function. The input of the two pooling layers is the input of the sub-branch, and the output of the Sigmoid function is the weight to be calculated by the sub-branch. The pooling operations of the two layers differ: the left pooling layer performs max pooling over the channel dimension (the maximum across channels is taken at each spatial position), and the right pooling layer performs average pooling over the channel dimension (the average across channels is taken at each spatial position). The shape of the feature map output by each layer/structure/function is shown on the right side of fig. 7 (the shape for the left pooling layer is on the left).
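A minimal sketch of such a spatial attention sub-branch, consistent with the fig. 7 description and reusing the imports above, is given below; the convolution kernel size (7 here) is an assumed hyper-parameter that the application does not specify.

```python
class SpatialAttention(nn.Module):
    """Spatial attention sub-branch sketch: channel-wise max pooling and average pooling in
    parallel, splice, convolve to one channel, Sigmoid, then weight each spatial position."""

    def __init__(self, kernel_size: int = 7):        # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_map, _ = x.max(dim=1, keepdim=True)      # max over the channel dimension: (1, 1, w, h)
        avg_map = x.mean(dim=1, keepdim=True)        # average over the channel dimension
        weight = self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * weight                            # multiplier: weight each spatial position
```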
It should be understood that if the remaining feature extraction branch is not a direct connection branch but a branch containing only a common convolution layer, spatial attention may also be added to it; and if all m feature extraction branches contain a hole convolution layer, spatial attention may be added to all of them. It should also be noted that the attention mechanism is not required: after removing it (e.g., removing the weight-calculating sub-branches and the multipliers in fig. 6), the multi-receptive-field module can still perform feature extraction normally.
Mode 2: using common convolution layers with different convolution kernel sizes
If m is greater than 2, common convolution layers are arranged in m-1 of the m feature extraction branches, with the convolution kernel sizes of these layers different from one another, and the remaining feature extraction branch is a direct connection branch. In the special case of m = 2, only one feature extraction branch is provided with a common convolution layer according to the above rule, and its kernel size can be set freely (larger than 1 × 1).
For example, there are 4 feature extraction branches, one of which is a direct connection branch, and the remaining 3 branches may adopt convolution kernels of 3 × 3, 5 × 5, and 7 × 7, respectively.
As another alternative, common convolution layers may be arranged in all m feature extraction branches, with their convolution kernel sizes different from each other; this also realizes multiple receptive fields.
It should be understood that the attention mechanism may also be introduced into the feature extraction branches of mode 2; the specific implementation can refer to mode 1 and is not repeated here.
Note that mode 1 and mode 2 can also be combined, since both the hole rate and the convolution kernel size affect the size of the receptive field. Moreover, the feature extraction branches may obtain different receptive fields through other factors as well, for example convolution layers with different strides, pooling layers with different pooling parameters, and so on.
In the above explanation, both mode 1 and mode 2 are described for a multi-receptive-field module consisting of several feature extraction branches connected in parallel. In some other implementations, however, these feature extraction components may instead be placed in series while still achieving the multi-receptive-field effect. In this case, the multi-receptive-field module includes m feature extraction units connected in sequence (i.e., in series), where m is an integer greater than 1: the first feature extraction unit performs feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, each subsequent feature extraction unit performs feature extraction on the output feature map of the previous unit to obtain its own output feature map, the output feature map of the last feature extraction unit is the output feature map of the multi-receptive-field module, and the receptive fields corresponding to the feature extraction units are different.
As is evident from this functional description, each feature extraction unit is roughly equivalent to one of the feature extraction branches described above, and the structure of a feature extraction branch (including the option of adding attention to it) can be adopted directly when implementing a feature extraction unit; this is not repeated here, and of course the direct connection branch has no corresponding feature extraction unit. In addition, since the feature extraction units are connected in series, no feature fusion unit needs to be arranged in the multi-receptive-field module.
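A very small sketch of this serial variant, again with illustrative hole rates and a final 1 × 1 projection to the mask (and reusing the imports above), could look as follows.

```python
class SerialMultiReceptiveField(nn.Module):
    """Serial multi-receptive-field module sketch: feature extraction units with different
    receptive fields chained one after another; no feature fusion unit is needed."""

    def __init__(self, channels: int, rates=(2, 4, 8, 16)):
        super().__init__()
        self.units = nn.Sequential(*[
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates                      # each unit consumes the previous unit's output
        ])
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.to_mask(self.units(x))      # (1, 1, w, h) low-resolution fusion mask
```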
A multi-receptive-field module with this serial internal structure is likewise conducive to improving the consistency of spatial information, preventing the fusion layering phenomenon in the channel fusion image, and further improving the quality of the fusion result image.
Finally, on the basis of the above embodiment, how the neural network model in the image fusion method is trained is briefly described. The model can be trained in either a supervised or an unsupervised manner.
Taking supervised training as an example first, n training images with different exposure degrees (for simplicity, single-channel images are taken as an example) can be used as one training sample. After the training sample is input into the neural network model, a fusion result image can be calculated using the n fusion masks output by the model; note that since the training images are assumed to be single-channel, the fusion result image is obtained directly after fusion, and these fusion masks and this fusion result image belong to the training stage and should not be confused with the images in step S120. The prediction loss can then be calculated from the fusion result image and the standard result image corresponding to the training sample; the form of the loss function is not limited. The standard result image has a good HDR effect and can be determined in advance, for example manually in combination with other image fusion algorithms. Finally, the parameters of the neural network model can be updated based on the calculated prediction loss, for example using a back propagation algorithm.
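A schematic supervised training step is sketched below. It assumes a hypothetical model that maps the n single-channel training images to n fusion masks, uses a simple normalization of the masks before weighted fusion, and picks an L1 loss purely as an example (the application does not fix the loss form); none of these choices are taken from the application.

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, frames, target):
    """One supervised training step (sketch).
    frames: (n, 1, H, W) training images with different exposure degrees (single-channel).
    target: (1, H, W) standard result image with a good HDR effect."""
    masks = model(frames)                                       # assumed output: (n, 1, H, W)
    weights = masks / (masks.sum(dim=0, keepdim=True) + 1e-8)   # normalize across frames (assumed)
    fused = (weights * frames).sum(dim=0)                       # weighted fusion -> (1, H, W)
    loss = F.l1_loss(fused, target)                             # loss form is not limited; L1 shown
    optimizer.zero_grad()
    loss.backward()                                             # back propagation
    optimizer.step()
    return loss.item()
```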
For unsupervised training, the model parameters can be updated by calculating blind evaluation indexes such as MEF-SSIM; this is not described in detail here.
Fig. 8 shows functional components included in the image fusion apparatus 200 according to the embodiment of the present application. Referring to fig. 8, the image fusion apparatus 200 includes:
the image acquisition component 210 is used for acquiring n frames of images to be fused with incompletely same exposure degrees; wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel;
the image fusion component 220 is configured to fuse n channel images belonging to the same channel in the n to-be-fused images to obtain at least one channel fusion image, and determine a fusion result image according to the at least one channel fusion image; for n frames of channel images belonging to the same target channel in the n frames of images to be fused, calculating a channel fusion image corresponding to the n frames of channel images by the following steps: extracting a basic feature map of the n frames of channel images by using a backbone network in a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps; calculating n frames of fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model; each branch network is used for calculating a frame of fusion mask corresponding to a frame of channel image according to a frame of sub-feature image, and each frame of fusion mask comprises a weight for fusing a corresponding frame of channel image; and performing weighted fusion on the n frames of channel images by using the n frames of fusion masks to obtain the channel fusion images.
In one implementation of the image fusion apparatus 200, the backbone network is a network with attention mechanism including at least one of channel attention, frame attention, and spatial attention; and/or the branch network is a network with an attention mechanism, and the attention mechanism of the branch network comprises at least one of channel attention and space attention; the channel attention means that corresponding weights are given to data in different channels of the feature map, the frame attention means that corresponding weights are given to data in different frames of the feature map, and the spatial attention means that corresponding weights are given to data at different spatial positions of the feature map.
In one implementation of the image fusion apparatus 200, the backbone network includes a multi-attention module, and the multi-attention module includes a channel attention unit and a frame attention unit connected in sequence; the channel attention unit is used for calculating the weight of the input feature map of the multi-attention module in the channel dimension, and multiplying the calculated weight by data in different channels of the input feature map of the multi-attention module to obtain an output feature map of the channel attention unit; the frame attention unit is used for transposing the channel dimension and the frame dimension in the output feature map of the channel attention unit, calculating the weight of the transposed feature map in the channel dimension, multiplying the calculated weight with data in different channels of the transposed feature map to obtain a weighted feature map, and transposing the channel dimension and the frame dimension in the weighted feature map to obtain the output feature map of the multi-attention module.
In one implementation of the image fusion apparatus 200, the branch network includes a multi-receptive-field module, and the multi-receptive-field module includes m feature extraction branches and a feature fusion unit; m is an integer greater than 1, each feature extraction branch is used for performing feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, the receptive fields corresponding to the feature extraction branches are different, and the feature fusion unit is used for fusing the output feature maps of the m feature extraction branches to obtain the output feature map of the multi-receptive-field module.
In one implementation of the image fusion apparatus 200, at least m-1 of the m feature extraction branches comprise a hole convolution layer; wherein, when m is greater than 2, the hole rates of the hole convolution layers in the at least m-1 feature extraction branches are different; and when the number of feature extraction branches comprising a hole convolution layer is m-1, the remaining one of the m feature extraction branches comprises only a common convolution layer or is a direct connection branch.
In one implementation of the image fusion apparatus 200, each of the at least m-1 feature extraction branches is configured to calculate a weight of an output feature map of its own hole convolution layer in a spatial dimension, and multiply the calculated weight with data at different spatial positions of the output feature map of its own hole convolution layer to obtain its own output feature map.
In one implementation of the image fusion apparatus 200, the branch network includes a multi-receptive-field module, and the multi-receptive-field module includes m feature extraction units connected in sequence; m is an integer greater than 1, the first of the m feature extraction units is used for performing feature extraction on the input feature map of the multi-receptive-field module to obtain its own output feature map, each subsequent feature extraction unit is used for performing feature extraction on the output feature map of the previous feature extraction unit to obtain its own output feature map, the output feature map of the last feature extraction unit is the output feature map of the multi-receptive-field module, and the receptive fields corresponding to the feature extraction units are different.
In one implementation of the image fusion apparatus 200, the backbone network is further configured to down-sample the n-frame channel images; the image fusion component 220 calculates n frames of fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model, including: calculating n frames of low-resolution fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model; and performing up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
In an implementation manner of the image fusion apparatus 200, the image fusion component 220 performs upsampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as that of the n frames of channel images, including: and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
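As a simple illustration of the up-sampling step, bilinear interpolation of the n low-resolution fusion masks could be written as below; the edge-preserving smoothing filter (e.g., a guided filter) is only indicated by a comment, since the application does not prescribe a particular filter and a full implementation is outside the scope of this sketch.

```python
import torch.nn.functional as F

def upsample_masks(low_res_masks, height, width):
    """Up-sample n low-resolution fusion masks (n, 1, h', w') to the channel-image
    resolution (n, 1, height, width). An edge-preserving smoothing filter would be
    applied as part of this step; it is omitted here."""
    return F.interpolate(low_res_masks, size=(height, width),
                         mode="bilinear", align_corners=False)
```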
In one implementation of the image fusion apparatus 200, if each frame of the image to be fused includes a plurality of channels, and the plurality of channels are not all target channels, the image fusion component 220 performs weighted fusion on n frames of channel images belonging to the same non-target channel in the n frames of images to be fused by using a fusion mask calculated when the channel images of the target channels are fused.
The implementation principle and technical effects of the image fusion apparatus 200 provided by the embodiment of the present application have been described in the foregoing method embodiments; for the sake of brevity, where the apparatus embodiments omit details, reference may be made to the corresponding contents of the method embodiments.
Fig. 9 shows a structure of an electronic device 300 provided in an embodiment of the present application. Referring to fig. 9, the electronic device 300 includes: a processor 310, a memory 320, and a communication interface 330, which are interconnected and in communication with each other via a communication bus 340 and/or other form of connection mechanism (not shown).
The processor 310 includes one or more (only one is shown), which may be an integrated circuit chip having signal processing capability. The processor 310 may be a general-purpose processor, including a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Network Processor (NP), or another conventional processor; it may also be a dedicated processor, including a Graphics Processing Unit (GPU), a Neural-network Processing Unit (NPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Moreover, when there are a plurality of processors 310, some of them may be general-purpose processors and the others may be dedicated processors.
The memory 320 includes one or more (only one is shown in the figure), which may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 310, as well as possibly other components, may access, read, and/or write data to the memory 320. In particular, one or more computer program instructions may be stored in the memory 320, and may be read and executed by the processor 310 to implement the image fusion method provided by the embodiment of the present application.
The communication interface 330 includes one or more (only one is shown), which may be used to communicate directly or indirectly with other devices for data interaction. The communication interface 330 may include interfaces for wired and/or wireless communication.
It will be appreciated that the configuration shown in fig. 9 is merely illustrative, and the electronic device 300 may include more or fewer components than shown in fig. 9 or have a different configuration. For example, the electronic device 300 may further include a camera for capturing images or video, and either a captured image or a frame of the video may be used as the image to be fused in step S110; as another example, the electronic device 300 may omit the communication interface 330 if it does not need to communicate with other devices.
The components shown in fig. 9 may be implemented in hardware, software, or a combination thereof. The electronic device 300 may be a physical device such as a cell phone, a video camera, a PC, a laptop, a tablet, a server, a robot, etc., or may be a virtual device such as a virtual machine, a container, etc. The electronic device 300 is not limited to a single device, and may be a combination of a plurality of devices or a cluster including a large number of devices.
The embodiment of the present application further provides a computer-readable storage medium, where computer program instructions are stored on the computer-readable storage medium, and when the computer program instructions are read and executed by a processor, the image fusion method provided by the embodiment of the present application is executed. The computer readable storage medium may be embodied as the memory 320 in the electronic device 300 in fig. 9, for example.
The embodiment of the present application further provides a computer program product, where the computer program product includes computer program instructions, and when the computer program instructions are read and executed by a processor, the image fusion method provided by the embodiment of the present application is executed.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. An image fusion method, comprising:
acquiring n frames of images to be fused with incompletely same exposure degrees; wherein n is an integer greater than 1, each frame of image to be fused comprises at least one channel, and the at least one channel comprises a target channel;
fusing n frames of channel images belonging to the same channel in the n frames of images to be fused to obtain at least one frame of channel fused image, and determining a fused result image according to the at least one frame of channel fused image; for n frames of channel images belonging to the same target channel in the n frames of images to be fused, calculating a channel fusion image corresponding to the n frames of channel images by the following steps:
extracting a basic feature map of the n frames of channel images by using a backbone network in a pre-trained neural network model, and splitting the basic feature map into n frames of sub-feature maps;
calculating n frames of fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model; each branch network is used for calculating a frame of fusion mask corresponding to a frame of channel image according to a frame of sub-feature map, and each frame of fusion mask comprises a weight for fusing the corresponding frame of channel image;
and performing weighted fusion on the n frames of channel images by using the n frames of fusion masks to obtain the channel fusion images.
2. The image fusion method of claim 1, wherein the backbone network is a network with attention mechanism comprising at least one of channel attention, frame attention, and spatial attention; and/or the branch network is a network with an attention mechanism, and the attention mechanism of the branch network comprises at least one of channel attention and space attention;
the channel attention means that corresponding weights are given to data in different channels of the feature map, the frame attention means that corresponding weights are given to data in different frames of the feature map, and the spatial attention means that corresponding weights are given to data at different spatial positions of the feature map.
3. The image fusion method according to claim 2, wherein the backbone network comprises a multi-attention module, the multi-attention module comprising a channel attention unit and a frame attention unit connected in sequence;
the channel attention unit is used for calculating the weight of the input feature map of the multi-attention module in the channel dimension, and multiplying the calculated weight by data in different channels of the input feature map of the multi-attention module to obtain an output feature map of the channel attention unit;
the frame attention unit is used for transposing the channel dimension and the frame dimension in the output feature map of the channel attention unit, calculating the weight of the transposed feature map in the channel dimension, multiplying the calculated weight with data in different channels of the transposed feature map to obtain a weighted feature map, and transposing the channel dimension and the frame dimension in the weighted feature map to obtain the output feature map of the multi-attention module.
4. The image fusion method according to any one of claims 1-3, wherein the branch network comprises a multi-receptive-field module comprising m feature extraction branches and a feature fusion unit;
wherein m is an integer greater than 1, each feature extraction branch is used for performing feature extraction on the input feature map of the multi-receptive-field module to obtain an output feature map of the feature extraction branch, the receptive fields corresponding to the feature extraction branches are different, and the feature fusion unit is used for fusing the output feature maps of the m feature extraction branches to obtain the output feature map of the multi-receptive-field module.
5. The image fusion method of claim 4, wherein at least m-1 of the m feature extraction branches comprise a hole convolution layer;
wherein, when m is greater than 2, the hole rates of the hole convolution layers in the at least m-1 feature extraction branches are different;
when the number of feature extraction branches comprising a hole convolution layer is m-1, the remaining one of the m feature extraction branches comprises only a common convolution layer or is a direct connection branch.
6. The image fusion method of claim 5, wherein each of the at least m-1 feature extraction branches is configured to calculate a weight of the output feature map of its own hole convolution layer in a spatial dimension, and multiply the calculated weight with data at different spatial positions of the output feature map of its own hole convolution layer to obtain its own output feature map.
7. The image fusion method according to any one of claims 1 to 3, wherein the branched network comprises a multi-receptive-field module comprising m feature extraction units connected in sequence;
wherein m is an integer greater than 1, a first feature extraction unit of the m feature extraction units is used for performing feature extraction on the input feature map of the multi-receptive-field module to obtain an output feature map of the first feature extraction unit, each subsequent feature extraction unit is used for performing feature extraction on the output feature map of the previous feature extraction unit to obtain an output feature map of its own, the output feature map of the last feature extraction unit is the output feature map of the multi-receptive-field module, and the receptive fields corresponding to the feature extraction units are different.
8. The image fusion method according to any one of claims 1-7, wherein the backbone network is further configured to down-sample the n-frame channel images;
the calculating of the n frames of fusion masks corresponding to the n frames of channel images by using the n branch networks in the neural network model includes:
calculating n frames of low-resolution fusion masks corresponding to the n frames of channel images by using n branch networks in the neural network model;
and performing up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
9. The image fusion method of claim 8, wherein the upsampling the n-frame low-resolution fusion mask to obtain the n-frame fusion mask with the same resolution as the n-frame channel image comprises:
and performing edge-preserving smooth filtering and up-sampling on the n frames of low-resolution fusion masks to obtain the n frames of fusion masks with the same resolution as the n frames of channel images.
10. The image fusion method according to any one of claims 1 to 9, wherein if each frame of the images to be fused includes a plurality of channels, and the plurality of channels are not all target channels, n frames of channel images belonging to the same non-target channel among the n frames of images to be fused are subjected to weighted fusion using a fusion mask calculated when the channel images of the target channels are fused.
11. A computer program product comprising computer program instructions which, when read and executed by a processor, perform the method of any one of claims 1 to 10.
12. A computer-readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the method of any one of claims 1-10.
13. An electronic device, comprising: a processor and a memory, the memory having stored therein computer program instructions which, when read and executed by the processor, perform the method of any of claims 1-10.
CN202210163313.3A 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device Pending CN114708172A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210163313.3A CN114708172A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210163313.3A CN114708172A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
CN114708172A true CN114708172A (en) 2022-07-05

Family

ID=82166442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210163313.3A Pending CN114708172A (en) 2022-02-22 2022-02-22 Image fusion method, computer program product, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN114708172A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403064A (en) * 2023-06-07 2023-07-07 苏州浪潮智能科技有限公司 Picture processing method, model, basic block structure, device and medium
CN116403064B (en) * 2023-06-07 2023-08-25 苏州浪潮智能科技有限公司 Picture processing method, system, equipment and medium
CN117692795A (en) * 2024-02-04 2024-03-12 苏州苏映视图像软件科技有限公司 Depth image fusion system and method
CN117692795B (en) * 2024-02-04 2024-05-03 苏州苏映视图像软件科技有限公司 Depth image fusion system and method

Similar Documents

Publication Publication Date Title
EP3937481A1 (en) Image display method and device
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
CN112446398B (en) Image classification method and device
CN113284054B (en) Image enhancement method and image enhancement device
CN112446834B (en) Image enhancement method and device
CN110188795A (en) Image classification method, data processing method and device
CN114119378A (en) Image fusion method, and training method and device of image fusion model
CN112446380A (en) Image processing method and device
CN113011562B (en) Model training method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112541877B (en) Defuzzification method, system, equipment and medium for generating countermeasure network based on condition
CN112581379A (en) Image enhancement method and device
CN114708172A (en) Image fusion method, computer program product, storage medium, and electronic device
CN111832592A (en) RGBD significance detection method and related device
CN111797881A (en) Image classification method and device
CN112257759A (en) Image processing method and device
CN110222718A (en) The method and device of image procossing
CN113592726A (en) High dynamic range imaging method, device, electronic equipment and storage medium
CN114220126A (en) Target detection system and acquisition method
CN113095470A (en) Neural network training method, image processing method and device, and storage medium
CN115861380A (en) End-to-end unmanned aerial vehicle visual target tracking method and device in foggy low-light scene
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN114708173A (en) Image fusion method, computer program product, storage medium, and electronic device
CN113096023A (en) Neural network training method, image processing method and device, and storage medium
CN114708615A (en) Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination