CN115760657A - Image fusion method and device, electronic equipment and computer storage medium


Info

Publication number
CN115760657A
Authority
CN
China
Prior art keywords
image
fusion
channel
sample
network
Prior art date
Legal status
Pending
Application number
CN202110994987.3A
Other languages
Chinese (zh)
Inventor
吴展豪
谢小燕
程宝平
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202110994987.3A
Publication of CN115760657A

Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image fusion method, an image fusion apparatus, an electronic device and a computer storage medium. The method includes: acquiring an initial fusion image and a mask image; performing image channel separation on the initial fusion image to obtain a first channel image and a second channel image; performing image fusion on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image; and performing channel merging on the intermediate fusion image and the second channel image to generate a target fusion image. Because the hue-related second channel image is set aside before fusion and only the first channel image is used in the fusion processing, the image fusion model is prevented from learning hue-related features and changing the hue information of the target fusion image; the illumination and shadow effect of the target object in the target fusion image becomes consistent with that of the background image, which improves the image fusion effect.

Description

Image fusion method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image fusion method and apparatus, an electronic device, and a computer storage medium.
Background
With the gradual maturity and development of Internet technology, more and more people like to use pictures to record and share their daily lives, so the requirements on picture quality and realism keep rising. Many people use image processing software (such as Photoshop or Meitu XiuXiu) to fine-tune and optimize their pictures to improve their aesthetics and overall harmony.
For example, someone who misses a friends' gathering may want their photo added to the group photo of the gathering to make up for the regret; or, to satisfy a bit of vanity, a personal photo may be added next to a scenic spot or an important figure and adjusted until the composite cannot be told from a genuine photo at a glance, purely for entertainment. In addition, many film and video productions require this kind of post-production special-effects processing. All of the above can be achieved to some extent with professional image processing software, but doing so costs a great deal of time and effort. Therefore, how to synthesize sufficiently realistic images efficiently and conveniently has become a pain point for some multimedia applications at the present stage.
Disclosure of Invention
The application provides an image fusion method, an image fusion device, electronic equipment and a computer storage medium, which can improve the image fusion effect.
The technical scheme of the application is realized as follows:
in a first aspect, an embodiment of the present application provides an image fusion method, where the method includes:
acquiring an initial fusion image and a mask image;
carrying out image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
carrying out image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
and carrying out channel merging processing on the intermediate fusion image and the second channel image to generate a target fusion image.
In a second aspect, embodiments of the present application provide an image fusion apparatus including an acquisition unit, a separation unit, a fusion unit, and a merging unit, wherein,
the acquiring unit is configured to acquire an initial fusion image and a mask image;
the separation unit is configured to perform image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
the fusion unit is configured to perform image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
the merging unit is configured to perform channel merging processing on the intermediate fused image and the second channel image to generate a target fused image.
In a third aspect, embodiments of the present application provide an electronic device, which includes a memory and a processor, wherein,
the memory for storing a computer program operable on the processor;
the processor is configured to execute the image fusion method according to the first aspect when the computer program is executed.
In a fourth aspect, the present application provides a computer storage medium storing a computer program, which when executed by at least one processor implements the image fusion method according to the first aspect.
According to the image fusion method and apparatus, the electronic device and the computer storage medium provided by the embodiments of the application, an initial fusion image and a mask image are acquired; image channel separation is performed on the initial fusion image to obtain a first channel image and a second channel image; image fusion is performed on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image; and channel merging is performed on the intermediate fusion image and the second channel image to generate a target fusion image. In this way, before image fusion is performed, the hue-related second channel image is set aside and only the first channel image is used in the fusion processing, which prevents the image fusion model from learning hue-related features and changing the hue information of the target fusion image; the illumination and shadow effect of the target object in the target fusion image becomes consistent with that of the background image, the image fusion effect is improved, and a sufficiently realistic target fusion image can be output.
Drawings
Fig. 1 is a schematic flowchart of an image fusion method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a training method for an image fusion model according to an embodiment of the present disclosure;
fig. 3 is a schematic detailed flowchart of an image fusion method according to an embodiment of the present disclosure;
fig. 4 is a schematic network structure diagram of an image fusion model according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a grid structure of an illumination model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image fusion apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first", "second" and "third" in the embodiments of the present application are only used to distinguish similar objects and do not imply a specific ordering of those objects; it should be understood that "first", "second" and "third" may be interchanged where permitted, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein.
With the gradual maturity and development of Internet technology, more and more people like to use pictures to record and share their daily lives, so the requirements on picture quality and realism keep rising. Many people use image processing software (such as Photoshop or Meitu XiuXiu) to fine-tune and optimize their pictures to improve their aesthetics and overall harmony. For example, someone who misses a friends' gathering may want their photo added to the group photo of the gathering to make up for the regret; or, to satisfy a bit of vanity, a personal photo may be added next to a scenic spot or an important figure and adjusted until the composite cannot be told from a genuine photo at a glance, purely for entertainment. In addition, many film and video productions require this kind of post-production special-effects processing. All of the above can be achieved to some extent with professional image processing software, but doing so costs a great deal of time and effort. Therefore, how to synthesize sufficiently realistic images efficiently and conveniently has become a pain point for some multimedia applications at the present stage.
However, the conventional approach usually only adds the target object onto the background image and then blurs the boundary between the target object and the adjacent pixels of the background image to approximate a fusion effect. This only produces a convincing result when the target object and the background image were captured under sufficiently similar lighting conditions; in most cases the result looks very obtrusive.
In the related art, an image synthesis method has been proposed: first, a first target area image is acquired from a target area image group, and a first background image is acquired from a background image group; next, the domain of the first target area image and the domain of the first background image are determined by a domain judgment model; according to the two domains, the first target area image and the first background image are converted into the same domain by a domain conversion model to obtain at least one of a second target area image and a second background image, so that the foreground and the background are transferred into the same domain, and the picture is finally synthesized. This method adjusts and optimizes the fused image by judging whether the target image and the background image in the pre-fused image belong to the same domain, but in general it can only bring the attributes of the two domains as close as possible by adjusting hue, saturation and overall brightness; it does not account for the fact that the two images may have been lit by different illumination sources and therefore carry different shadows, which degrades the quality of the finally generated image. In addition, during training this method easily learns unnecessary hue details, such as changing the color of clothes from blue to black; such changes do not help the image fusion task and make the model difficult to converge.
Based on this, the embodiments of the present application provide an image fusion method that makes the illumination and shadows of the target object consistent with those of the background image in the target fusion image, so that the fusion result is more realistic; in addition, removing the hue channel prevents the model from learning unnecessary hue details, so model training converges more easily. In other words, the method achieves end-to-end image fusion with adjustable shadows: only the target image to be fused and the background image need to be input, and a sufficiently realistic target fusion image is output.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, referring to fig. 1, a flowchart of an image fusion method provided in an embodiment of the present application is shown. As shown in fig. 1, the method may include:
s101, acquiring an initial fusion image and a mask image.
S102, carrying out image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image.
It should be noted that the image fusion method provided in the embodiment of the present application may be applied to an image fusion device or an electronic device integrated with the image fusion device. Here, the electronic device may be, for example, a computer, a smart phone, a tablet computer, a notebook computer, a palm top computer, a Personal Digital Assistant (PDA), a navigation device, a wearable device, a server, and the like, which are not particularly limited in this embodiment.
It should be noted that the mask image masks the region of the initial fusion image outside the target object, so that in the subsequent image fusion processing the relevant operations act mainly on the target object while the background content of the image remains essentially unchanged. The initial fusion image specifically refers to an image in which the target object has been added directly onto the background image without fusion processing (it should be understood that "without fusion processing" here means without the processing that makes the image transition look natural; some other processing, such as resizing the target object to a proper size, may still have been applied). That is, the embodiments of the present application target usage scenarios in which a target object needs to be fused into a background image.
In some embodiments, the image channels may include a Hue (H) channel, a Saturation (S) channel, and a brightness (V) channel; the first channel image represents an SV channel image composed of a saturation channel and a brightness channel, and the second channel image represents an H channel image of a hue channel.
It can be understood that, during image fusion, the hue difference between the target object and the background image is very small and hardly affects the final fusion result; the main cause of inconsistent image style and lighting/shadow effects is the change in saturation and brightness, brightness most of all, with saturation varying slightly as brightness changes. Therefore, channel separation is performed on the initial fusion image to split it into a first channel image and a second channel image, where the first channel image is the SV channel image composed of the saturation channel and the brightness channel, and the second channel image is the H channel image containing only the hue channel.
Further, since the initial fusion image is usually not in HSV format (for example, it is usually in Red, Green, Blue (RGB) format or Blue, Green, Red (BGR) format), a format conversion of the initial fusion image is required. Thus, in some embodiments, before performing the image channel separation processing on the initial fusion image, the method may further include: performing image format conversion on the initial fusion image so that the converted initial fusion image is an image in a target format.
It should be noted that, in the embodiment of the present application, the target format specifically refers to an HSV format.
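For illustration only, the format conversion and channel separation described above might be implemented along the lines of the following OpenCV/NumPy sketch; the function and variable names are not from the patent and are merely assumptions.

```python
import cv2
import numpy as np

def separate_channels(initial_fused_bgr: np.ndarray):
    """Convert a BGR initial fusion image to the target HSV format and split it into
    the first channel image (S + V) and the second channel image (H)."""
    hsv = cv2.cvtColor(initial_fused_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    sv_image = np.stack([s, v], axis=-1)   # first channel image: saturation + brightness
    h_image = h                            # second channel image: hue, kept for later merging
    return sv_image, h_image
```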
In some embodiments, the acquiring the initial fusion image and the mask image may include:
acquiring a target image to be fused and a background image;
obtaining the mask image according to the region outside the target object in the target image to be fused;
and obtaining the initial fusion image according to the target object in the target image to be fused and the background image.
It should be noted that, for example, a person needs to be merged into a group photo, then the original image where the person is located is the target image to be merged, the person is the target object, and the group photo is the background image. According to the method and the device, when the mask image is determined, the region outside the target object is determined as the mask region to obtain the mask image, so that unnecessary change of the background image in the subsequent fusion process is avoided.
It should be noted that, since only the target object in the target image to be fused needs to be fused in the background image, in order to obtain the initial fused image through the target object and the background image, it may be assumed that the region of the target image to be fused other than the target object is blank pixels.
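As a rough illustration of this step, the following sketch pastes the target object directly onto the background image and builds a mask over the region outside the object; the placement coordinates, the availability of an object mask, the mask value convention and all names are assumptions, not part of the patent.

```python
import numpy as np

def build_inputs(target_img: np.ndarray, object_mask: np.ndarray,
                 background: np.ndarray, top: int, left: int):
    """Paste the target object onto the background without blending to form the
    initial fusion image, and build the mask image covering the region outside
    the target object. object_mask is assumed given: 1 on the object, 0 elsewhere."""
    h, w = target_img.shape[:2]
    obj = object_mask.astype(bool)
    initial_fused = background.copy()
    roi = initial_fused[top:top + h, left:left + w]
    roi[obj] = target_img[obj]                                   # add the target object directly
    mask_image = np.ones(background.shape[:2], dtype=np.uint8)   # 1 = region outside the object (assumed convention)
    mask_image[top:top + h, left:left + w][obj] = 0
    return initial_fused, mask_image
```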
S103, carrying out image fusion processing on the first channel image and the mask image by using the image fusion model to obtain an intermediate fusion image.
After the first channel image and the second channel image are obtained, the first channel image and the mask image may be input to the image fusion model together, so that the intermediate fusion image is output by the image fusion model.
In some embodiments, the image fusion model may include a first feature network, a stitching network, and a second feature network; correspondingly, the performing image fusion processing on the first channel image and the mask image by using the image fusion model to obtain an intermediate fusion image may include:
performing feature extraction processing on the first channel image and the mask image by using a first feature network to obtain a multi-scale and multi-dimensional feature image;
carrying out array splicing processing on the first channel image and the mask image by using a splicing network to obtain a spliced image;
and performing attention learning and feature operation processing on the feature images and the spliced images by using a second feature network to obtain the intermediate fusion image.
It should be noted that the image fusion model may include a first feature network, a stitching network and a second feature network. The first feature network performs feature extraction on the first channel image and the mask image input to it to obtain multi-dimensional, multi-scale feature images; the stitching network performs array stitching on the first channel image and the mask image input to it to obtain a stitched image; and the second feature network performs attention learning and feature operation processing on the feature images output by the first feature network and the stitched image output by the stitching network to obtain the intermediate fusion image. The attention learning mainly covers illumination and shadow feature learning and feature fusion processing, and can be implemented by arranging an attention module in the second feature network. Because illumination and shadow features are learned, the lighting and shadow effect of the target object in the obtained intermediate fusion image is consistent with that of the background image, making the fusion image natural and realistic.
Further, in some embodiments, the first feature network may be a High-Resolution Network (HR-Net), and the stitching network may be a concatenation (Concat) network.
It should be noted that, by using the HR-Net network to perform the feature extraction processing on the first channel image and the mask image, a multi-scale and multi-dimensional feature image can be obtained, and the specific number of the feature images is related to the number of convolutions of the HR-Net network. Therefore, the multi-scale and multi-dimensional feature image has richer detail features and overall features, so that the subsequent second feature network can better learn the related features such as the illumination shadow.
It should be further noted that the Concat network is used to perform array stitching on the first channel image and the mask image, and the obtained stitched image may be denoted R1, with size H × W × 3.
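The following sketch illustrates how the first feature network and the stitching network could be wired together in PyTorch; the HR-Net backbone is treated as a black box, and feeding it the concatenated SV image and mask is an assumption of this sketch rather than something the patent specifies.

```python
import torch
import torch.nn as nn

class FirstStage(nn.Module):
    """Sketch of the first feature network + stitching network: a multi-scale feature
    extractor standing in for HR-Net, plus array stitching of the SV channel image and
    the mask image into R1 (H x W x 3: S, V and mask channels)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone   # assumed: returns multi-scale, multi-dimensional feature maps

    def forward(self, sv_image: torch.Tensor, mask: torch.Tensor):
        # sv_image: (N, 2, H, W) saturation + brightness; mask: (N, 1, H, W)
        stitched = torch.cat([sv_image, mask], dim=1)   # R1, size H x W x 3
        features = self.backbone(stitched)              # first feature network output
        return features, stitched
```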
Further, for the second feature network, in some embodiments, the second feature network may include a feature subnetwork, a first convolution module, a second convolution module, and a feature operation module;
correspondingly, the performing attention learning and feature operation processing on the feature image and the stitched image by using the second feature network to obtain an intermediate fusion image may include:
performing feature extraction processing on the feature image and the spliced image through a feature sub-network to obtain an attention image;
performing convolution processing on the attention map image through a first convolution module to obtain a generated image;
performing convolution processing on the attention map image through a second convolution module to obtain an attention mask image, and obtaining a reverse attention mask image according to the attention mask image;
and performing characteristic operation processing on the generated image, the attention mask image, the reverse attention mask image and the first channel image through a characteristic operation module to obtain an intermediate fusion image.
It should be noted that, the embodiment of the present application may introduce an attention module in the feature sub-network, and the attention mechanism of the attention module may make the network pay more attention to learning some important features of the image, such as features of illumination, shadow, etc.; in this way, the characteristic subnetwork learns the illumination characteristics of the input characteristic image and the spliced image, so that the illumination shadow effect of the target image in the subsequent generated image and the illumination shadow effect of the background image tend to be consistent.
The attention image output by the feature sub-network is convolved by the first convolution module to obtain the generated image, and convolved by the second convolution module to obtain the attention mask image (also called the attention weight mask M), which indicates the saturation and brightness values that need to be transformed. The attention mask image is then inverted to obtain the reverse attention mask image (also called the reverse attention weight mask (1 − M)). Since the feature sub-network focuses on learning features such as lighting and shadow, both the generated image and the attention mask image capture the important features related to light, shadow and the like.
In contrast, the related art usually only processes the edge pixels between the target object and the background image so that the transition looks natural; but when the target image to be fused and the background image have different lighting and shadow effects (for example, the target image to be fused was shot against the light while the background image was shot in front light), the fusion result is very abrupt. In the image fusion process, the embodiment of the application focuses on learning lighting and shadow features, so a fusion image with a better effect can be obtained.
It is further noted that in some embodiments, the feature sub-network may be a U-Net network; the U-Net network can enable the image to have a better segmentation effect.
It should be further noted that, in some embodiments, the convolution kernel of the first convolution layer is 1 × 1, and the number of channels is 3, which may be represented by Conv 1 × 1 × 3; the convolution kernel of the second convolution layer is 1 × 1, the number of channels is 1, and can be represented by Conv 1 × 1 × 1.
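A minimal sketch of the two convolution heads of the second feature network is given below. The 1 × 1 kernel sizes follow the description; giving the generated image two output channels (S and V) so that it can be combined element-wise with the SV channel image, and applying a sigmoid to the attention mask, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SecondFeatureHeads(nn.Module):
    """Two 1 x 1 convolution heads applied to the attention image from the feature
    sub-network (U-Net): one producing the generated image and one producing the
    attention mask M. The 2-channel generated image and the sigmoid are assumptions."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.gen_head = nn.Conv2d(in_channels, 2, kernel_size=1)   # generated image (S, V)
        self.mask_head = nn.Conv2d(in_channels, 1, kernel_size=1)  # attention mask M

    def forward(self, attention_image: torch.Tensor):
        generated = self.gen_head(attention_image)
        m = torch.sigmoid(self.mask_head(attention_image))         # attention weight mask M
        reverse_m = 1.0 - m                                        # reverse attention mask (1 - M)
        return generated, m, reverse_m
```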
Further, for a feature operation module, in some embodiments, the feature operation module may include a first multiplication module, a second multiplication module, and an addition module;
correspondingly, the performing, by the feature operation module, feature operation processing on the generated image, the attention mask image, the reverse attention mask image, and the first channel image to obtain an intermediate fusion image may include:
carrying out element multiplication processing on the generated image and the attention mask image by using a first multiplication module to obtain a first fusion image;
a second multiplication module is used for carrying out element multiplication processing on the reverse attention mask image and the first channel image to obtain a second fusion image;
and performing element addition processing on the first fusion image and the second fusion image by using an addition module to obtain an intermediate fusion image.
It should be noted that the feature operation module performs multiplication and addition on its input images to finally obtain the intermediate fusion image. Specifically, element-wise multiplication is performed on the generated image and the attention mask image by the first multiplication module of the feature operation module to obtain a first fusion image; element-wise multiplication is performed on the reverse attention mask image and the first channel image by the second multiplication module of the feature operation module to obtain a second fusion image; and then element-wise addition is performed on the first fusion image and the second fusion image by the addition module of the feature operation module to obtain the intermediate fusion image.
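The element-wise operations of the feature operation module amount to a single expression; the sketch below combines the outputs of the previous heads under the assumption that the generated image and the SV channel image have matching shapes.

```python
import torch

def feature_operation(sv_image: torch.Tensor, generated: torch.Tensor,
                      m: torch.Tensor) -> torch.Tensor:
    """Element-wise feature operation: I_pred = I * (1 - M) + I_SV * M.
    m broadcasts over the channel dimension of sv_image and generated."""
    return sv_image * (1.0 - m) + generated * m
```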
And S104, carrying out channel merging processing on the intermediate fusion image and the second channel image to generate a target fusion image.
It should be noted that, after the intermediate fused image is obtained, the fused image and the initially separated second channel image may be subjected to channel merging, so as to finally obtain the target fused image.
It should be further noted that, in the embodiment of the present application, the image fusion model can well learn characteristics such as illumination shadows, so that in the obtained target fusion image, the light and shadow effect of the target object and the light and shadow effect of the background image tend to be consistent; in addition, in the process of image fusion, the influence of an H channel is eliminated, and only the image of an SV channel is used for fusion processing, so that the situation that unnecessary tone information is learned by an image fusion model and the tone information of the fused image is changed is avoided.
In addition, in the embodiment of the application, because the obtained target fusion image is in the HSV format, format conversion can be performed according to actual requirements, for example, the target fusion image is converted into an image in other formats such as an RGB format or a BGR format.
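For completeness, the channel merging and the optional format conversion back to BGR could look like the following OpenCV sketch; it is an illustration under stated assumptions, not the patent's implementation.

```python
import cv2
import numpy as np

def merge_and_convert(intermediate_sv: np.ndarray, h_image: np.ndarray) -> np.ndarray:
    """Channel-merge the intermediate fusion SV image with the hue channel set aside
    during separation, then convert the result back from HSV to BGR.
    Assumes intermediate_sv is already scaled to the same dtype/value range as h_image."""
    s = intermediate_sv[..., 0].astype(h_image.dtype)
    v = intermediate_sv[..., 1].astype(h_image.dtype)
    hsv = cv2.merge([h_image, s, v])                 # target fusion image in HSV format
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)      # optional format conversion
```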
The embodiment of the application provides an image fusion method, which comprises the steps of obtaining an initial fusion image and a mask image; carrying out image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image; carrying out image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image; and carrying out channel merging processing on the intermediate fusion image and the second channel image to generate a target fusion image. In this way, before image fusion, the second channel images related to the color tones are removed, and only the first channel images are used for fusion processing, so that the problems that the image fusion model learns the characteristics related to the color tones, so that the model is difficult to converge, the color tone information of the target fusion image is changed, and the like can be avoided, the illumination shadow effect of the target object in the finally obtained target fusion image and the illumination shadow effect of the background image tend to be consistent, the image fusion effect is improved, and the target fusion image which is real enough can be output.
In another embodiment of the present application, an embodiment of the present application further provides a training method for an image fusion model. Referring to fig. 2, a flowchart of a training method of an image fusion model provided in the embodiment of the present application is shown. As shown in fig. 2, the method may include:
s201, obtaining a training sample set.
S202, training the preset network model by using the training sample set to obtain an image fusion model.
It should be noted that the training sample set for training the preset network model includes at least one group of sample pictures; each group of sample pictures comprises a sample background image, a sample target image to be fused and a sample light source position of the sample background image. The image fusion model in the foregoing embodiment can be obtained by training a preset network model through a training sample set.
It should be further noted that, for the sample light source position, the embodiment of the present application may construct a light source model in which the sample light source position of the sample background image is determined. The light source model may have three mutually perpendicular dimensions, with the value in each dimension representing the illumination position of the image along that dimension, so the light source position may be expressed as a 3 × 1 matrix (a three-dimensional vector).
In some embodiments, for each set of sample pictures, the method may further comprise:
obtaining a sample mask image according to a region outside a target object in a target image to be fused of a sample;
obtaining a sample initial fusion image according to a target object in a sample target image to be fused and a sample background image;
carrying out image format conversion on the sample initial fusion image to obtain a sample initial fusion image in a target format;
correspondingly, the training the preset network model by using the training sample set to obtain the image fusion model may include:
and training the preset network model by using the sample mask image and the sample initial fusion image in the target format to obtain an image fusion model.
It should be noted that, similar to the step of performing image fusion on the first channel image and the mask image in the foregoing embodiment, in the model training phase, it is also necessary to obtain a sample mask image and a sample initial fusion image according to the sample target image to be fused and the sample background image, and perform image format conversion on the sample initial fusion image to obtain a sample initial fusion image in the target format. In the embodiment of the present application, the target format may be an HSV format.
It should be further noted that, similar to the image fusion process, in the model training stage, the initial sample fusion image also needs to be subjected to channel separation processing to obtain a first sample channel image and a second sample channel image; the first channel image of the sample is an SV channel image composed of an S channel and a V channel, and the second channel image of the sample is an H channel image. Thus, taking a set of sample images as an example, the preset network model is mainly trained by using the sample first channel image and the sample mask image to obtain the image fusion model.
Further, in some embodiments, the training the preset network model by using the sample mask image and the sample initial fusion image in the target format to obtain the image fusion model may include:
carrying out image fusion processing on the sample mask image and the sample initial fusion image in the target format through a preset network model to obtain a sample fusion image;
determining a loss function according to the sample light source position of the sample background image and the sample fusion image;
in the process of training the preset network model by using the sample mask image and the sample initial fusion image in the target format, when the loss function reaches a preset loss value, determining the trained preset network model as the image fusion model.
It should be noted that, the process of performing image fusion processing on the sample mask image and the sample initial fusion image in the target format through the preset network model to obtain the sample fusion image can be understood by referring to the process of performing image fusion processing on the mask image and the first channel image through the image fusion model to obtain the fusion image in the foregoing embodiment. In the embodiment of the application, in the model training stage, the feature related to the image light and shadow effect can be mainly learned by the preset network model by adding the attention mechanism.
In addition, the difference between the model training phase and the model application phase (i.e. the phase of actually using the image fusion model to perform image fusion) is that, for the model training phase, the position of the sample light source in the sample background image needs to be additionally input into the preset network model for the subsequent calculation of the loss function.
It should be further noted that, after the sample fusion image is obtained, the loss function is determined according to the sample light source positions of the sample fusion image and the sample background image. Specifically, in the process of training a preset network model by using a sample mask image and a sample initial fusion image in a target format, the training effect of the network model is determined according to a loss function, the network parameters of the preset network model are adjusted according to the loss function, and when the loss function reaches a preset loss value, the trained preset network model is determined as the image fusion model.
Further, in some embodiments, the preset network model may further include a light source location network; the determining the loss function according to the sample light source position of the sample background image and the sample fusion image may include:
determining the light source position of the sample fusion image through a light source position network to obtain the fusion light source position of the sample fusion image;
based on the fusion light source position and the sample light source position, a loss function is determined.
It should be noted that, in the embodiment of the present application, the loss function may be determined based on the sample light source position, the fusion light source position of the sample fusion image, and the sample fusion image, so as to ensure consistency between the light and shadow effect of the output image of the network model and the light and shadow effect of the input background image.
And for the fusion light source position of the sample fusion image, inputting the sample fusion image into a preset network model through a light source position network to be processed, so as to obtain the fusion light source position of the sample fusion image.
It should be further noted that the fusion light source position has the same form as the sample light source position: it is also a three-dimensional (3 × 1) matrix, each dimension of which represents the illumination position in one direction, and it represents the light source position of the sample fusion image in the light source model.
That is to say, the embodiment of the present application may provide a light source model, and before training the model, determine the sample light source position of the sample background image in the light source model; and after the sample fusion image is obtained, determining the fusion light source position of the sample fusion image in the light source model by using the light source determination network.
Further, for the loss function, the determination may be made based on the sample fused image, the fused light source position, and the sample light source position. Specifically, the method is obtained by constructing a sample attention mask image, a sample generation image, a sample first channel image, a fusion light source position and a sample light source position, wherein the sample attention mask represents an attention mask image obtained in a model training stage, the sample generation image represents a generation image obtained in the model training stage, the sample first channel image represents an image formed by an S channel and a V channel obtained by performing channel separation processing on a sample initial image, the fusion light source position represents a light source position determined in a sample fusion image, and the sample light source position represents a light source position determined in a sample background image.
In a specific example, it is preferable to use a normalized Mean Square Error (MSE) for the foreground image as a Loss function (Loss) of the preset network model, and the specific formula is as follows:
L_rec = α × Σ_{h,w} [ M_{h,w} × (Î_{h,w} − I_{h,w})² ] / Σ_{h,w} M_{h,w} + λ × ‖L_pred − L‖²   (1)
where α and λ are empirical parameters (α = 100 and λ = 0.5 may be set according to empirical values); L_rec denotes the loss value of the loss function; (h, w) denotes the coordinates of a pixel point in the image, with h = 0, 1, …, H and w = 0, 1, …, W, where H and W denote the height and width of the image, respectively; M_{h,w} denotes the sample attention mask image; Î_{h,w} denotes the sample generated image; I_{h,w} denotes the sample first channel image; L_pred denotes the fusion light source position; and L denotes the sample light source position.
It should be further noted that, in the embodiment of the present application, when the preset network model is trained, a mask of the target image to be fused of the sample is provided, so that in the learning process, the background image of the sample remains substantially unchanged, and the size of the target object in the target image to be fused of the sample causes fluctuation of different orders of magnitude of the overall error, and thus, a loss function may be set for the target object of the sample serving as the foreground image. And the loss function can use the position of the sample light source and the position of the fusion light source as comparison, and the closer the two positions are, the better the training effect of the model is.
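Since the exact loss formula appears only as an image in the published text, the following PyTorch sketch reconstructs it from the stated definitions: a foreground-normalized mean square error weighted by the sample attention mask, plus a penalty on the distance between the fusion light source position and the sample light source position, with α = 100 and λ = 0.5 as defaults. Treat the precise form as an assumption.

```python
import torch

def fusion_loss(generated: torch.Tensor, sv_image: torch.Tensor, attn_mask: torch.Tensor,
                pred_light: torch.Tensor, gt_light: torch.Tensor,
                alpha: float = 100.0, lam: float = 0.5) -> torch.Tensor:
    """Reconstruction of the described loss: a mean square error between the sample
    generated image and the sample first channel image, normalized over the foreground
    via the sample attention mask, plus a light source position term."""
    weighted_sq_err = attn_mask * (generated - sv_image) ** 2
    rec = weighted_sq_err.sum() / attn_mask.sum().clamp(min=1e-6)  # normalize by foreground weight
    light = (pred_light - gt_light).pow(2).sum()                   # light source position error
    return alpha * rec + lam * light
```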
Additionally in some embodiments, the method may further comprise: and under the condition that the loss function does not reach the preset loss value, continuing to execute the step of training the preset network model by using the sample mask image and the sample initial fusion image in the target format until the loss function reaches the preset loss value.
That is to say, in the embodiment of the present application, if the loss function does not reach the preset loss value, it indicates that the model needs to be trained further, and at this time, the training is continued to be performed on the preset network model by using the sample mask image and the sample initial fusion image in the target format until the loss function reaches the preset loss value.
In addition, in practice, there may be a case that the loss function cannot reach the preset loss value all the time, so that the embodiment of the present application may further set the iteration number, and when the iteration number of the preset network model reaches the preset iteration threshold, the trained preset network model may also be determined as the image fusion model at this time.
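A training-loop sketch that implements the stopping rule described above (preset loss value or preset iteration threshold) is shown below; it reuses the fusion_loss sketch above, the data loader is assumed to yield (sample SV image, sample mask, sample light source position), and the model and light_net interfaces are assumptions.

```python
def train(model, light_net, loader, optimizer,
          max_iters: int = 100_000, target_loss: float = 1e-3):
    """Stop when the loss reaches the preset loss value or the iteration count
    reaches the preset iteration threshold."""
    step = 0
    while True:
        for sv_image, mask, gt_light in loader:
            generated, attn_mask, fused = model(sv_image, mask)   # sample fusion forward pass
            pred_light = light_net(fused)                         # fusion light source position
            loss = fusion_loss(generated, sv_image, attn_mask, pred_light, gt_light)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if loss.item() <= target_loss or step >= max_iters:
                return model
```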
The embodiment of the application provides an image fusion method, which mainly explains the training process of an image fusion model, and obtains the image fusion model by training a preset network model through a training sample set; the influence of a light source and the shadow is fully considered in the training process, the original shadow of the target object can be changed while the images are fused, the output fused image is more real and robust, the model can better learn the characteristics by separating the tone image and adding an attention mechanism, a better output result is obtained, and the accuracy of the model is improved.
In another embodiment of the present application, referring to fig. 3, a detailed flowchart of an image fusion method provided in the embodiment of the present application is shown. Fig. 4 shows a network structure diagram of an image fusion model provided in an embodiment of the present application, and the detailed process may be implemented based on the image fusion model shown in fig. 4, and the detailed process may include:
s301, acquiring a target image to be fused and a background image.
It should be noted that, with reference to the foregoing embodiments, the image fusion model of the embodiment of the application may contain only the first feature network, the stitching network and the second feature network described above; alternatively, the channel separation processing of the initial fusion image and the channel merging processing of the intermediate fusion image and the second channel image may also be placed inside the image fusion model, and even the step of obtaining the mask image and the initial fusion image may be placed inside the model. Since none of these steps involve trainable network parameters, they do not affect the final image fusion effect wherever they are placed; therefore the embodiment of the present application does not specifically limit the concrete network structure of the image fusion model.
Here, the embodiments of the present application will be described in detail by taking an example in which the target image to be fused and the background image are input into the image fusion model, and then the target fusion image is output by the image fusion model.
It should be noted that, for example, a person is merged into another image, the image in which the person is located is the target image to be merged, and the other image is the background image. Here, in the target image to be fused, the portion other than the person is a blank pixel by default (note that the blank pixel does not mean that the pixel value is 0), and in the present embodiment, the person is the target object in the foregoing embodiment.
And S302, obtaining an initial fusion image and a mask image.
It should be noted that, after the target image to be fused and the background image are input into the image fusion model, an initial fusion image and a mask image can be obtained first; the mask image covers other areas except the target object, namely, the target object is mainly processed without basically changing the background image in the embodiment of the application; the initial fusion image is obtained by directly adding the target object into the background image.
It should be further noted that fig. 4 shows the mask image and the initial fusion image; it is obvious from the initial fusion image that the lighting and shadow effects of the target object and the background image are seriously inconsistent and look very abrupt.
And S303, converting the initial fusion image into an HSV format.
S304, obtaining an SV channel image and an H channel image.
It should be noted that, in the embodiment of the present application, after the initial fusion image is obtained, the image format of the initial fusion image needs to be converted into an HSV format. For example: and converting the image format of the initial fusion image from the RGB format or the BGR format into the HSV format.
It should be further noted that channel separation is performed on the initial fusion image in HSV format, so that the H channel image and the SV channel image composed of the S channel and the V channel can be separated from it, and the SV channel image (denoted I) is input into the subsequent network structure for feature extraction. This operation effectively prevents the network from learning unnecessary hue details and from changing the hue of the image, which would cause unnecessary changes. By analyzing the HSV channels, the inventors found that the hue difference between the target image to be fused and the background image is very small and hardly affects the final fusion result; the main cause of inconsistent image style and lighting/shadow effects is the change in saturation and brightness, brightness most of all, with saturation varying slightly as brightness changes.
Therefore, the embodiment of the application uses the SV channel image separated from the initial fusion image for subsequent processing.
S305, obtaining a characteristic image through an HR-Net network.
It should be noted that, in the embodiment of the present application, the SV channel image I and the mask image are input to the HR-Net network, and are subjected to convolution processing by the HR-Net network, so that a multi-scale and multi-dimensional feature image (where the number of the feature images is related to the number of convolution layers of the HR-Net network) can be obtained, so that details of the overall feature image and dimension information are richer, better local and global features can be obtained, and it is beneficial for a subsequent U-Net network to better learn the illumination shadow feature of the background image, so that the illumination shadow effect of the target object in the finally generated target fusion image tends to be consistent with the illumination shadow effect in the background image.
And S306, obtaining a spliced image through a Concat network.
It should be noted that, in the embodiment of the present application, the SV channel image I and the mask image may be array-stitched through the Concat network to obtain a stitched image, denoted R1, with size H × W × 3, where H denotes the height of the image, W denotes the width of the image, and 3 denotes the number of channels, i.e., the S channel, the V channel and the mask channel.
And S307, obtaining an attention mask image and generating an image through a U-Net network.
After the feature images are obtained through the HR-Net network and the stitched images are obtained through the Concat network, the feature images and the stitched images are input into the U-Net network together to obtain the attention mask images and the generated images.
It should be noted that the multi-dimensional feature map output by the HR-Net network is four times smaller than the original image (i.e., the image input to the HR-Net network); for example, if the original size is 224 × 224, the size of the feature image is 112 × 112. Therefore, the feature image and the stitched image are not fed into the U-Net network at the same layer; instead, the feature image is inserted into the vector obtained after the second convolution layer of the U-Net network so that the scales remain consistent.
It should be further noted that, in the embodiment of the present application, an attention module is introduced into the U-Net network, and the image fusion model is made to pay more attention to some important features, such as features related to illumination, by an attention mechanism of the attention module, so that the image fusion result is more excellent.
The output image of the U-Net network may be written as R2, with size H × W × 3. R2 is processed by a 1 × 1 × 3 convolution and a 1 × 1 × 1 convolution, respectively (refer to fig. 4): the 1 × 1 × 3 convolution yields the corresponding generated image (denoted I_SV), and the 1 × 1 × 1 convolution yields the corresponding attention mask image (denoted M), i.e., the attention weight mask image M. The generated image is still an image composed of SV channels, and the attention mask image indicates the saturation and brightness values that need to be transformed.
It will be appreciated that, thanks to the attention module in the U-Net network, both the generated image I_SV and the attention mask image M capture the important features related to lighting, shadow and the like.
S308, obtaining the reverse attention mask image.
It should be noted that, from the attention mask image, a corresponding reverse attention mask image (denoted as: 1-M), that is, a reverse attention weight mask image (1-M), can be generated. Specifically, the reverse attention mask image is the reverse selection of the attention mask image.
S309, generating a fusion image.
It should be noted that the result of multiplying the generated image I_SV by the attention mask image M is added to the result of multiplying the SV channel image I by the reverse attention mask image (1 − M) to obtain the fusion image I_pred. It will be appreciated that the fusion image is still an image composed of the S channel and the V channel (in the application phase of the model, the fusion image is the intermediate fusion image; in the training phase of the model, the fusion image is the sample fusion image). Specifically, see the following formula (2):
I_pred = I × (1 − M) + I_SV × M   (2)
and S3010, combining the channels to generate a target fusion image.
It should be noted that the H channel image separated in step S304 and the generated intermediate fusion image are channel-merged to finally obtain the target fusion image. As can be seen from fig. 4, compared with the initial fusion image, the lighting and shadow effect on the person in the target fusion image is very natural.
S3011, processing the fusion image with convolution layers and a fully connected layer to obtain the fusion light source position.
It should be noted that, in the inference stage of the image fusion model, the foregoing steps S301 to S3010 represent the specific workflow of the image fusion model, i.e., the overall processing of the target image to be fused and the background image. In the process of training the image fusion model, besides processing the target image to be fused and the background image in the preset network model (the model during training is called the preset network model, and the model that has completed training is called the image fusion model), the background light source position L of the background image also needs to be additionally input.
Exemplarily, referring to fig. 5, a schematic diagram of an illumination model provided in an embodiment of the present application is shown. As shown in fig. 5, the depicted image is one whose light source position needs to be determined; x, y and z denote the three dimensions, and the light source position denotes the actual position of the image's light source, quantified in the illumination model.
In the embodiment of the present application, the background light source position L of the background image may be determined from the illumination model. In addition, to facilitate the subsequent calculation of the loss function, the range of each of the three dimensions of the illumination model is limited to 0 to 2; in practical applications, other ranges may also be set as required, which is not specifically limited in the embodiment of the present application. For example, in the illumination model, a light source at position (1, 1) is trained with a completely black map since it lies within the image volume; as another example, for the background image shown in fig. 4, the illumination source is at position (0, 1, 2).
The specific expression of the light source position L is shown in the following formula (3):
L = [i, j, k]^T,  i, j, k ∈ [0, 2]   (3)
where i, j, k denote the position of the light source in the x, y and z dimensions, respectively.
In the process of training the model, after the fusion image is obtained, the generated fusion image is additionally input into a light source position network (which may consist of three convolution layers and a fully connected layer), and the fusion light source position L_pred is output. L_pred is a 3 × 1 matrix of the same form as formula (3) above, representing the illumination position in the three dimensions x, y and z in the fusion image.
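The light source position network is described only as three convolution layers followed by a fully connected layer producing a 3 × 1 position; the sketch below fills in the unspecified details (input channels, widths, strides, pooling) with assumed values.

```python
import torch
import torch.nn as nn

class LightSourceNet(nn.Module):
    """Light source position network sketch: three convolution layers and a fully
    connected layer producing a 3 x 1 position [i, j, k]^T for the fusion image.
    Only the layer count and output size come from the description."""
    def __init__(self, in_channels: int = 2):   # 2 assumed: the fused SV image
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 3)

    def forward(self, fused_image: torch.Tensor) -> torch.Tensor:
        x = self.convs(fused_image).flatten(1)
        return self.fc(x)                        # fusion light source position L_pred
```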
It should be further noted that, during model training, the background light source position L of the background image and the fusion light source position L_pred of the fusion image are both used to check the training effect of the model, with L serving as the ground-truth label.
It should be further noted that, in the embodiment of the present application, the mask of the target image to be fused is provided in the input of the preset network model, so that during the network learning of the preset network model the background image remains essentially unchanged; the target object serves as the foreground image, and its size would otherwise cause the overall error to fluctuate by different orders of magnitude. Therefore, the embodiment of the present application preferably uses the normalized mean square error over the foreground image as the loss function of the preset network model, with the specific formula shown in the foregoing formula (1).
In this way, when the preset network model is trained, the parameters of the preset network model are adjusted according to the loss value Lrec of the loss function, and whether a preset condition is met is determined; the preset condition indicates that the number of iterations of the preset network model reaches a preset iteration threshold, or that the value of the loss function reaches a preset loss value. If the preset condition is met, the newly obtained preset network model is determined as the image fusion model; otherwise, training of the preset network model continues until the preset condition is met, and the image fusion model is obtained.
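The stopping rule described above can be sketched as a compact training loop; the Adam optimizer, the threshold values and the names model, loader and compute_loss are assumptions, with compute_loss standing in for the loss built from formula (1) and the light source supervision.

```python
import itertools
import torch

def train(model, loader, compute_loss, max_iters=100_000, loss_threshold=1e-3, lr=1e-4):
    """Train the preset network model until the preset condition is met, then
    return it as the image fusion model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for iteration, batch in enumerate(itertools.cycle(loader), start=1):
        loss = compute_loss(model, batch)          # loss value Lrec for this batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # preset condition: iteration threshold reached or loss below the preset value
        if iteration >= max_iters or loss.item() <= loss_threshold:
            break
    return model
```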
S3012, outputting the target fusion image (and fusion light source position).
It should be noted that, in the model training phase, the target fusion image and the fusion light source position H_pred are output simultaneously, and the loss function is determined according to H_pred; in the inference stage, that is, after the model has been trained, only the target fusion image needs to be output in actual use.
In summary, the embodiment of the present application provides an image fusion method capable of changing shadows, and the specific flow is shown in fig. 3, and the model network structure is shown in fig. 4.
Firstly, the target image to be fused and the background image are input to obtain an initial fusion image and a mask image. The initial fusion image is converted from a BGR image into an HSV-format image, the image formed by the H channel and the image formed by the SV channels are separated, and the image I formed by the SV channels is input into the subsequent network for feature extraction. Experiments show that this operation effectively prevents the network model from learning unnecessary hue detail information and avoids the unnecessary changes caused when the network alters the hue of the target image during training, which would make the network difficult to converge. HSV channel analysis shows that the hue difference between the target image to be fused and the background image is very small and hardly affects the result; the main causes of inconsistent image style and lighting and shadow effects are changes in saturation and brightness, of which brightness has the greatest influence, while saturation varies slightly along with brightness. Therefore, the embodiment of the present application performs H-channel separation on the initial fusion image. It is worth mentioning that, during training, the light source position L of the background image needs to be additionally input; the specific illumination model is shown in fig. 5, where each dimension ranges between 0 and 2, and the specific expression of L is shown in formula (3).
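As a concrete illustration, the format conversion and channel separation step can be sketched with OpenCV as follows; the variable names are assumptions, and a random array stands in for the real initial fusion image so the snippet runs on its own.

```python
import cv2
import numpy as np

# stand-in for the initial fusion image in BGR format
initial_fusion_bgr = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

hsv = cv2.cvtColor(initial_fusion_bgr, cv2.COLOR_BGR2HSV)   # BGR -> HSV conversion
h_channel, s_channel, v_channel = cv2.split(hsv)

# first channel image I: the SV channel image fed into the subsequent network
sv_image = np.stack([s_channel, v_channel], axis=-1)        # H x W x 2
# second channel image: the hue (H) channel, set aside for the final merge
hue_image = h_channel
```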
Then, the SV channel image I and the mask image are combined and input into an HR-Net network; a multi-scale and multi-dimensional feature image is obtained mainly through the HR-Net network, so that the details and dimensional information of the overall feature image are richer, which helps the U-Net network better learn the illumination and shadow characteristics of the background and keeps the illumination and shadow effect of the target object in the subsequently generated image consistent with that of the background. Meanwhile, the SV channel image I and the mask of the target image to be fused are array-spliced to obtain an image R_1 of size H × W × 3, which is input into the U-Net network. It should be noted that the feature map output by the HR-Net network is 4 times smaller than the original image, so the feature image needs to be inserted into the feature vector after the second convolution layer of the U-Net network to keep the scales consistent.
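The array splicing and the scale-matched insertion of the HR-Net feature map can be sketched as below. Only the tensor shapes and the splice-and-insert pattern follow the description; the toy convolution blocks standing in for HR-Net and the first two U-Net layers, and the assumption that the second U-Net layer also downsamples by a total factor of 4, are purely illustrative.

```python
import torch
import torch.nn as nn

# toy stand-ins for HR-Net and the first two U-Net convolution blocks
hrnet = nn.Sequential(nn.Conv2d(3, 32, 3, stride=4, padding=1), nn.ReLU())        # output 4x smaller
unet_block1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
unet_block2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())

sv_image = torch.randn(1, 2, 256, 256)   # SV channel image I
mask = torch.randn(1, 1, 256, 256)       # mask of the target image to be fused

r1 = torch.cat([sv_image, mask], dim=1)  # array splicing -> R_1 with shape (1, 3, 256, 256)
feat = hrnet(r1)                         # multi-scale features, 4x smaller: (1, 32, 64, 64)

x = unet_block2(unet_block1(r1))         # after the second U-Net convolution: (1, 32, 64, 64)
x = torch.cat([x, feat], dim=1)          # insert the HR-Net features at the matching scale
```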
Then, for the U-Net network, the output result R_2 of size H × W × C (C is the number of channels, here 3) is passed through a 1x1x3 convolution and a 1x1x1 convolution respectively to obtain the generated image I_SV and an attention mask M. The generated image is multiplied by M, and the result is added to the product of the original SV channel image I and the reverse attention mask (1 - M) to obtain the image I_pred. By adding an attention mechanism, the model can be made to pay more attention to certain important features, such as illumination-related features, so that the result is better. The specific formula is shown in formula (2).
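For illustration, a minimal sketch of this attention-blending step in PyTorch is given below. The two 1x1 convolution branches and the combination I_pred = I_SV * M + I * (1 - M) follow the description above; the sigmoid used to bound the mask and the parameterized channel counts are assumptions made to keep the shapes consistent.

```python
import torch
import torch.nn as nn

class AttentionBlend(nn.Module):
    """1x1 convolutions on the U-Net output R2 yield a generated image I_SV and an
    attention mask M; the output is I_pred = I_SV * M + I * (1 - M), as in formula (2)."""
    def __init__(self, in_channels: int = 3, out_channels: int = 3):
        super().__init__()
        self.to_generated = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # "1x1x3" branch
        self.to_mask = nn.Conv2d(in_channels, 1, kernel_size=1)                  # "1x1x1" branch

    def forward(self, r2: torch.Tensor, sv_image: torch.Tensor) -> torch.Tensor:
        # sv_image is expected to have the same number of channels as the generated image
        i_sv = self.to_generated(r2)
        m = torch.sigmoid(self.to_mask(r2))      # attention mask M, bounded to [0, 1] (assumption)
        return i_sv * m + sv_image * (1.0 - m)   # I_pred
```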
The generated image I_pred is additionally input into three convolution layers and a fully connected layer, which output a matrix H_pred. H_pred is a 3 × 1 matrix that represents the illumination position along the three dimensions and is compared against the input truth label during the training process.
Because the image fusion method provided by this embodiment provides a mask image, during network learning the background region of the initial fusion image basically remains unchanged, while the size of the foreground target image (i.e., the target object) may cause the overall error loss to fluctuate by different orders of magnitude. For this reason, the embodiment of this application uses the normalized Mean Square Error (MSE) over the foreground target image as the loss of the network; the specific formula is shown in formula (1).
Finally, the initially separated H channel is added back, the channels are merged, and the final target fusion image is output.
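The final re-merge is simply the inverse of the earlier split; as before, the variable names are assumptions and dummy arrays stand in for the real data.

```python
import cv2
import numpy as np

# hue_image: the H channel separated at the start; fused_sv: the fused S and V channels
hue_image = np.zeros((256, 256), dtype=np.uint8)
fused_sv = np.zeros((256, 256, 2), dtype=np.uint8)

fused_hsv = cv2.merge([hue_image, fused_sv[..., 0], fused_sv[..., 1]])
target_fusion_bgr = cv2.cvtColor(fused_hsv, cv2.COLOR_HSV2BGR)   # final target fusion image
```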
That is to say, the image fusion method provided by the embodiment of the application takes the illumination source and shadow conditions into account: it can change the original shadow of the target image while fusing the images, making the output result more realistic and robust; and by separating out the hue image and adding an attention mechanism, the model more easily learns better features and produces a better output.
In summary, by removing the hue channel and using only the picture combined from the saturation channel and the brightness channel as input, the whole network model avoids learning color and hue characteristics and learns the other characteristics better, so the result is better. In addition, the model structure provided by the embodiment of the application can realize end-to-end image fusion, that is, after the target image to be fused and the background image are input, the target fusion image can be directly output by the image fusion model. Furthermore, the embodiment of the application uses a high-resolution convolutional network (HR-Net) so that the images subsequently input into the U-Net network carry multi-scale and multi-dimensional detail information, enabling the network to learn better illumination and shadow characteristics.
This embodiment provides an image fusion method, and its specific implementation has been described in detail through the foregoing embodiments. It can be seen that, compared with the related art, on the one hand, the embodiment of the present application trains the model with a picture combining the saturation and brightness channels as input, which prevents the model from learning hue-related features, whereas the related art needs to input the original image and therefore inevitably learns unnecessary features that affect the result; on the other hand, the embodiment of the present application makes the illumination and shadow of the target image and of the background image in the fused image consistent, so the fusion result is more realistic, whereas the related art only makes the color domains of the fused image consistent without considering the illumination and shadow effect.
In yet another embodiment of the present application, referring to fig. 6, a schematic structural diagram of an image fusion apparatus 60 provided in this embodiment of the present application is shown. As shown in fig. 6, the image fusion apparatus 60 may include an acquisition unit 601, a separation unit 602, a fusion unit 603, and a merging unit 604, wherein,
the acquiring unit 601 is configured to acquire an initial fusion image and a mask image;
the separation unit 602 is configured to perform image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
the fusion unit 603 is configured to perform image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
the merging unit 604 is configured to perform channel merging processing on the intermediate fused image and the second channel image to generate a target fused image.
In some embodiments, the image channel includes a hue channel, a saturation channel, and a brightness channel; wherein the first channel image represents an SV channel image composed of the saturation channel and the brightness channel, and the second channel image represents an H channel image of the hue channel.
In some embodiments, the obtaining unit 601 is further configured to obtain a target image to be fused and a background image; obtain the mask image according to the region outside the target object in the target image to be fused; and obtain the initial fusion image according to the target object in the target image to be fused and the background image.
In some embodiments, as shown in fig. 6, the image fusion apparatus 60 may include a conversion unit 605 configured to perform image format conversion on the initial fusion image, so that the converted initial fusion image is an image in a target format.
In some embodiments, the image fusion model comprises a first feature network, a splicing network, and a second feature network; the fusion unit 603 is specifically configured to perform feature extraction processing on the first channel image and the mask image by using the first feature network to obtain a multi-scale and multi-dimensional feature image; perform array splicing processing on the first channel image and the mask image by using the splicing network to obtain a spliced image; and perform attention learning and feature operation processing on the feature image and the spliced image by using the second feature network to obtain the intermediate fusion image.
In some embodiments, the second feature network comprises a feature sub-network, a first convolution module, a second convolution module, and a feature operation module; the fusion unit 603 is further specifically configured to perform feature extraction processing on the feature image and the spliced image through the feature sub-network to obtain an attention image; perform convolution processing on the attention image through the first convolution module to obtain a generated image; perform convolution processing on the attention image through the second convolution module to obtain an attention mask image, and obtain a reverse attention mask image according to the attention mask image; and perform feature operation processing on the generated image, the attention mask image, the reverse attention mask image and the first channel image through the feature operation module to obtain the intermediate fusion image.
In some embodiments, the feature operation module comprises a first multiplication module, a second multiplication module, and an addition module; the fusion unit 603 is further specifically configured to perform element multiplication processing on the generated image and the attention mask image by using the first multiplication module to obtain a first fusion image; perform element multiplication processing on the reverse attention mask image and the first channel image by using the second multiplication module to obtain a second fusion image; and perform element addition processing on the first fusion image and the second fusion image by using the addition module to obtain the intermediate fusion image.
In some embodiments, the first feature network is a high-resolution HR-Net network, the splicing network is a merged Concat network, and the feature sub-network is a U-Net network; the convolution kernel of the first convolution layer is 1 × 1, and the number of channels is 3; the convolution kernel of the second convolution layer is 1 × 1, and the number of channels is 1.
In some embodiments, as shown in fig. 6, the image fusion apparatus 60 may include a training unit 606 configured to obtain a training sample set, where the training sample set includes at least one group of sample pictures, and each group of sample pictures includes a sample background image, a sample target image to be fused, and a sample light source position of the sample background image; and train a preset network model by using the training sample set to obtain the image fusion model.
In some embodiments, the training unit 606 is further configured to obtain a sample mask image according to the region outside the target object in the sample target image to be fused; obtain a sample initial fusion image according to the target object in the sample target image to be fused and the sample background image; perform image format conversion on the sample initial fusion image to obtain a sample initial fusion image in a target format; and train the preset network model by using the sample mask image and the sample initial fusion image in the target format to obtain the image fusion model.
In some embodiments, the training unit 606 is further configured to perform image fusion processing on the sample mask image and the sample initial fusion image in the target format through the preset network model to obtain a sample fusion image; determine a loss function according to the sample light source position of the sample background image and the sample fusion image; and, in the process of training the preset network model by using the sample mask image and the sample initial fusion image in the target format, determine the trained preset network model as the image fusion model when the loss function reaches a preset loss value.
In some embodiments, the training unit 606 is further configured to, if the loss function does not reach the preset loss value, continue training the preset network model by using the sample mask image and the sample initial fusion image in the target format until the loss function reaches the preset loss value.
In some embodiments, the preset network model comprises a light source position network; the training unit 606 is further configured to determine the light source position of the sample fusion image through the light source position network to obtain a fusion light source position of the sample fusion image; and determine the loss function based on the fusion light source position and the sample light source position.
It is understood that in this embodiment, a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., and may also be a module, or may also be non-modular. Moreover, each component in the embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on the understanding that the technical solution of the present embodiment essentially or partly contributes to the prior art, or all or part of the technical solution may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Accordingly, the present embodiment provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the steps of the image fusion method of any one of the preceding embodiments.
Based on the composition of the image fusion apparatus 60 and the computer storage medium described above, refer to fig. 7, which shows a schematic diagram of the composition structure of an electronic device 70 provided in an embodiment of the present application. As shown in fig. 7, the electronic device 70 may include: a communication interface 701, a memory 702, and a processor 703; the various components are coupled together by a bus system 704. It is understood that the bus system 704 is used to enable connected communication between these components. The bus system 704 includes a power bus, a control bus, and a status signal bus in addition to a data bus; for clarity of illustration, however, the various buses are labeled in fig. 7 as the bus system 704. The communication interface 701 is used for receiving and sending signals in the process of receiving and sending information with other external network elements;
a memory 702 for storing a computer program capable of running on the processor 703;
a processor 703 for executing, when running the computer program, the following:
acquiring an initial fusion image and a mask image;
carrying out image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
carrying out image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
and carrying out channel merging processing on the intermediate fusion image and the second channel image to generate a target fusion image.
It will be appreciated that the memory 702 in the embodiment of the present application can be either volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 703 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method may be implemented by integrated logic circuits of hardware or by instructions in the form of software in the processor 703. The processor 703 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 702, and the processor 703 reads the information in the memory 702 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Optionally, as another embodiment, the processor 703 is further configured to, when running the computer program, perform the method of any one of the foregoing embodiments.
Referring to fig. 8, a schematic diagram of a component structure of another electronic device 70 provided in the embodiment of the present application is shown. As shown in fig. 8, the electronic device 70 at least includes the image fusion apparatus 60 according to any of the previous embodiments.
For the electronic device 70, before image fusion, the second channel image related to the color tone is removed, and only the first channel image is used for fusion processing, so that the problem that the color tone information of the obtained target fusion image is changed due to the fact that the image fusion model learns the characteristics related to the color tone can be avoided, the illumination shadow effect of the target object in the target fusion image and the illumination shadow effect of the background image tend to be consistent, and the image fusion effect is improved.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.
It should be noted that, in the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. An image fusion method, characterized in that the method comprises:
acquiring an initial fusion image and a mask image;
carrying out image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
carrying out image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
and carrying out channel merging processing on the intermediate fusion image and the second channel image to generate a target fusion image.
2. The method of claim 1, wherein the image channels comprise a hue (H) channel, a saturation (S) channel, and a brightness (V) channel; wherein,
the first channel image represents an SV channel image composed of the saturation channel and the brightness channel, and the second channel image represents an H channel image of the hue channel.
3. The method of claim 1, wherein the acquiring the initial fused image and the mask image comprises:
acquiring a target image to be fused and a background image;
obtaining the mask image according to the region outside the target object in the target image to be fused;
and obtaining the initial fusion image according to the target object in the target image to be fused and the background image.
4. The method of claim 1, wherein prior to said image channel separation processing of said initial fused image, said method further comprises:
and carrying out image format conversion on the initial fusion image so as to enable the converted initial fusion image to be an image in a target format.
5. The method of claim 1, wherein the image fusion model comprises a first feature network, a stitching network, and a second feature network;
correspondingly, the performing image fusion processing on the first channel image and the mask image by using the image fusion model to obtain an intermediate fusion image comprises:
performing feature extraction processing on the first channel image and the mask image by using the first feature network to obtain a multi-scale and multi-dimensional feature image;
carrying out array splicing processing on the first channel image and the mask image by using the splicing network to obtain a spliced image;
and performing attention learning and feature operation processing on the feature images and the spliced images by using the second feature network to obtain the intermediate fusion image.
6. The method of claim 5, wherein the second feature network comprises a feature subnetwork, a first convolution module, a second convolution module, and a feature computation module;
correspondingly, the performing attention learning and feature operation processing on the feature image and the stitched image by using the second feature network to obtain the intermediate fusion image includes:
performing feature extraction processing on the feature image and the spliced image through the feature sub-network to obtain an attention image;
carrying out convolution processing on the attention map image through the first convolution module to obtain a generated image;
performing convolution processing on the attention map image through the second convolution module to obtain an attention mask image, and obtaining a reverse attention mask image according to the attention mask image;
and performing feature operation processing on the generated image, the attention mask image, the reverse attention mask image and the first channel image through the feature operation module to obtain the intermediate fusion image.
7. The method of claim 6, wherein the feature operation module comprises a first multiplication module, a second multiplication module, and an addition module;
correspondingly, the performing, by the feature operation module, feature operation processing on the generated image, the attention mask image, the reverse attention mask image, and the first channel image to obtain the intermediate fusion image includes:
carrying out element multiplication processing on the generated image and the attention mask image by using the first multiplication module to obtain a first fusion image;
performing element multiplication processing on the reverse attention mask image and the first channel image by using the second multiplication module to obtain a second fusion image;
and performing element addition processing on the first fusion image and the second fusion image by using the addition module to obtain the intermediate fusion image.
8. The method of claim 6,
the first characteristic network is a high-resolution HR-Net network, the splicing network is a merged Concat network, and the characteristic sub-network is a U-Net network;
the convolution kernel of the first convolution layer is 1 multiplied by 1, and the number of channels is 3; the convolution kernel of the second convolution layer is 1 multiplied by 1, and the number of channels is 1.
9. The method of claim 1, further comprising:
acquiring a training sample set, wherein the training sample set comprises at least one group of sample pictures, and each group of sample pictures comprises a sample background image, a target image to be fused of a sample and a sample light source position of the sample background image;
and training a preset network model by using the training sample set to obtain the image fusion model.
10. The method of claim 9, further comprising:
obtaining a sample mask image according to the area outside the target object in the sample target image to be fused;
obtaining a sample initial fusion image according to a target object in the sample target image to be fused and the sample background image;
carrying out image format conversion on the sample initial fusion image to obtain a sample initial fusion image in a target format;
correspondingly, the training a preset network model by using the training sample set to obtain the image fusion model includes:
and training the preset network model by using the sample mask image and the sample initial fusion image in the target format to obtain the image fusion model.
11. The method according to claim 10, wherein the training the predetermined network model by using the sample mask image and the sample initial fusion image in the target format to obtain the image fusion model comprises:
performing image fusion processing on the sample mask image and the sample initial fusion image in the target format through the preset network model to obtain a sample fusion image;
determining a loss function according to the sample light source position of the sample background image and the sample fusion image;
and in the process of training the preset network model by using the sample mask image and the sample initial fusion image in the target format, when the loss function reaches a preset loss value, determining the trained preset network model as the image fusion model.
12. The method of claim 11, further comprising:
and under the condition that the loss function does not reach a preset loss value, continuing to execute the step of training the preset network model by using the sample mask image and the sample initial fusion image in the target format until the loss function reaches the preset loss value.
13. The method of claim 11, wherein the predetermined network model comprises a light source location network;
correspondingly, the determining a loss function according to the sample light source position of the sample background image and the sample fusion image comprises:
determining the light source position of the sample fusion image through the light source position network to obtain the fusion light source position of the sample fusion image;
determining the loss function based on the fusion light source position and the sample light source position.
14. An image fusion apparatus comprising an acquisition unit, a separation unit, a fusion unit, and a merging unit, wherein,
the acquiring unit is configured to acquire an initial fusion image and a mask image;
the separation unit is configured to perform image channel separation processing on the initial fusion image to obtain a first channel image and a second channel image;
the fusion unit is configured to perform image fusion processing on the first channel image and the mask image by using an image fusion model to obtain an intermediate fusion image;
the merging unit is configured to perform channel merging processing on the intermediate fused image and the second channel image to generate a target fused image.
15. An electronic device, comprising a memory and a processor, wherein,
the memory for storing a computer program operable on the processor;
the processor, when running the computer program, is configured to perform the image fusion method according to any one of claims 1 to 13.
16. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by at least one processor, implements the image fusion method according to any one of claims 1 to 13.
CN202110994987.3A 2021-08-27 2021-08-27 Image fusion method and device, electronic equipment and computer storage medium Pending CN115760657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994987.3A CN115760657A (en) 2021-08-27 2021-08-27 Image fusion method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110994987.3A CN115760657A (en) 2021-08-27 2021-08-27 Image fusion method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN115760657A true CN115760657A (en) 2023-03-07

Family

ID=85331926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110994987.3A Pending CN115760657A (en) 2021-08-27 2021-08-27 Image fusion method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN115760657A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503933A (en) * 2023-05-24 2023-07-28 北京万里红科技有限公司 Periocular feature extraction method and device, electronic equipment and storage medium
CN116503933B (en) * 2023-05-24 2023-12-12 北京万里红科技有限公司 Periocular feature extraction method and device, electronic equipment and storage medium
CN116958766A (en) * 2023-07-04 2023-10-27 阿里巴巴(中国)有限公司 Image processing method
CN116958766B (en) * 2023-07-04 2024-05-14 阿里巴巴(中国)有限公司 Image processing method and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110648375B (en) Image colorization based on reference information
US11373275B2 (en) Method for generating high-resolution picture, computer device, and storage medium
CN109618222B (en) A kind of splicing video generation method, device, terminal device and storage medium
CN115760657A (en) Image fusion method and device, electronic equipment and computer storage medium
Ji et al. Colorformer: Image colorization via color memory assisted hybrid-attention transformer
CN109995997A (en) Polyphaser processor with characteristic matching
WO2023151511A1 (en) Model training method and apparatus, image moire removal method and apparatus, and electronic device
Song et al. Weakly-supervised stitching network for real-world panoramic image generation
Liu et al. Hallucinating color face image by learning graph representation in quaternion space
Yan et al. PCNet: Partial convolution attention mechanism for image inpainting
CN115294055A (en) Image processing method, image processing device, electronic equipment and readable storage medium
Xiong et al. Joint intensity–gradient guided generative modeling for colorization
Zhang et al. Invertible network for unpaired low-light image enhancement
US20230098437A1 (en) Reference-Based Super-Resolution for Image and Video Enhancement
WO2023151214A1 (en) Image generation method and system, electronic device, storage medium, and product
Asad et al. The The Computer Vision Workshop: Develop the skills you need to use computer vision algorithms in your own artificial intelligence projects
Jiang et al. Deep fusion local-content and global-semantic for image inpainting
Madeira et al. Neural Colour Correction for Indoor 3D Reconstruction Using RGB-D Data
CN111583168A (en) Image synthesis method, image synthesis device, computer equipment and storage medium
CN114693538A (en) Image processing method and device
Guo et al. Opt2Ada: an universal method for single-image low-light enhancement
CN111507900A (en) Image processing method, system, machine readable medium and equipment
Yang et al. An end‐to‐end perceptual enhancement method for UHD portrait images
Günther et al. Style adaptive semantic image editing with transformers
CN117710697B (en) Object detection method, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination