CN115797171A - Method and device for generating composite image, electronic device and storage medium - Google Patents

Method and device for generating composite image, electronic device and storage medium

Info

Publication number
CN115797171A
Authority
CN
China
Prior art keywords
target
image
background
shadow
foreground
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211430211.XA
Other languages
Chinese (zh)
Inventor
张子恺
翟佳
谢晓丹
郭单
王梓权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Environmental Features
Original Assignee
Beijing Institute of Environmental Features
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Environmental Features filed Critical Beijing Institute of Environmental Features
Priority to CN202211430211.XA priority Critical patent/CN115797171A/en
Publication of CN115797171A publication Critical patent/CN115797171A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the invention relates to the technical field of image processing, in particular to a method and a device for generating a composite image, electronic equipment and a storage medium, wherein the method comprises the following steps: generating a first image consisting of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method; extracting the background feature of the first image and the target feature of the foreground target mask by using a coding and decoding network, and fusing the background feature and the target feature to generate a harmonious second image; and generating a shadow for the foreground target in the second image by utilizing a shadow generation adversarial network to generate a target image. According to the scheme, the quality of the synthesized image can be improved.

Description

Method and device for generating composite image, electronic device and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for generating a composite image, an electronic device, and a storage medium.
Background
Training and learning of artificial intelligence models rely on large amounts of data and rich features, but in practical applications, relevant data for certain specific types of targets is difficult to obtain. There is therefore a need to build a rich and diverse sample library and feature library based on small-sample images, thereby providing data for deep learning and training of models.
In the related art, in order to obtain a rich sample library, a foreground object region is usually cut out of an original image and pasted onto another background image to synthesize a new image. However, the target region and the background region in the synthesized image often suffer from visually inconsistent and poorly coordinated features, so the quality of the synthesized image is poor.
Therefore, there is a need for a method, an apparatus, an electronic device and a storage medium for generating a composite image to solve the above problems.
Disclosure of Invention
Based on the problem of poor quality of images synthesized by the existing method, embodiments of the present invention provide a method and an apparatus for generating a synthesized image, an electronic device, and a storage medium, which can improve the quality of the synthesized image.
In a first aspect, an embodiment of the present invention provides a method for generating a composite image, including:
generating a first image consisting of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method;
extracting the background feature of the first image and the target feature of the foreground target mask by using a coding and decoding network, and fusing the background feature and the target feature to generate a harmonious second image;
and generating a shadow for the foreground target in the second image by utilizing a shadow generation adversarial network to generate a target image.
In one possible design, the codec network includes a backbone network and a local network; the backbone network is used for extracting background features of the first image, and the local network is used for extracting target features of the foreground target mask.
In one possible design, the extracting, by using a codec network, a background feature of the first image and a target feature of the foreground target mask, and fusing the background feature and the target feature to generate a harmonious second image includes:
extracting background features of the first image by using the backbone network;
extracting target features of the foreground target mask by using the local network;
inputting the background feature and the foreground target feature into an encoder-decoder, and applying the background feature to the target feature by using the encoder-decoder to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
In one possible design, the second image includes a background object, and the background object has a shadow;
the generating a shadow for the foreground target in the second image by using the shadow generation adversarial network to obtain a target image includes:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
In a second aspect, an embodiment of the present invention further provides an apparatus for generating a composite image, including:
the system comprises a first generation module, a second generation module and a third generation module, wherein the first generation module is used for generating a first image consisting of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method;
a second generation module, configured to extract, by using an encoding and decoding network, a background feature of the first image and a target feature of the foreground target mask, and fuse the background feature and the target feature to generate a harmonious second image;
and the third generation module is used for generating a shadow for the foreground target in the second image by utilizing the shadow generation adversarial network to generate a target image.
In one possible design, the codec network includes a backbone network and a local network; the backbone network is used for extracting background features of the first image, and the local network is used for extracting target features of the foreground target mask.
In one possible design, the second generating module includes:
extracting background features of the first image by using the backbone network;
extracting target features of the foreground target mask by using the local network;
inputting the background features and the foreground target features into an encoder-decoder, and applying the background features to the target features by using the encoder-decoder to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
In one possible design, the second image includes a background object, and the background object has a shadow; the third generating module includes:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the processor implements the method described in any embodiment of this specification.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to execute the method described in any embodiment of the present specification.
The embodiment of the invention provides a method and a device for generating a composite image, an electronic device and a storage medium, wherein the background feature of a first image and the target feature of a foreground target mask are extracted, and the extracted background feature and target feature are fused, so that harmonization of the foreground target is realized and the coordination of pixel values at the fusion edge between the foreground target and the background is improved. In addition, by generating shadows for foreground objects, the realism of the composite image can be further improved. According to the scheme, the quality of the synthesized image can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a method for generating a composite image according to an embodiment of the present invention;
FIG. 2 is a first image generated by the method of the present invention, according to an embodiment of the present invention;
FIG. 3 is a foreground object mask of the foreground object in the first image shown in FIG. 2;
FIG. 4 is a target image obtained by performing the harmonization process and the shadow generation on the image shown in FIG. 2 according to the present invention;
FIG. 5 is a diagram of a hardware architecture of an electronic device according to an embodiment of the present invention;
fig. 6 is a block diagram of a synthetic image generating apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
Specific implementations of the above concepts are described below.
Referring to fig. 1, an embodiment of the present invention provides a method for generating a composite image, where the method includes:
step 100, generating a first image composed of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method;
step 102, extracting background features of the first image and target features of the foreground target mask by using a coding and decoding network, and fusing the background features and the target features to generate a harmonious second image;
and step 104, generating a shadow for the foreground target in the second image by using the shadow generation adversarial network to generate a target image.
In the embodiment of the invention, the background feature of the first image and the target feature of the foreground target mask are extracted, and the extracted background feature and the target feature are fused, so that the harmony processing of the foreground target is realized, and the harmony degree of the pixel values at the fusion edge of the foreground target and the background is improved. In addition, by generating shadows for foreground objects, the realism of the composite image can be further improved. According to the scheme, the quality of the synthesized image can be improved.
The manner in which the various steps shown in fig. 1 are performed is described below.
First, in step 100, a first image composed of a foreground object and a background image and a foreground object mask of the foreground object in the first image are generated based on an Alpha transparent mask marking method.
In this step, a foreground target and a background image are first automatically composited using an Alpha transparent mask. Specifically, the blank area at the edge of the mask image is cropped according to the minimum bounding rectangle of the foreground target, and the cropped mask image is composited according to information such as coordinate position, zoom magnification and rotation angle, to obtain a first image containing the background image and the foreground target. From the synthesized first image, a binary mask image of the target area in the first image, namely the foreground target mask, is obtained, which provides the input for the harmonization algorithm. In addition, to ensure image quality, when the target region is labeled and cut out, a polygonal labeling tool should not be used to label the foreground region. This reduces interference that the original environment in the foreground image may introduce, such as external feature information like object shadows; once carried into a new background, such information can violate the physical imaging rules and reduce the realism of the synthesized image.
It should be noted that, for subsequent training and testing of the codec network, this step may generate a plurality of different first images to form a data set, and divide the data set into a training set and a validation set. In some embodiments, the ratio of training samples to validation samples may be set to 8:2.
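As a concrete illustration of this step, the compositing might look like the following sketch. This is a minimal sketch under assumptions, not the patent's implementation: it assumes PIL and NumPy, and the function name, file names and paste parameters are hypothetical.

```python
import numpy as np
from PIL import Image

def composite_first_image(foreground, fg_mask, background,
                          position, scale, angle):
    """Crop the mask's blank border via the minimum bounding rectangle,
    transform the object, and Alpha-paste it onto the background."""
    # Minimum bounding rectangle of the foreground object in the mask.
    ys, xs = np.nonzero(np.array(fg_mask) > 127)
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    fg, mask = foreground.crop(box), fg_mask.crop(box)

    # Apply the zoom magnification and rotation angle.
    size = (int(fg.width * scale), int(fg.height * scale))
    fg = fg.resize(size).rotate(angle, expand=True)
    mask = mask.resize(size).rotate(angle, expand=True)

    # Paste at the coordinate position; the Alpha mask keeps only
    # the object pixels.
    first_image = background.copy()
    first_image.paste(fg, position, mask)

    # Binary mask of the object region in the composited first image,
    # i.e. the foreground target mask fed to the harmonization step.
    full_mask = Image.new("L", background.size, 0)
    full_mask.paste(mask, position, mask)
    return first_image, full_mask

# Hypothetical usage; repeating this with varied positions, scales and
# angles yields a data set that can then be split 8:2 into training
# and validation sets.
fg = Image.open("plane.png").convert("RGB")
m = Image.open("plane_mask.png").convert("L")
bg = Image.open("field.png").convert("RGB")
first, target_mask = composite_first_image(fg, m, bg, (120, 80), 0.75, 15.0)
```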
Then, in step 102, a coding and decoding network is used to extract the background feature of the first image and the target feature of the foreground target mask, and the background feature and the target feature are fused to generate a harmonious second image.
In some embodiments, a codec network includes a backbone network and a local network; the backbone network is used for extracting background features of the first image, and the local network is used for extracting target features of the foreground target mask.
In the related art, the codec network usually only has a backbone network, and then all features in the first image are extracted by using the backbone network, which is not favorable for the fusion of the background features and the target features.
In this embodiment, a local network is included in addition to the backbone network. The backbone network is constructed using a semantic segmentation architecture whose input and output resolutions are consistent, pre-trained on the ImageNet data set, and then used to extract the background features of the first image. Because its input and output resolutions are consistent, the backbone network can produce high-resolution output with a large receptive field.
Generally, the backbone network takes the RGB image as input, and therefore the features of the foreground object mask cannot be obtained; that is, the backbone network cannot accurately compute the respective features of the foreground object and the background image, which may negatively affect the quality of the generated image. Therefore, this embodiment designs an additional convolution layer, i.e. a local network, for the mask image of the foreground object on top of the backbone network. With this local convolution network, the coding and decoding network can accept N-channel input instead of only RGB images.
In some embodiments, the local network takes the foreground object mask as input and produces a 64-channel output, which is summed with the output of the backbone network's convolutional layer. The core feature of this approach is that it allows different learning rates to be set for the weights that process the foreground object mask.
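A minimal sketch of this input stage follows, assuming PyTorch; the 3x3 kernel size, the class name and the concrete learning rates are assumptions rather than values given in the text.

```python
import torch
import torch.nn as nn

class MaskAwareStem(nn.Module):
    """Input stage of the codec network: the backbone convolution
    consumes the RGB composite, a parallel local convolution consumes
    the 1-channel foreground target mask, and the two 64-channel
    outputs are summed."""
    def __init__(self):
        super().__init__()
        self.backbone_conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.local_conv = nn.Conv2d(1, 64, kernel_size=3, padding=1)

    def forward(self, rgb, mask):
        return self.backbone_conv(rgb) + self.local_conv(mask)

stem = MaskAwareStem()
# Different learning rates for the weights that process the mask,
# as the text highlights (the rates here are illustrative).
optimizer = torch.optim.Adam([
    {"params": stem.backbone_conv.parameters(), "lr": 1e-4},
    {"params": stem.local_conv.parameters(), "lr": 1e-3},
])
```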
In some embodiments, step 102 comprises: extracting background features of the first image by using a backbone network;
extracting target characteristics of the foreground target mask by using a local network;
inputting the background features and the foreground target features into the encoder-decoder, and applying the background features to the target features by using the encoder-decoder so as to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
In this embodiment, the foreground object region prediction model is based on an encoder-decoder structure, and this combined encoder-decoder structure ensures a mapping from high-resolution input to high-resolution output. In addition, cross connections are arranged between the encoder and the decoder; through the cross-connection operation, the encoder features can be concatenated with the decoder features, which facilitates harmonization of the image and avoids image blurring and loss of texture details.
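The cross connections can be pictured with a two-level toy encoder-decoder; the real backbone is deeper and pre-trained, and everything below — depth, channel widths, activations — is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class TinyHarmonizer(nn.Module):
    """Two-level encoder-decoder: the cross (skip) connection
    concatenates encoder features with decoder features before the
    output layer, preserving texture detail at full resolution."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(4, ch, 3, 1, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, 2, 1), nn.ReLU())
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.ReLU())
        self.out = nn.Conv2d(2 * ch, 3, 3, 1, 1)  # sees [dec1, enc1]

    def forward(self, composite, mask):
        x = torch.cat([composite, mask], dim=1)   # N-channel input
        e1 = self.enc1(x)                          # full resolution
        e2 = self.enc2(e1)                         # half resolution
        d1 = self.dec1(e2)                         # back to full resolution
        return torch.sigmoid(self.out(torch.cat([d1, e1], dim=1)))

net = TinyHarmonizer()
second = net(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
```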
A feature of image harmonization is that the background area of the output image should remain unchanged relative to the input composite image. When the codec network takes the foreground object mask as input, it easily learns to simply copy the background image. Therefore, in the process of training the codec network, the pixel-level error over the background region will be close to zero, which means that the magnitude of the training loss differs across training samples with different foreground target sizes, resulting in a poor training effect for images with small objects.
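The text only states the problem; a common remedy in harmonization work — purely an assumption here, not something the patent specifies — is to normalize the pixel loss by the foreground area, so that small objects contribute gradients of comparable magnitude:

```python
import torch

def fg_normalized_mse(pred, target, mask, min_area=100.0):
    """MSE over the image divided by the foreground area (clamped from
    below), so the loss magnitude no longer shrinks with object size."""
    sq_err = ((pred - target) ** 2).sum(dim=1)              # (B, H, W)
    area = mask.sum(dim=(1, 2, 3)).clamp(min=min_area)      # (B,)
    return (sq_err.sum(dim=(1, 2)) / area).mean()
```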
In some embodiments, the second image includes a background object, and the background object has a shadow; generating a shadow for the foreground object in the second image by utilizing the shadow generation adversarial network to obtain a target image includes the following steps:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
On the basis of synthesizing the foreground target, this embodiment introduces a shadow area by referring to the real physical imaging of the background target, finally obtaining physical optical imaging close to real conditions and enriching the features of the synthesized image.
In this embodiment, the shadow generation adversarial network includes a shadow generator and a shadow discriminator.
The attention block in the shadow generator is used to generate a realistic shadow and a corresponding attention map; the attention map is a matrix whose elements range from 0 to 1, representing the attention paid to different regions of the real-world environment. The shadow generator is a U-shaped network consisting of 5 downsampling layers and 5 upsampling layers: it first generates a coarse shadow, then fine-tunes the coarse shadow with a refinement module, and finally outputs the refined shadow together with the input image.
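A sketch of such a generator is given below, assuming PyTorch. The channel widths, the two-channel head (coarse shadow plus attention map) and the refinement layers are assumptions; only the 5-down/5-up U shape, the attention map in [0, 1] and the coarse-then-refine flow come from the text.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1),
                         nn.InstanceNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.InstanceNorm2d(cout), nn.ReLU())

class ShadowGenerator(nn.Module):
    """U-shaped generator: 5 downsampling and 5 upsampling layers emit
    a coarse shadow and an attention map; a refinement module then
    fine-tunes the coarse shadow against the input image."""
    def __init__(self):
        super().__init__()
        ch = [4, 64, 128, 256, 512, 512]        # input = RGB + mask
        self.downs = nn.ModuleList([down(ch[i], ch[i + 1])
                                    for i in range(5)])
        self.ups = nn.ModuleList([up(ch[i + 1], ch[i])
                                  for i in (4, 3, 2, 1)])
        self.head = nn.ConvTranspose2d(64, 2, 4, 2, 1)  # 5th up layer
        self.refine = nn.Sequential(nn.Conv2d(4, 32, 3, 1, 1), nn.ReLU(),
                                    nn.Conv2d(32, 1, 3, 1, 1))

    def forward(self, image, mask):
        x = torch.cat([image, mask], dim=1)
        for d in self.downs:
            x = d(x)
        for u in self.ups:
            x = u(x)
        coarse, attn = self.head(x).split(1, dim=1)
        attn = torch.sigmoid(attn)               # attention in [0, 1]
        shadow = self.refine(torch.cat([image, coarse], dim=1))
        return shadow, attn
```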
The shadow discriminator is used for judging how realistic the virtual shadow is, so as to assist the training of the shadow generator. The shadow discriminator contains 4 consecutive convolutions, with valid padding, instance normalization and the Leaky ReLU operation. A further convolution then generates the final feature map, which is activated by a sigmoid function. The final output of the shadow discriminator is the global average pool of the activated last feature map. In the intelligent foreground-shadow generation structure for the synthetic optical image, the shadow discriminator takes as input the spliced (concatenated) virtual object shadow of the image, the virtual object mask, and the image containing the virtual object shadow.
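Correspondingly, the discriminator might be sketched as follows; the stride-2 kernels and channel widths are assumptions (the text's "valid padding" is approximated here with padding 1), while the 4 convolutions, instance normalization, Leaky ReLU, sigmoid activation and global average pool follow the text.

```python
import torch
import torch.nn as nn

class ShadowDiscriminator(nn.Module):
    """Four convolutions with instance normalization and Leaky ReLU,
    one more convolution to a 1-channel feature map, sigmoid
    activation, then a global average pool as the final output."""
    def __init__(self):
        super().__init__()
        ch = [5, 64, 128, 256, 512]   # image + shadow + object mask
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(ch[i], ch[i + 1], 4, 2, 1),
                       nn.InstanceNorm2d(ch[i + 1]),
                       nn.LeakyReLU(0.2)]
        self.features = nn.Sequential(*layers)
        self.final = nn.Conv2d(512, 1, 4, 1, 1)

    def forward(self, shadowed_image, shadow, object_mask):
        # Splice (concatenate) the three inputs along the channel axis.
        x = torch.cat([shadowed_image, shadow, object_mask], dim=1)
        score_map = torch.sigmoid(self.final(self.features(x)))
        return score_map.mean(dim=(2, 3))        # global average pool
```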
To illustrate the effect of the method of the present invention, the inventors tested the method. FIG. 2 shows a first image synthesized by the method, which includes a foreground object and a background image; FIG. 3 is the foreground object mask of the foreground object; FIG. 4 is the target image obtained after the harmonization processing and the shadow generation are performed. As can be seen from FIG. 4, the generated target image is close to a real image, and the image quality is high.
As shown in fig. 5 and 6, an embodiment of the present invention provides a device for generating a composite image. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. From a hardware aspect, fig. 5 is a hardware architecture diagram of an electronic device in which a device for generating a composite image according to an embodiment of the present invention is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 5, the electronic device in which the device is located may also include other hardware, such as a forwarding chip responsible for processing messages. As shown in fig. 6, a logical device is formed by the CPU of the electronic device reading the corresponding computer program from the nonvolatile memory into the memory and running it. The present embodiment provides a device for generating a composite image, including:
a first generating module 600, configured to generate a first image composed of a foreground object and a background image and a foreground object mask of the foreground object in the first image based on an Alpha transparent mask marking method;
a second generating module 602, configured to extract a background feature of the first image and a target feature of the foreground target mask by using a coding and decoding network, and fuse the background feature and the target feature to generate a harmonious second image;
a third generating module 604, configured to generate a shadow for the foreground object in the second image by using the shadow generation adversarial network, and generate a target image.
In an embodiment of the present invention, the first generating module 600 may be configured to perform step 100 in the above-described method embodiment, the second generating module 602 may be configured to perform step 102 in the above-described method embodiment, and the third generating module 604 may be configured to perform step 104 in the above-described method embodiment.
In some embodiments, a codec network includes a backbone network and a local network; the backbone network is used for extracting the background feature of the first image, and the local network is used for extracting the target feature of the foreground target mask.
In some embodiments, the second generating module 602 is configured to perform:
extracting background features of the first image by using a backbone network;
extracting target characteristics of the foreground target mask by using a local network;
inputting the background feature and the foreground target feature into the encoder-decoder, and applying the background feature to the target feature by using the encoder-decoder to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
In some embodiments, the second image includes a background object, and the background object has a shadow; the third generating module 604 is configured to perform:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
It is to be understood that the illustrated configuration of the embodiment of the present invention does not specifically limit the apparatus for generating a composite image. In other embodiments of the invention, a composite image generating device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the content of information interaction, execution process, and the like among the modules in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
An embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement a method for generating a composite image according to any embodiment of the present invention.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, causes the processor to execute a method for generating a composite image according to any embodiment of the present invention.
Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion module connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion module to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of generating a composite image, comprising:
generating a first image consisting of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method;
extracting the background feature of the first image and the target feature of the foreground target mask by using a coding and decoding network, and fusing the background feature and the target feature to generate a harmonious second image;
and generating a shadow for the foreground target in the second image by utilizing a shadow generation adversarial network to generate a target image.
2. The method of claim 1, wherein the codec network comprises a backbone network and a local network; the backbone network is used for extracting background features of the first image, and the local network is used for extracting target features of the foreground target mask.
3. The method of claim 2, wherein the extracting, by using a codec network, the background feature of the first image and the target feature of the foreground target mask, and fusing the background feature and the target feature to generate a harmonious second image comprises:
extracting background features of the first image by using the backbone network;
extracting target features of the foreground target mask by using the local network;
inputting the background feature and the foreground target feature into an encoder-decoder, and applying the background feature to the target feature by using the encoder-decoder to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
4. The method of claim 1, wherein the second image comprises a background object, and the background object has a shadow;
the generating a shadow for the foreground target in the second image by using the shadow generation adversarial network to obtain a target image includes:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
5. An apparatus for generating a composite image, comprising:
the system comprises a first generation module, a second generation module and a third generation module, wherein the first generation module is used for generating a first image consisting of a foreground target and a background image and a foreground target mask of the foreground target in the first image based on an Alpha transparent mask marking method;
a second generation module, configured to extract, by using an encoding and decoding network, a background feature of the first image and a target feature of the foreground target mask, and fuse the background feature and the target feature to generate a harmonious second image;
and the third generation module is used for generating a shadow for the foreground target in the second image by using the shadow generation adversarial network to generate a target image.
6. The apparatus of claim 5, wherein the codec network comprises a backbone network and a local network; the backbone network is used for extracting background features of the first image, and the local network is used for extracting target features of the foreground target mask.
7. The apparatus of claim 6, wherein the second generating module is configured to perform:
extracting background features of the first image by using the backbone network;
extracting target features of the foreground target mask by using the local network;
inputting the background features and the foreground target features into an encoder-decoder, and applying the background features to the target features by using the encoder-decoder to realize feature fusion;
and decoding the fused features into a second image by utilizing an output layer of the coding and decoding network.
8. The apparatus of claim 5, wherein the second image comprises a background object, and the background object has a shadow; the third generation module is to perform:
extracting, by using the shadow generation adversarial network, the features of the background target, the shadow features of the background target and the features of the foreground target mask;
acquiring a mapping relation between the background target and the shadow of the background target based on the features of the background target and the shadow features of the background target;
and generating a shadow for the foreground target based on the mapping relation between the background target and the shadow of the background target and the features of the foreground target mask, to obtain a target image.
9. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the method according to any of claims 1-4.
10. A storage medium having stored thereon a computer program, characterized in that the computer program, when executed in a computer, causes the computer to execute the method of any of claims 1-4.
CN202211430211.XA 2022-11-15 2022-11-15 Method and device for generating composite image, electronic device and storage medium Pending CN115797171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211430211.XA CN115797171A (en) 2022-11-15 2022-11-15 Method and device for generating composite image, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211430211.XA CN115797171A (en) 2022-11-15 2022-11-15 Method and device for generating composite image, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115797171A true CN115797171A (en) 2023-03-14

Family

ID=85437966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211430211.XA Pending CN115797171A (en) 2022-11-15 2022-11-15 Method and device for generating composite image, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115797171A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958766A (en) * 2023-07-04 2023-10-27 阿里巴巴(中国)有限公司 Image processing method
CN116958766B (en) * 2023-07-04 2024-05-14 阿里巴巴(中国)有限公司 Image processing method and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination