CN113538456B - Image soft segmentation and background replacement system based on GAN network - Google Patents


Info

Publication number
CN113538456B
CN113538456B
Authority
CN
China
Prior art keywords
image
network
module
model
background
Prior art date
Legal status
Active
Application number
CN202110692455.4A
Other languages
Chinese (zh)
Other versions
CN113538456A (en)
Inventor
张冠华
陈烁
蒋林华
曾新华
庞成鑫
宋梁
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202110692455.4A
Publication of CN113538456A
Application granted
Publication of CN113538456B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 7/11: Region-based segmentation (G06T 7/00 Image analysis; G06T 7/10 Segmentation; edge detection)
    • G06N 3/04: Architecture, e.g. interconnection topology (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06N 3/088: Non-supervised learning, e.g. competitive learning (G06N 3/08 Learning methods)
    • G06T 3/053: Detail-in-context presentations (G06T 3/00 Geometric image transformations in the plane of the image; G06T 3/04 Context-preserving transformations, e.g. by using an importance map)
    • G06T 7/194: Segmentation; edge detection involving foreground-background segmentation
    • G06T 2207/10004: Still image; photographic image (G06T 2207/10 Image acquisition modality)
    • G06T 2207/20081: Training; learning (G06T 2207/20 Special algorithmic details)
    • G06T 2207/20084: Artificial neural networks [ANN]
    All codes fall under G (Physics), G06 (Computing; calculating or counting).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image soft segmentation and background replacement system based on a GAN network. The system comprises two parts: image soft segmentation and background replacement. The image soft segmentation part predicts the foreground and alpha values of an original image and comprises five modules: an input module, a full-text combination module, a residual network module, a pyramid scene parsing module and a lightweight interactive branch module. The background replacement part replaces the background and generates a high-resolution background replacement image, and comprises a generator model and a discriminator model. The beneficial effects of the invention are that it reduces the heavy work of preparing auxiliary maps during image soft segmentation and, on the premise of obtaining a high-precision segmented image, can replace the background by combining segmentation with image generation.

Description

Image soft segmentation and background replacement system based on GAN network
Technical Field
The invention relates to an image soft segmentation and background replacement system based on a GAN network, and involves the technical fields of deep learning, computer vision, and supervised and unsupervised learning.
Background
The flood of data has propelled the deep learning field: computers' capacity to process images has improved greatly and produces high-quality results. There are now more and more ways to acquire pictures, smartphone photography being the most common. Although pictures are abundant, each picture is unique, and it is difficult to combine objects across pictures, i.e., to replace the background. Separating foreground from background has long been a classical problem, and applications combining image soft segmentation with image generation are rare, so realizing image soft segmentation and background replacement based on a GAN network is a very meaningful research subject.
The prior art follows three main routes. First, image segmentation and synthesis with software tools, which perform hard segmentation through fixed built-in functions. Second, sampling/propagation methods that build statistical models of the background colors and solve for the unknown region. Third, manually annotating a ternary map (trimap) and using the annotations together with deep learning for model training and alpha prediction.
The first route requires heavy human participation, and manual processing with software has several problems: first, manual mislabeling is likely when the background color is similar to the color of the target object; second, separating fine boundaries such as the hair of people or animals is difficult and the results are poor; third, manually processing massive data is far too inefficient. Manual software-based image processing is therefore suitable only for segmentation and synthesis in a few scenarios.
The second route is based on sampling/propagation numerical statistics and requires a trimap as auxiliary input. Sampling-based methods establish color statistics of the known foreground and background and then solve for the alpha matte in the 'unknown' region; propagation-based methods propagate alpha values from the foreground and background regions into the unknown region to solve the image equation. Both modes pay the cost of making a trimap, and their results are unpredictable and of mediocre quality.
The third route is based on deep learning: soft segmentation is performed by combining a trimap with a deep neural network, and a synthesis model is trained on various backgrounds using a dataset containing ground-truth mattes. This greatly improves the precision of soft segmentation and synthesis, but the cost is still the trimap, and the model depends too heavily on manual annotation, so its robustness is weak.
In summary, the prior art has the following disadvantages: first, the precision cannot meet ultra-high-resolution requirements; second, human participation is high and making the trimap auxiliary map is costly; third, the models are highly annotation-dependent and poorly robust; fourth, most work focuses on the quality and cost of soft segmentation alone rather than on combining image soft segmentation with synthesis.
Disclosure of Invention
Aiming at the need for a high-quality image processing system and at the problems of image background replacement in the prior art, namely time-consuming processing, high manual annotation cost and low precision, the invention provides a GAN-network-based image soft segmentation and background replacement system that reduces the heavy work of making auxiliary maps during image soft segmentation and, on the premise of obtaining a high-resolution segmented image, achieves one-click high-resolution background replacement by combining segmentation with generation.
An image soft segmentation and background replacement system based on a GAN network comprises an image soft segmentation part and a background replacement part: the image soft segmentation part predicts the foreground and alpha values of the image, and the background replacement part generates a high-precision composite image.
One) image soft segmentation
The image soft segmentation part comprises five modules: an input module, a full-text combination module, a residual network module, a pyramid scene parsing module and a lightweight interactive branch module; wherein:
the input module obtains an original image I, a background image B and a target soft segmentation image S through data preprocessing; the target soft segmentation image S is obtained by applying erosion, dilation and Gaussian blur to the subject object extracted from the original image I;
the full-text combination module first encodes the original image I, the background image B and the target soft segmentation image S into 512 × 256 feature maps; then, taking the features of I as the base, it pairs the base with the features of B and with the features of S to form two 512-channel feature maps, from each of which a 64-channel feature map is extracted through convolution, BatchNorm and ReLU; finally the base and the two 64-channel maps are concatenated into 384 channels, from which a 256-channel feature map is extracted through convolution, BatchNorm and ReLU as the input of the following residual network module;
the residual network module comprises a main residual module and two lightweight branch residual modules connected behind it; the main residual module adopts the ResNet-101 architecture with its last two stages replaced by fully convolutional layers with atrous convolution, and its output is the shared residual content; the two lightweight branch residual modules serve foreground prediction and alpha prediction respectively; the residual network module outputs a deep feature map;
the pyramid scene parsing module (PSP) addresses the loss of internal data structure and the lack of spatial consistency caused by pooling and convolution; after the deep feature map is obtained from the residual network, pyramids of four sizes are used, with pooling kernels of 1 × 1, 2 × 2, 3 × 3 and 6 × 6; after pooling, a group of 1 × 1 convolutions performs dimensionality reduction and bilinear interpolation performs upsampling back to the size of the feature map output by the residual network; the resulting feature maps, including the pre-pooling feature map, are then concatenated to complete multi-scale feature fusion; finally the foreground prediction branch residual module uses ReLU to obtain the foreground prediction feature map, and the alpha prediction branch uses Tanh to obtain the alpha prediction map;
the lightweight interactive branch module is attached before the PSP module to receive optional additional guidance information and to support generalization to extreme special cases; the user is allowed to operate on the original image, clicking inside the target object to generate internal guidance and clicking on a positive or negative diagonal of the target object to generate external guidance; two-dimensional Gaussian functions are placed around the inner and outer points to form two inner and outer guide heat maps, which are further encoded into a feature map and merged into the output feature map of the residual network; the user may also choose whether to perform the interaction.
Two) background replacement
The background replacement part comprises a generator network and a discriminator network that together form an unsupervised GAN framework; the generator and discriminator models are fine-tuned within this framework, the generator network continually improving its fit to the distribution of the real data and the discriminator network continually improving its discrimination ability, until Nash equilibrium is finally reached; after training, the system produces a high-resolution background replacement image; wherein:
the generator network takes the foreground image and alpha prediction produced by the image soft segmentation part and composites the foreground onto a new background to generate an image; the generator network comprises a guiding model and a guided model; the training set of the guiding model is a synthetic dataset comprising several foregrounds F_s, labeled alpha mattes α_s, and backgrounds B_s from the COCO dataset; during generator model training, gamma correction and Gaussian blur are applied to the backgrounds B_s to prevent overfitting, avoiding an excessive bias of the system toward merely learning the difference between I_s and B_s; this yields G_teacher, trained with supervised learning, as the guiding model; with G_teacher serving as 'pseudo ground-truth', model training is then performed on real scenes against this pseudo ground-truth, the guided model being self-supervised on a real dataset, yielding G_student as the guided model; the guiding and guided models share the same loss function, with the first loss given the smaller weight; an ADAM optimizer is used to keep the network from falling into a local minimum and to find a better nearby minimum for the real data;
the discriminator network performs adversarial training on the unlabeled data of real scenes using a multi-scale discriminator, judging whether the composite image formed by pasting the foreground result onto a new background is a real sample or a synthesized one; the multi-scale discriminator discriminates at three scales: the original size, 1/2 of the original, and 1/4 of the original; each scale of the multi-scale discriminator uses 3 linear discriminators, each a fully convolutional network composed of several groups of convolution, BatchNorm and LeakyReLU.
Compared with the prior art, the technical scheme of the invention has the following advantages:
First, it overcomes the prior art's dependence on the trimap (ternary map), reducing human participation and the cost of manual annotation.
Second, it provides a global combination module that effectively combines all the different cues and markedly improves the soft segmentation of objects.
Third, combining atrous convolution with PSP scene parsing yields a larger receptive field and more global information, and fusing features of different scales produces sharper image soft segmentation; likewise, the multi-scale discriminator makes it easier to enforce global consistency while discriminating local details.
Fourth, it provides a lightweight interactive branch for manual guidance of the model, improving the generalization of the system.
Fifth, it proposes a GAN network that combines image soft segmentation with image generation for background replacement: the generator and discriminator play an unsupervised game that optimizes the model and finally generates a background replacement image barely distinguishable from a real picture.
Drawings
FIG. 1. Atrous convolution network.
FIG. 2. Pyramid scene parsing module (PSP).
FIG. 3. The image soft segmentation network.
FIG. 4. GAN network flow diagram.
Detailed Description
The system comprises two parts: alpha prediction of the image (image soft segmentation) and background replacement. The first part, image soft segmentation, comprises five modules: an input module, a full-text combination module, a residual network module, a pyramid scene parsing module and a lightweight interactive branch module. The second part, image synthesis, comprises a generator model and a discriminator model.
(I) Alpha prediction of the image. A picture contains 7 per-pixel components: the foreground F (R, G, B), the background B (R, G, B) and the foreground mask alpha matte (α), so the image equation can be expressed as:
I_i = α_i F_i + (1 - α_i) B_i
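As a minimal illustration (not part of the patent text), the image equation can be applied directly to arrays; the sketch below assumes float images with a matte in [0, 1]:

```python
import numpy as np

def composite(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Per-pixel convex blend from the image equation I = αF + (1 - α)B."""
    if alpha.ndim == 2:              # single-channel matte: broadcast it
        alpha = alpha[..., None]     # over the RGB channels
    return alpha * fg + (1.0 - alpha) * bg
```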
To obtain a high-quality soft segmentation, the system needs to predict an accurate foreground and alpha matte. The first part introduces the following modules:
1. and an input module, namely preprocessing of data. The manual labeling of trimaps is expensive, and to overcome this drawback, a background map without target objects is added instead. The input requirements of the system are an image under static conditions, plus an image of the background only, the imaging process is simple and can support the taking of any camera set to lock exposure and focus, e.g. a smartphone camera. Assuming that the camera motion is small, a homography matrix is applied to align the background with the given input image. And finally, obtaining initial soft segmentation of the subject object through corrosion, expansion and Gaussian blur.
In conclusion, data preprocessing yields three parts: the original image (I), the background image (B) and the target soft segmentation image (S).
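A sketch of this preprocessing step is given below. The patent names only the operations (homography alignment, erosion, dilation, Gaussian blur), so the feature matcher (ORB with RANSAC) and the kernel and blur sizes are assumptions:

```python
import cv2
import numpy as np

def preprocess(image, background, rough_mask):
    """Align the clean background plate to the input via a homography
    (small camera motion assumed), then derive the target soft
    segmentation S by erosion, dilation and Gaussian blur."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(background, None)
    k2, d2 = orb.detectAndCompute(image, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    aligned_bg = cv2.warpPerspective(background, H, image.shape[1::-1])

    # Erode, dilate and blur the extracted subject mask into a soft target S.
    kernel = np.ones((5, 5), np.uint8)
    s = cv2.erode(rough_mask, kernel, iterations=1)
    s = cv2.dilate(s, kernel, iterations=2)
    s = cv2.GaussianBlur(s.astype(np.float32), (21, 21), 0) / 255.0
    return image, aligned_bg, s
```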
2. Full-text combination module. The system uses a new full-text combination network to combine all cue features effectively. For example, when the color of the target object is similar to the background, the network should rely more on the segmentation cue for that region than on pixel differences, which avoids internal holes and blurring artifacts that may otherwise occur in the soft segmentation. The specific implementation is as follows:
the I, B and S images are respectively coded into feature maps of 512 x 256. And combining B and S by taking an original image as a substrate to form two feature maps of 512 channels, extracting 64-channel feature maps respectively by convolution and BatchNorm and ReLU, connecting the substrate and the two 64 channels in parallel to form 384 channels, and reducing the substrate and the two 64 channels into 256-channel feature maps by convolution and BatchNorm and ReLU to be used as the input of the next residual error network module. Full-text composition systems facilitate generalization across different datasets and domains.
3. Residual network module. Drawing on the experience of ResNet, the system adopts a residual network as its main module. The backbone selects the ResNet-101 architecture, with the fully connected layer and the max-pooling layer of course removed, and introduces atrous convolution in the last two stages to ensure pixel-level prediction at an acceptable output resolution. The sparse sampling of atrous convolution obtains a larger receptive field, makes instance soft segmentation boundaries sharper, and interacts well with the aggregation module that follows.
The output of the main residual network is the shared residual content; two lightweight branch residual networks follow it, for foreground prediction and alpha prediction respectively. The foreground prediction branch continues through residual blocks, is aggregated by the pyramid scene parsing module, and then passes through a group of convolutions, bilinear interpolation upsampling, BatchNorm and ReLU to obtain the final foreground heat map. The alpha prediction branch passes through residual blocks, the pyramid scene parsing module, and then a group of convolutions, bilinear interpolation upsampling, BatchNorm and Tanh to obtain the final alpha prediction; Tanh is used because the alpha matte value of each pixel needs to lie between 0 and 1.
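A sketch of such a shared backbone follows. torchvision's replace_stride_with_dilation flag dilates exactly the last two ResNet stages; adapting the stem to the 256-channel fused input is an assumption, and the branch heads are omitted for brevity:

```python
import torch.nn as nn
from torchvision.models import resnet101

class SharedResidualBackbone(nn.Module):
    """Sketch of the main residual module: a ResNet-101 trunk with the
    fully connected and max-pooling layers removed and the last two
    stages dilated (atrous convolution) to preserve output resolution."""
    def __init__(self, in_ch=256):
        super().__init__()
        net = resnet101(weights=None,
                        replace_stride_with_dilation=[False, True, True])
        self.stem = nn.Sequential(  # conv1 adapted to the fused input
            nn.Conv2d(in_ch, 64, 7, stride=2, padding=3, bias=False),
            net.bn1, net.relu)      # no max pooling, per the text
        self.trunk = nn.Sequential(net.layer1, net.layer2,
                                   net.layer3, net.layer4)

    def forward(self, x):
        return self.trunk(self.stem(x))  # shared content for both branches
```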
4. Pyramid Scene Parsing (PSP) module. The system selects the currently popular PSP model to handle the relationships between scene elements and aggregate global context information. Although the full-text combination module fuses representation information to a certain extent at a shallow level, the loss of internal data structure and of spatial consistency caused by pooling and convolution still needs further remedy, which the PSP provides. The specific implementation is as follows:
and after the branch residual error network extracts the deep feature map, creating a spatial pool pyramid to fuse feature maps with different scales. The kernel used for pooling is 1 × 1, 2 × 2, 3 × 3, 6 × 6, respectively, and pooling modules of different scales are concerned with activating different regions of the map. After pooling, the data is subjected to convolution dimensionality reduction and bilinear interpolation upsampling by a group of 1 multiplied by 1, and then the data is restored to the output size of the branch network. And (4) performing cascade (cascade) on the obtained feature map before pooling to complete multi-scale feature fusion, and finally connecting a set of convolution. The PSP has strong context inference capability, and the feature extraction from multiple levels, including pixel level, super-pixel level and global, and the consideration of various ranges is integrated to have great help for soft segmentation.
5. Lightweight interactive branch module. To support generalization to extreme cases, the system attaches a lightweight branch before the PSP module to receive optional additional guidance information. The user is allowed to perform operations on the original image, clicking inside the target object to generate internal guidance and clicking on the positive or negative diagonal of the target object to generate external guidance. Two-dimensional Gaussians are placed around the inner and outer points to make two heat maps, which the system encodes into feature maps and merges into the two branches of the residual network. The interaction process is simple, yet it improves the model's adaptability to extreme cases, and the user may also choose whether to perform the interaction.
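The guidance encoding can be sketched as follows; the Gaussian bandwidth sigma is an assumed parameter:

```python
import numpy as np

def click_heatmaps(shape, inner_clicks, outer_clicks, sigma=10.0):
    """Sketch of the interactive branch's guidance: place a 2D Gaussian
    at each user click, producing one 'inside' and one 'outside' guide
    heat map that are later encoded into feature maps."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]

    def heatmap(clicks):
        m = np.zeros((h, w), np.float32)
        for cy, cx in clicks:
            g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
            m = np.maximum(m, g)  # keep the strongest Gaussian per pixel
        return m

    return heatmap(inner_clicks), heatmap(outer_clicks)
```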
(II) Background replacement (image synthesis). To synthesize background replacement images comparable to real pictures, the system uses an unsupervised GAN network for model training.
1. Generator network. Modules 1 to 5 above can collectively be regarded as the work done by the generator model. The image soft segmentation completed in the first part yields the foreground image and the alpha prediction, and the foreground is pasted onto a new background to synthesize an image.
The generator uses a "guiding and guided" model. The training set of the guiding model is a synthetic dataset comprising several foregrounds (F_s), annotated alpha mattes (α_s) and backgrounds (B_s) from the COCO dataset, on which supervised learning is carried out. To avoid the system over-relying on learning the difference between I_s and B_s, gamma correction and Gaussian blur are applied to the backgrounds B_s to prevent overfitting, thereby obtaining G_teacher as the guiding model. The loss function is as follows:
[Equation image in original: loss_1, the supervised loss of the guiding model; its exact form is not recoverable from the text]
G_teacher then acts as a "pseudo ground-truth" supervisor. Under the guidance of this pseudo ground-truth, the guided model G_student is trained in a self-supervised manner on a real dataset. The loss function is as follows:
loss_2 = (D_disc(αF + (1 - α)B) - 1)^2
the generator loss function of the training network is the minimum loss1And loss2But the first penalty is given less weight. The initial λ is set to 0.02, and the zoom out 1/2 is performed every five iterations. Network selection ADAM optimizer to avoid network trappingOf the local minima, and a better minimum for the real data is found nearby. The generator losses are as follows:
Figure BDA0003126659490000062
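Since both loss equations survive only as image placeholders, the sketch below reconstructs the student objective from the surrounding text; the L1 form of the supervised term against the teacher's pseudo ground-truth is an assumption, while the least-squares adversarial term and the λ schedule follow the text:

```python
import torch  # inputs below are torch tensors

def generator_loss(alpha, fg, new_bg, alpha_t, fg_t, disc, lam):
    """Student generator objective reconstructed from the text:
    lam * loss_1 (supervised against the teacher's pseudo ground-truth,
    exact form assumed) + loss_2 (least-squares GAN term on the composite)."""
    composite = alpha * fg + (1 - alpha) * new_bg         # image equation
    loss_1 = (alpha - alpha_t).abs().mean() + (fg - fg_t).abs().mean()
    loss_2 = ((disc(composite) - 1) ** 2).mean()          # (D(comp) - 1)^2
    return lam * loss_1 + loss_2

def lam_at(step, lam0=0.02):
    """Lambda schedule from the text: start at 0.02, halve every 5 iterations."""
    return lam0 * 0.5 ** (step // 5)
```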
2. Discriminator network. The discriminator needs to judge whether a sample is real or synthesized, and its parameters are fine-tuned by back-propagation. To improve the background replacement effect in real scenes, the system uses a multi-scale discriminator based on pix2pixHD. Each scale of the discriminator comprises 3 linear discriminators; each linear discriminator is a fully convolutional network composed of several groups of convolution, BatchNorm and LeakyReLU. The discriminator's 3 scales are: the original size, 1/2 of the original, and 1/4 of the original. The benefit is analogous to the PSP: the coarser scales have larger receptive fields and more easily judge global consistency, while the finer scales have smaller receptive fields and more easily judge details such as color and texture.
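A sketch of such a pix2pixHD-style arrangement is given below, simplified to one fully convolutional patch discriminator per scale; depths and widths are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=2):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride, 1),
                         nn.BatchNorm2d(cout),
                         nn.LeakyReLU(0.2, inplace=True))

class MultiScaleDiscriminator(nn.Module):
    """Sketch of the multi-scale discriminator: identical fully
    convolutional discriminators (convolution + BatchNorm + LeakyReLU
    stacks) applied at the original, 1/2 and 1/4 resolutions."""
    def __init__(self, in_ch=3):
        super().__init__()
        def one():
            return nn.Sequential(conv_block(in_ch, 64),
                                 conv_block(64, 128),
                                 conv_block(128, 256),
                                 nn.Conv2d(256, 1, 4, 1, 1))  # patch scores
        self.nets = nn.ModuleList(one() for _ in range(3))

    def forward(self, x):
        outs = []
        for net in self.nets:
            outs.append(net(x))
            x = F.avg_pool2d(x, 2)  # next scale: 1/2, then 1/4
        return outs
```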
3. Finally, the generator model and the discriminator model are fine-tuned within the unsupervised GAN framework: the generator network continually improves its fit to the distribution of the real data and the discriminator network continually improves its discrimination ability, until Nash equilibrium is finally reached. After training, the system can produce a high-resolution background replacement picture.
The invention comprises 7 modules in two parts. The global combination module combines different representation information and improves segmentation quality; the residual network, combined with atrous convolution and the PSP, performs pixel-level prediction and fuses multi-scale features; the lightweight interactive module can guide model training and improve adaptability; and a system combining image soft segmentation and image generation based on a GAN network is proposed and applied to background replacement. The above are all key points to be protected.
All technical solutions formed by equivalent transformation or equivalent replacement fall within the protection scope of the present invention and are not described in detail here.

Claims (2)

1. An image soft segmentation and background replacement system based on a GAN network, characterized by comprising an image soft segmentation part and a background replacement part; the image soft segmentation part predicts the foreground and alpha values of the image and performs the soft segmentation operation; the background replacement part generates a high-resolution composite image; wherein:
one) image soft segmentation
The image soft segmentation part comprises five modules: an input module, a full-text combination module, a residual network module, a pyramid scene parsing module and a lightweight interactive branch module; wherein:
the input module inputs an original image I, a background image B and a target soft segmentation image S; obtaining an initial soft segmentation of the subject object by the target soft segmentation image S through corrosion, expansion and Gaussian blur;
the full-text combination module is used for firstly respectively coding an original image I, a background image B and a target soft segmentation image S into 512 x 256 feature maps, then respectively combining the background image B and the target soft segmentation image S by taking the original image I as a base to form two 512-channel feature maps, respectively extracting 64-channel feature maps through convolution, Batchnorm and ReLU, finally combining the base and the two 64-channel feature maps to form 384 channels, and extracting 256-channel feature maps through convolution, Batchnorm and ReLU to serve as the input of a next residual error network module;
a residual network module comprising a main residual module and two subsequent lightweight branch residual modules; the backbone network selects the architecture of ResNet-101, certainly deleting the fully connected layer and the maximum pooling layer, and introducing the aperture constraint in the last two stages to ensure that pixel-level prediction is performed and an acceptable output resolution is achieved; the main residual error module belongs to shared residual error content and aims to obtain a deeper feature map, and the two light-weight branch residual error networks behind the main residual error module are respectively used for foreground prediction and alpha prediction, so that two feature maps are finally obtained, and multi-scale feature fusion is carried out subsequently through the PSP (pyramid scene analysis) module;
the pyramid scene analysis module PSP obtains a deep feature map through a trunk residual error network and a branch residual error network with atrous convergence, then pyramids with four sizes are used, kernel used for pooling is respectively 1 × 1, 2 × 2, 3 × 3 and 6 × 6, after pooling, dimension reduction and bilinear difference value upsampling are carried out through a group of 1 × 1 convolution, feature map size output by the residual error network is reduced, then the obtained feature maps, including feature maps before pooling, are cascaded, and multi-scale feature fusion is completed; finally, the foreground prediction branch residual error module uses the ReLU to obtain a foreground prediction characteristic diagram, and the Alpha prediction branch uses Tanh to obtain an Alpha prediction diagram;
the light-weight interactive branch module is attached in front of the PSP (pyramid scene analysis) module and used for receiving possible additional guidance information and supporting generalization on extreme special cases; allowing the user to operate on the original image, clicking inside the target object to generate an internal guide, and clicking on a positive or negative diagonal of the target object to generate an external guide; respectively placing a two-dimensional Gaussian function near the inner point and the outer point to prepare two inner and outer guide heat maps, coding the inner and outer guide heat maps into a characteristic map and combining the characteristic map with an output characteristic map of a residual error network, and enabling a user to select whether to execute interaction or not;
two) background replacement
The background replacement part comprises a generator model and a discriminator model that together form an unsupervised GAN framework; the generator and discriminator models are fine-tuned within this framework, the generator network continually improving its fit to the distribution of the real data and the discriminator network continually improving its discrimination ability, until Nash equilibrium is finally reached; after training, the system can obtain a high-quality background replacement image; wherein:
the generator network combines the foreground to a new background to synthesize and generate a picture based on the foreground picture and alpha prediction obtained by the image soft segmentation part; the generator network uses a "guide and guided" model; the training set of the guiding model is a synthetic data set comprising several foregrounds FAlpha matte of the tag, i.e. alphaBackground B from coco datasetPerforming generator model training on background BIntroduction of rectification and Gaussian blur to prevent overfitting to avoid excessive bias of the system and to learn IAnd BThereby obtaining G with supervised learningteacherAs a guidance model; with GteacherServing as 'pseudo ground-truth', performing model training in a real scene under the condition of comparing with the 'pseudo ground-truth', and performing self-supervision training on a guided model by adopting a real data set to obtain GstudentAs a guided model; the guiding model and the guided model share the same loss function, the first loss being given less weight; using an ADAM optimizer to avoid the network from falling into a local minimum, and finding a better minimum for real data nearby;
the discriminator network is used for training the label-free data of the real scene by using the countermeasure training based on the multi-scale discriminator and discriminating whether the foreground result is a real sample or a synthesized sample after being pasted on a new synthesized image formed on the background; the multi-scale discriminator discriminates on three different scales which are respectively as follows: original, 1/2 for original, 1/4 for original; each scale of the multi-scale discriminator uses 3 linear discriminators, each linear discriminator comprises a full convolution network which consists of a plurality of groups of convolutions, BatchNorm and Leaky ReLU.
2. The image soft segmentation and background replacement system of claim 1, wherein, in the generator,
guidance model GteacherThe loss function of (a) is as follows:
[Equation image in original: loss_1, the supervised loss of the guiding model; its exact form is not recoverable from the text]
guided model GstudentThe loss function of (a) is as follows:
loss_2 = (D_disc(αF + (1 - α)B) - 1)^2
the generator loss function of the training network is the minimum loss1And loss2Summing; the initial λ is set to 0.02, and every five iterations of minification 1/2, the network selects an ADAM optimizer that avoids the network from falling into local minima, while nearbyFinding a better minimum value for the real data; the generator losses are as follows:
Figure FDA0003126659480000031
CN202110692455.4A 2021-06-22 2021-06-22 Image soft segmentation and background replacement system based on GAN network Active CN113538456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110692455.4A CN113538456B (en) 2021-06-22 2021-06-22 Image soft segmentation and background replacement system based on GAN network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110692455.4A CN113538456B (en) 2021-06-22 2021-06-22 Image soft segmentation and background replacement system based on GAN network

Publications (2)

Publication Number Publication Date
CN113538456A CN113538456A (en) 2021-10-22
CN113538456B (en) 2022-03-18

Family

ID=78125625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110692455.4A Active CN113538456B (en) 2021-06-22 2021-06-22 Image soft segmentation and background replacement system based on GAN network

Country Status (1)

Country Link
CN (1) CN113538456B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102135478B1 (en) * 2018-12-04 2020-07-17 NHN Corporation (엔에이치엔 주식회사) Method and system for virtually dying hair

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107730528A (en) * 2017-10-28 2018-02-23 天津大学 A kind of interactive image segmentation and fusion method based on grabcut algorithms
CN110188760A (en) * 2019-04-01 2019-08-30 上海卫莎网络科技有限公司 A kind of image processing model training method, image processing method and electronic equipment
CN110136163A (en) * 2019-04-29 2019-08-16 中国科学院自动化研究所 The fuzzy automatic stingy figure of hand exercise and human body it is soft segmentation and replacing background application
CN110232696A (en) * 2019-06-20 2019-09-13 腾讯科技(深圳)有限公司 A kind of method of image region segmentation, the method and device of model training
CN110334779A (en) * 2019-07-16 2019-10-15 大连海事大学 A kind of multi-focus image fusing method based on PSPNet detail extraction
CN112365514A (en) * 2020-12-09 2021-02-12 辽宁科技大学 Semantic segmentation method based on improved PSPNet

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"An Automatic Glioma Segmentation System Using a Multilevel Attention Pyramid Scene Parsing Network";Zhenyu Zhang等;《Current Medical Imaging》;20210601;第17卷(第6期);751-761页 *
"人物前景和背景分离的研究与实现";李小芳;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20200715;正文18-50页 *

Also Published As

Publication number Publication date
CN113538456A (en) 2021-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant