CN115457266A - High-resolution real-time automatic green screen image matting method and system based on attention mechanism - Google Patents

High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Info

Publication number
CN115457266A
Authority
CN
China
Prior art keywords
image
feature
loss
green
matting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211029515.5A
Other languages
Chinese (zh)
Inventor
李兆歆
靳悦
朱登明
石敏
王兆其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Institute of Computing Technology of CAS
Priority to CN202211029515.5A
Publication of CN115457266A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a high-resolution real-time automatic green-screen matting method and system based on an attention mechanism. A deep learning approach that combines a feature extraction model with an attention mechanism removes the need for manual parameter adjustment and enables fully automatic processing. Real-time performance is achieved by using a lightweight encoder, performing green-screen matting at low resolution, and then restoring the original resolution with a high-resolution processing module. To obtain a more refined matting result, skip connections are used to mitigate the loss of low-level features. The network model is trained on a dedicated green-screen matting dataset, so the feature extraction module also learns the features needed to remove green spill.

Description

High-resolution real-time automatic green screen image matting method and system based on attention mechanism
Technical Field
The invention relates to the technical field of image processing, in particular to a high-resolution real-time automatic green screen image matting method based on an attention mechanism.
Background
Green-screen keying refers to shooting pictures or videos against a green screen so that the foreground can later be extracted and composited. Green-screen matting is used in film and television production, including Hollywood blockbusters, and is widely applied in everyday scenarios such as virtual studios and online live streaming. Because of its many application scenarios and high commercial value, green-screen matting has been studied extensively by researchers and companies at home and abroad.
Current green-screen matting methods fall into two main categories: traditional methods and deep-learning-based methods. Traditional green-screen matting includes chroma keying, color-difference keying, luminance keying, and triangulation matting, and most commercial software is built on these techniques. Chroma keying estimates the opacity of each pixel from the chroma difference between the foreground object and a key color provided by the user. Color-difference keying performs green-screen matting from color differences and is mainly suited to scenes in which the background color channel dominates the other two. Luminance keying relies on differences in image brightness and is typically used to extract very bright or self-illuminating foregrounds such as smoke or sparks. Triangulation matting performs green-screen matting by shooting the same foreground against different backgrounds. Deep-learning-based methods are mostly designed for natural-image matting and include trimap-based, background-based, and automatic methods. CF is an affinity-based method that assumes local image regions are smooth, so neighboring pixels with similar colors have similar alpha values; the alpha value of an unknown pixel can therefore be estimated by sampling its known foreground and background neighbors. KNN is also an affinity-based method that exploits local smoothness to propagate alpha values from known regions to unknown regions. FBA proposes low-cost modifications to alpha-matting networks to additionally predict foreground and background colors. LFP learns long-range context features beyond the receptive field. BGMV2 (mobile) requires a background image, aligned with the input image, as an additional input. MODNet is an automatic real-time method for human portraits, but the resolution at which it runs in real time is low. AIM designs a unified semantic representation module to guide matting and produce more accurate results.
Most commercial software for green-screen matting uses traditional methods, but these involve a large number of parameters that must be tuned by professionals, making automatic processing difficult; moreover, the higher the resolution, the longer the processing time, so real-time speed is hard to achieve. Among deep-learning-based methods, both trimap-based and background-based methods require the user to supply a trimap or background image as a prior; they are accurate but hard to automate. Automatic deep-learning methods can achieve automatic, high-resolution, real-time processing, but they are designed for natural-image matting and, when applied to green-screen footage, cannot remove the green cast (green spill) produced by light reflected from the green screen onto the subject.
Disclosure of Invention
To solve these problems, the invention provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism. A lightweight MobileNet model is combined with an attention module to extract features effectively, removing the need for manual adjustment and enabling automatic processing. Real-time processing at high resolution is achieved with a lightweight feature encoder and a high-resolution processing module. The network is trained on a dedicated green-screen matting dataset so that the model learns green-spill-related features and effectively removes green spill.
While studying high-resolution real-time automatic green-screen matting, the inventors found that traditional methods in the prior art generally require professionals to tune a large number of parameters. Research on natural-image matting showed that this drawback can be overcome with deep learning: combining a feature extraction model with an attention mechanism removes manual adjustment and enables automatic processing. Both traditional methods and existing deep-learning methods are generally slow on high-resolution material and struggle to run in real time; this drawback is overcome by using a lightweight encoder, performing green-screen matting at low resolution, and then restoring the original resolution with a high-resolution processing module. To achieve a more refined matting result, skip connections are used to mitigate the loss of low-level features. Existing deep-learning methods were also found to have difficulty removing green spill; this drawback is overcome by training the network model on a dedicated green-screen matting dataset, so the feature extraction module also learns the features needed to remove green spill.
The invention specifically provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism, comprising the following steps:
step 1, constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
step 2, the feature encoder downsampling a training image to obtain a low-resolution image, extracting image features from the low-resolution image, and generating intermediate features during extraction;
step 3, the ASPP module sampling the image features in parallel with atrous convolutions of different sampling rates, and the attention module performing feature extraction on the sampling result to obtain attention features;
step 4, the feature decoder decoding the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map (alpha map) of the low-resolution image;
step 5, taking the foreground map label and channel transparency map label of the low-resolution image as training targets, constructing a first loss based on the intermediate result, and training the neural network model;
step 6, adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
step 7, taking the foreground map label and channel transparency map label of the training image as training targets, constructing a second loss based on the matting result, and training the matting model; and inputting a green-screen image into the trained matting model to obtain a foreground map and a channel transparency map as the matting result of the green-screen image.
According to the attention-based high-resolution real-time automatic green-screen matting method, the training image is generated by selecting, from a green-screen dataset, a foreground image with green spill and the corresponding channel transparency map, and compositing them with a green-screen background image to obtain the training image.
In the attention-based high-resolution real-time automatic green-screen matting method, the feature encoder comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as the intermediate features; the feature decoder comprises a plurality of convolutional layers, each of which upsamples the features of the previous layer and concatenates them with the intermediate features;
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
In the attention-based high-resolution real-time automatic green-screen matting method, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a high-resolution real-time automatic green screen image matting system based on an attention mechanism, which comprises the following components:
the module 1 is used for constructing a neural network model comprising a feature encoder, a cavity space convolution pooling pyramid module, an attention module and a feature decoder;
a module 2, configured to sample the training image by the feature encoder to obtain a low-definition image, extract image features of the low-definition image, and generate an intermediate feature in an extraction process;
the module 3 is used for enabling the void space convolution pooling pyramid module to carry out parallel sampling on the image features by void convolution with different sampling rates, and the attention module carries out feature extraction on sampling results to obtain attention features;
a module 4, configured to enable the feature decoder to decode the attention feature according to the intermediate feature, so as to obtain an intermediate result including the foreground map of the low-definition image after eliminating green overflow and the channel transparency map of the low-definition image;
a module 5, configured to construct a first loss based on the intermediate result and train the neural network model, with foreground map labels and channel transparent map labels of the low-definition image as training targets;
a module 6, configured to add a high resolution processing module to the neural network model output after the training is completed to obtain a matting model, where the high resolution processing module restores the intermediate result to a resolution that is the same as that of the training image to obtain a matting result that includes a foreground image of the training image after the green overflow is eliminated and a channel transparency image of the training;
a module 7, configured to construct a second loss based on the matting result by using foreground image labels and channel transparency image labels of the training image as training targets, and train the matting model; and inputting the green screen image into the trained image matting model to obtain a foreground image and a channel transparent image as the image matting result of the green screen image.
The high-resolution real-time automatic green curtain image matting system based on the attention mechanism is characterized in that the generation process of the training image is to select a foreground image with green overflow and a corresponding channel transparent image in a green curtain data set and synthesize the foreground image with a green curtain background image to obtain the training image.
The high-resolution real-time automatic green screen matting system based on the attention mechanism is characterized in that the feature encoder comprises a plurality of convolution layers, and features behind each convolution layer are reserved as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling a feature of a previous layer and connecting with the intermediate feature;
the intermediate results also include an error map and implicit feature, hidden, of the low-definition image, the first penalties including L1 penalty, gradient penalty, and laplacian penalty for the channel transparency map, L1 penalty and laplacian penalty for the foreground map, and L2 penalty for the error map.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a storage medium for storing a program for executing any one of the high-resolution real-time automatic green curtain image matting methods based on the attention mechanism.
The invention also provides a client used for the high-resolution real-time automatic green screen image matting system based on the attention mechanism.
Compared with the prior art, the invention has the following advantages:
The method is particularly effective for real-time automatic green-screen matting at high resolution. A qualitative comparison with recent matting methods CF, KNN, FBA, MODNet, LFP, BGMV2 (mobile) and AIM is shown in FIG. 2, and a quantitative comparison is given in Table 1. Extensive comparative experiments demonstrate the effectiveness of the proposed solution.
The proposed scheme is compared with these seven matting methods. CF and KNN are traditional, trimap-based matting methods; the remaining five are deep-learning-based, of which FBA and LFP are trimap-based, BGMV2 (mobile) is background-based, and MODNet and AIM are automatic matting methods. As shown in FIG. 2, the CF method performs poorly on holes; KNN performs better but cannot remove green spill. FBA and LFP produce finer results but also have difficulty removing green spill, and because they require a trimap as auxiliary input they cannot be used for video matting or run in real time. BGMV2 (mobile) depends on the quality of the input background and is therefore more limited. AIM and MODNet are automatic, but their results are not of high quality, and MODNet only handles portraits, which is a significant limitation. The proposed method achieves a realistic green-screen matting result. As shown in Table 1, the comparison uses several metrics. The proposed method performs well on every metric; although FBA and LFP score better on some metrics, both are trimap-based, require additional manual input, cannot run in real time, and cannot remove green spill. In addition, as shown in Table 2, these two models have many parameters and are slow. In terms of speed, as shown in Table 1, the proposed method has fewer model parameters than other deep-learning-based methods and achieves a matting speed of about 80 fps at 2K resolution (running on a server with an Nvidia GeForce RTX 3090). In summary, the proposed attention-based high-resolution real-time automatic green-screen matting method is highly effective.
Table 1: Quantitative comparison with other matting methods (table provided as an image in the original document)
Table 2: Comparison of model parameters and sizes (table provided as an image in the original document)
Drawings
FIG. 1 is a diagram of a network model architecture;
FIG. 2 is a qualitative comparison of the matting method of the present invention with other matting methods.
Detailed Description
The invention aims to solve the problems that most existing methods require human participation and are therefore difficult to automate, and that processing speed is low at high resolution. It provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism, comprising the following steps:
step 1, reading an input green-screen image or green-screen video frame and downsampling it;
step 2, building a feature encoder and extracting features from the downsampled green-screen image;
step 3, adding an atrous spatial pyramid pooling (ASPP) module and an attention (SE) module after the encoder to further process the extracted features;
step 4, building a feature decoder to process the acquired feature information, obtain high-level features rich in semantic information and relevant to matting the object, and restore the feature map by upsampling to the downsampled resolution of step 1;
step 5, forming a first-stage training model from the feature encoder, the ASPP module, the attention module and the feature decoder, the model outputting an intermediate result and being trained with a first-stage loss function;
step 6, in the second stage, performing high-resolution processing on the intermediate result output by the first stage so as to restore the original resolution;
step 7, training the second-stage model, i.e. training the DGF module while also updating the first-stage training model, to obtain the final green-screen matting result.
in the step 1, the green screen image or the green screen video frame is synthesized into the green screen image in the training stage, specifically, a corresponding foreground image and an Alpha image with green overflow are randomly selected from the special green screen data set, and a green screen background image is randomly selected for synthesis. In the test phase, the green screen image or video frame actually shot by the camera is referred to. The down-sampling adopts a bilinear method to reduce the resolution to half of the original resolution.
Building the feature encoder in step 2 means building an encoder based on the MobileNetV2 model provided officially by PyTorch. The encoder also performs downsampling; it comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as intermediate features for the subsequent skip connections.
The ASPP module in step 3 is an atrous spatial pyramid pooling module, which enlarges the network's receptive field without changing the feature resolution. The attention module uses SENet, which allows the model to focus on the relationships between channels and automatically learn the importance of the different channel features.
Building the feature decoder in step 4 means building four convolutional layers; each convolutional layer upsamples the features of the previous layer and is connected (skip connection) with the corresponding intermediate features from step 2. Upsampling uses bilinear interpolation to double the resolution.
The intermediate result in step 5 comprises the foreground map with green spill removed, the Alpha map, the error map, and the hidden feature, all at low resolution. The loss function includes an L1 loss, a gradient loss and a Laplacian loss on the Alpha map, an L1 loss and a Laplacian loss on the foreground map, and a loss on the error map; the error map helps improve the fineness of the matting. The foreground map is the target image; the Alpha map corresponds to the black-and-white maps in FIG. 2; the error map is the difference between the predicted Alpha map and its ground truth; and the hidden feature is a redundant output feature used for subsequent high-resolution processing.
The high-resolution processing in step 6 adds a deep guided filtering (DGF) module after the first-stage model as the high-resolution processing module.
In the loss function used to train the second-stage model in step 7, each loss term is additionally computed at high resolution. The final green-screen matting result is the foreground map with green spill removed and the corresponding Alpha map at the original resolution.
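As an illustration only, the two-stage structure described in steps 1-7 above could be wired together as in the following PyTorch-style sketch; the module interfaces (an encoder returning shortcut features, a decoder returning four outputs, a DGF-style high-resolution module) are assumptions for illustration and not the patent's reference implementation.

import torch.nn as nn
import torch.nn.functional as F

class GreenScreenMattingNet(nn.Module):
    """Illustrative two-stage wiring: low-resolution matting followed by
    a high-resolution refinement module (interfaces are hypothetical)."""
    def __init__(self, encoder, aspp, se_block, decoder, hr_module):
        super().__init__()
        self.encoder = encoder      # e.g. MobileNetV2 feature extractor
        self.aspp = aspp            # atrous spatial pyramid pooling
        self.se = se_block          # squeeze-and-excitation attention
        self.decoder = decoder      # upsampling decoder with skip connections
        self.hr = hr_module         # e.g. deep guided filtering module

    def forward(self, image_hr):
        # Stage 1: matting at half resolution
        image_lr = F.interpolate(image_hr, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)
        feats, shortcuts = self.encoder(image_lr)
        feats = self.se(self.aspp(feats))
        fg_lr, alpha_lr, err_lr, hidden = self.decoder(feats, shortcuts)
        # Stage 2: restore the original resolution
        fg_hr, alpha_hr = self.hr(image_hr, image_lr, fg_lr, alpha_lr, hidden)
        return fg_hr, alpha_hr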
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment is a high-resolution real-time automatic green-screen matting method based on an attention mechanism. The main experimental environment is an Nvidia GeForce RTX 3090 GPU with PyTorch. The method is implemented in the following steps:
the method comprises the following steps:
A foreground image F with green spill and the corresponding Alpha map α are randomly selected from a green-screen matting dataset containing green spill, and a background image B is randomly selected from a green-screen background dataset. After the three are brought to a common resolution (H, W), a composite image C is generated according to the following compositing formula:
C = F × α + B × (1 - α) (1)
The composite image is then downsampled to half of its original resolution, i.e. to (0.5H, 0.5W), where DownSample() in formula (2) denotes the downsampling method and C′ denotes the downsampled image; C′ is used as the input of the network model.
C′=DownSample(C) (2)
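A minimal sketch of formulas (1) and (2), assuming the foreground, background and alpha are PyTorch tensors in [0, 1] with matching spatial size; the function name is illustrative.

import torch.nn.functional as Fn

def composite_and_downsample(fg, alpha, bg):
    """Compose a synthetic green-screen image C = F*alpha + B*(1-alpha)
    (formula (1)) and halve its resolution bilinearly (formula (2)).
    fg, bg: (N, 3, H, W) tensors in [0, 1]; alpha: (N, 1, H, W)."""
    comp = fg * alpha + bg * (1.0 - alpha)
    comp_lr = Fn.interpolate(comp, scale_factor=0.5,
                             mode="bilinear", align_corners=False)
    return comp, comp_lr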
Step two:
A first-stage network model is built, which performs green-screen matting at low resolution. The lightweight MobileNet is combined with an attention mechanism as the feature extraction encoder; a decoder is then built to match the encoder, and the decoder takes as input the features of the corresponding encoder layer, combined with the features of the previous decoder layer through skip connections. The decoder outputs the foreground map, the corresponding Alpha map, the error map, and the hidden feature for subsequent use.
a. Feature encoding
To achieve real-time processing speed, the feature encoder module uses the lightweight MobileNet as the feature extractor, which improves extraction speed while maintaining extraction quality. The MobileNetV2 model officially released with PyTorch is adopted and modified: dilation is used in the last module so that the output stride remains 16, and the classifier module originally used for classification is removed. This is expressed by the following formula, where Feature_m denotes the features extracted after the convolution modules of MobileNetV2 and Shortcuts denotes the intermediate features of each convolution block, retained for the subsequent skip connections.
Feature_m, Shortcuts = MobileNetV2(C′) (3)
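The following sketch shows one way to obtain MobileNetV2 features with retained intermediate features using torchvision; the block indices at which the backbone is split are illustrative assumptions, and the dilation modification that keeps the output stride at 16 is not reproduced here.

import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Encoder(nn.Module):
    """Wraps torchvision's MobileNetV2 feature stack and keeps intermediate
    features for skip connections. The split points below are assumptions."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v2(weights=None).features  # pretrained weights optional
        # Split the feature stack into stages; indices chosen for illustration.
        self.stages = nn.ModuleList([
            backbone[:2], backbone[2:4], backbone[4:7], backbone[7:14], backbone[14:],
        ])

    def forward(self, x):
        shortcuts = []
        for stage in self.stages:
            x = stage(x)
            shortcuts.append(x)
        # final features plus the earlier stages' outputs for skip connections
        return x, shortcuts[:-1]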
To enhance the network's ability to capture multi-scale context, an ASPP module is added after the MobileNet encoder. ASPP (atrous spatial pyramid pooling) samples the given input in parallel with atrous convolutions at different sampling rates, which is equivalent to capturing image context at multiple scales, i.e. it enlarges the network's receptive field without changing the resolution. This is expressed by the following formula, where Feature_aspp denotes the features after atrous spatial pyramid pooling.
Feature_aspp = ASPP(Feature_m) (4)
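A minimal ASPP sketch with parallel atrous convolutions; the dilation rates and channel widths are assumptions, not values specified by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: a 1x1 branch, several 3x3 branches with
    different dilation rates, and a global-pooling branch, concatenated and
    fused by a 1x1 projection. Rates (3, 6, 9) are an assumption."""
    def __init__(self, in_ch, out_ch, rates=(3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))] +
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
             for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = Fn.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                                align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))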
b. Attention mechanism
To achieve a more accurate matting result and extract the foreground object precisely, the proposed network adds an attention mechanism after the initial feature extraction. The basic idea of attention in computer vision is to let the model ignore irrelevant information and focus on key information. Among the various attention methods, this work adopts SENet (Squeeze-and-Excitation Network) to extract attention features. The SENet module focuses on the relationships between channels and lets the model automatically learn the importance of the different channel features. An SE module consists of two steps, squeeze and excitation: squeeze applies global average pooling on the feature map to obtain a global descriptor per channel; excitation passes it through a two-layer fully connected bottleneck to obtain a weight for each channel, and the reweighted feature map is used as the input of the next layer. This is expressed by the following formula, where Feature_se denotes the features after attention extraction.
Feature_se = SE(Feature_aspp) (5)
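A minimal squeeze-and-excitation block as described above; the reduction ratio of 16 is the common SENet default, assumed here for illustration.

import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling ("squeeze") followed by
    a two-layer bottleneck producing per-channel weights ("excitation")."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (N, C)
        w = self.fc(w).view(n, c, 1, 1)  # excitation: per-channel weights
        return x * w                     # reweight the feature map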
c. Feature decoding
The decoder network consists of four convolutional layers; every layer except the last is followed by a BN layer and a ReLU activation function. Before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the skip-connection features from the encoder. Taking the first convolutional layer of the decoder as an example, this is expressed by the following formulas, where UpSample() denotes the upsampling method used, Concat denotes feature concatenation, Shortcuts_i denotes the encoder features of the corresponding skip connection, and ConvBlock1 denotes the first convolution module of the decoder.
Feature_up = UpSample(Feature_se) (6)
Feature_cat = Concat(Feature_up, Shortcuts_i) (7)
Feature_conv1 = ConvBlock1(Feature_cat) (8)
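A sketch of one decoder step corresponding to formulas (6)-(8): bilinear upsampling, concatenation with the encoder skip feature, then a Conv-BN-ReLU block; channel sizes are left as parameters.

import torch
import torch.nn as nn
import torch.nn.functional as Fn

class DecoderBlock(nn.Module):
    """One decoder step: upsample, concatenate with the skip feature,
    then apply a Conv + BN + ReLU block."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = Fn.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                           align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)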
d. Model output and loss function
The decoder's output is divided into four items: the foreground map with green spill removed, the corresponding Alpha map, the error map, and the hidden feature for subsequent use. The output resolution is the low resolution (0.5H, 0.5W). The corresponding loss function is as follows:
The loss function L_lr is:
L_lr = L_l1^α + L_grad^α + L_lap^α + L_l1^F + L_lap^F + L_err (9)
wherein:
For the Alpha map, an L1 loss is first used to measure the difference between the predicted Alpha map α and its ground truth α*, where i denotes the pixel position.
L_l1^α = Σ_i |α_i - α*_i| (10)
The second loss is the gradient loss, where ∇ denotes the gradient operator.
L_grad^α = Σ_i |∇α_i - ∇α*_i| (11)
The third loss is the pyramidal Laplacian loss, where L_pyr^s denotes the Laplacian pyramid at level s.
L_lap^α = Σ_s ||L_pyr^s(α) - L_pyr^s(α*)||_1 (12)
The standard L1 loss and Laplacian loss are also used for the predicted foreground F, where F* denotes its ground truth. Only the visible foreground contributes to the loss, i.e. only pixels where the ground-truth alpha α* is greater than 0.
L_l1^F = Σ_i 1(α*_i > 0) |F_i - F*_i| (13)
L_lap^F = Σ_s ||L_pyr^s(F) - L_pyr^s(F*)||_1 (14)
A standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth.
L_err = ||err - err*||_2 (15)
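A sketch of the first-stage loss terms described above, assuming unit weights between terms and a simple average-pooling Laplacian pyramid; the patent does not specify these details.

import torch.nn.functional as Fn

def _laplacian_pyramid(x, levels=5):
    """Simple Laplacian pyramid built with average pooling; the pyramid
    construction is not fixed by the patent, so this is an assumption."""
    pyr, cur = [], x
    for _ in range(levels):
        down = Fn.avg_pool2d(cur, 2)
        up = Fn.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                            align_corners=False)
        pyr.append(cur - up)
        cur = down
    return pyr

def _grads(t):
    # horizontal and vertical finite differences
    return t[..., :, 1:] - t[..., :, :-1], t[..., 1:, :] - t[..., :-1, :]

def first_stage_loss(alpha, alpha_gt, fg, fg_gt, err, err_gt):
    """L_lr as described above: L1 + gradient + Laplacian losses on the alpha
    map, L1 + Laplacian losses on the foreground (restricted to pixels where
    the ground-truth alpha is positive, approximated here by zeroing the
    rest), and an L2 loss on the error map. Unit weights are an assumption."""
    l_alpha = Fn.l1_loss(alpha, alpha_gt)
    (gx, gy), (gx_gt, gy_gt) = _grads(alpha), _grads(alpha_gt)
    l_grad = Fn.l1_loss(gx, gx_gt) + Fn.l1_loss(gy, gy_gt)
    l_lap = sum(Fn.l1_loss(p, q) for p, q in
                zip(_laplacian_pyramid(alpha), _laplacian_pyramid(alpha_gt)))
    mask = (alpha_gt > 0).float()
    fg_m, fg_gt_m = fg * mask, fg_gt * mask
    l_fg = Fn.l1_loss(fg_m, fg_gt_m)
    l_fg_lap = sum(Fn.l1_loss(p, q) for p, q in
                   zip(_laplacian_pyramid(fg_m), _laplacian_pyramid(fg_gt_m)))
    l_err = Fn.mse_loss(err, err_gt)
    return l_alpha + l_grad + l_lap + l_fg + l_fg_lap + l_err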
Step three:
training is carried out on the basis of the first-stage model, after the model converges, a high-resolution processing module DGF is added on the basis of the first-stage model, and training is continued until the model converges.
a. First stage model training
Training uses the first-stage network model built in step two. The training data are a green-screen matting dataset and a green-screen background dataset; the ground truth of the green-screen matting dataset comprises the original green-screen image, the foreground map with green spill removed, the corresponding Alpha map, and the foreground map with green spill. The foreground map with green spill, the corresponding Alpha map, and a randomly selected green-screen background image are each given random data augmentation and then composited; the augmentation operations include flipping, rotation, translation, and hue, brightness and saturation changes. The composite image is downsampled as required and fed into the first-stage network model for training until convergence.
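A sketch of the paired augmentation step, applying the same geometric transform to the foreground and its Alpha map and photometric jitter to foreground and background separately; the parameter ranges are assumptions.

import random
import torchvision.transforms.functional as TF

def augment_pair(fg, alpha, bg):
    """Randomly augment a foreground/alpha pair and a green-screen background
    before compositing. The operations mirror those listed above (flip,
    rotation, translation, colour jitter); ranges are illustrative."""
    if random.random() < 0.5:                      # horizontal flip
        fg, alpha = TF.hflip(fg), TF.hflip(alpha)
    angle = random.uniform(-10, 10)                # small rotation
    dx, dy = random.randint(-20, 20), random.randint(-20, 20)
    fg = TF.affine(fg, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0])
    alpha = TF.affine(alpha, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0])
    # photometric jitter applied to foreground and background independently
    fg = TF.adjust_brightness(fg, random.uniform(0.9, 1.1))
    fg = TF.adjust_saturation(fg, random.uniform(0.9, 1.1))
    fg = TF.adjust_hue(fg, random.uniform(-0.02, 0.02))
    bg = TF.adjust_brightness(bg, random.uniform(0.9, 1.1))
    return fg, alpha, bg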
b. High resolution module optimization
The first-stage network model already realizes automatic attention-based green-screen matting, but it is difficult to reach real-time speed when processing high-resolution material. Model training is therefore divided into two stages: the first stage performs automatic green-screen matting at low resolution, and the second stage adds the lightweight high-resolution processing module DGF once the first stage has converged, restoring the low-resolution result to high resolution and thereby achieving real-time automatic green-screen matting at high resolution.
The high-resolution processing module adopts a deep guided filtering (DGF) module. The traditional guided filtering algorithm achieves the edge-preserving smoothing of bilateral filtering while behaving well near detected edges, and is applied in scenarios such as image enhancement, HDR compression, matting and dehazing. Deep guided filtering is a guided filtering method implemented with deep learning, which can efficiently generate a high-resolution output from the corresponding low-resolution output and a high-resolution guide image. This is expressed by the following formula, where DGF denotes the deep guided filtering module employed, F_lr, α_lr denote the first-stage model output at low resolution, and F_hr, α_hr denote the second-stage model output at high resolution.
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden) (16)
c. Model output and loss function
The output of the second-stage model is the high-resolution foreground map with green spill removed and the corresponding Alpha map. In addition to the first-stage loss terms, the loss function includes the corresponding terms at high resolution, as shown below.
L_l1^α,hr = Σ_i |α_hr,i - α*_hr,i| (17)
L_grad^α,hr = Σ_i |∇α_hr,i - ∇α*_hr,i| (18)
L_lap^α,hr = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α*_hr)||_1 (19)
L_l1^F,hr = Σ_i 1(α*_hr,i > 0) |F_hr,i - F*_hr,i| (20)
L_lap^F,hr = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F*_hr)||_1 (21)
The loss function L_hr is:
L_hr = L_lr + L_l1^α,hr + L_grad^α,hr + L_lap^α,hr + L_l1^F,hr + L_lap^F,hr (22)
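A sketch of the two-stage training schedule described in step three, assuming an Adam optimiser, fixed epoch counts in place of the convergence criterion, and a loader that yields high- and low-resolution images with their targets; all of these are illustrative assumptions.

import torch

def train_two_stage(stage1, dgf, loader, loss_lr, loss_hr,
                    epochs1, epochs2, lr=1e-4):
    """Two-stage schedule: train the low-resolution model first, then attach
    the DGF module and keep updating both. Optimiser, learning rate and
    loader format are assumptions, not values from the patent."""
    opt1 = torch.optim.Adam(stage1.parameters(), lr=lr)
    for _ in range(epochs1):                       # stage 1: low resolution only
        for image_hr, image_lr, targets_lr, _ in loader:
            out_lr = stage1(image_lr)              # (fg_lr, alpha_lr, err, hidden)
            loss = loss_lr(out_lr, targets_lr)
            opt1.zero_grad(); loss.backward(); opt1.step()

    opt2 = torch.optim.Adam(list(stage1.parameters()) + list(dgf.parameters()), lr=lr)
    for _ in range(epochs2):                       # stage 2: add high-res refinement
        for image_hr, image_lr, targets_lr, targets_hr in loader:
            fg_lr, alpha_lr, err, hidden = stage1(image_lr)
            fg_hr, alpha_hr = dgf(image_hr, image_lr, fg_lr, alpha_lr, hidden)
            loss = (loss_lr((fg_lr, alpha_lr, err, hidden), targets_lr)
                    + loss_hr((fg_hr, alpha_hr), targets_hr))
            opt2.zero_grad(); loss.backward(); opt2.step()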
the following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above embodiments.
The invention also provides a high-resolution real-time automatic green-screen matting system based on an attention mechanism, comprising:
module 1, for constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
module 2, for having the feature encoder downsample a training image to obtain a low-resolution image, extract image features from the low-resolution image, and generate intermediate features during extraction;
module 3, for having the ASPP module sample the image features in parallel with atrous convolutions of different sampling rates, the attention module performing feature extraction on the sampling result to obtain attention features;
module 4, for having the feature decoder decode the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map of the low-resolution image;
module 5, for constructing a first loss based on the intermediate result, with the foreground map label and channel transparency map label of the low-resolution image as training targets, and training the neural network model;
module 6, for adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
module 7, for constructing a second loss based on the matting result, with the foreground map label and channel transparency map label of the training image as training targets, and training the matting model; a green-screen image is then input into the trained matting model, and the resulting foreground map and channel transparency map are taken as the matting result of the green-screen image.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by selecting, from a green-screen dataset, a foreground image with green spill and the corresponding channel transparency map, and compositing them with a green-screen background image to obtain the training image.
In the attention-based high-resolution real-time automatic green-screen matting system, the feature encoder comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as the intermediate features; the feature decoder comprises a plurality of convolutional layers, each of which upsamples the features of the previous layer and concatenates them with the intermediate features;
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a storage medium for storing a program for executing any one of the high-resolution real-time automatic green curtain image matting methods based on the attention mechanism.
The invention also provides a client used for the high-resolution real-time automatic green screen image matting system based on the attention mechanism.

Claims (10)

1. A high-resolution real-time automatic green-screen matting method based on an attention mechanism, characterized by comprising the following steps:
step 1, constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
step 2, the feature encoder downsampling a training image to obtain a low-resolution image, extracting image features from the low-resolution image, and generating intermediate features during extraction;
step 3, the ASPP module sampling the image features in parallel with atrous convolutions of different sampling rates, and the attention module performing feature extraction on the sampling result to obtain attention features;
step 4, the feature decoder decoding the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map of the low-resolution image;
step 5, taking the foreground map label and channel transparency map label of the low-resolution image as training targets, constructing a first loss based on the intermediate result, and training the neural network model;
step 6, adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
step 7, taking the foreground map label and channel transparency map label of the training image as training targets, constructing a second loss based on the matting result, and training the matting model; and inputting a green-screen image into the trained matting model to obtain a foreground map and a channel transparency map as the matting result of the green-screen image.
2. The attention-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the training image is generated by selecting a foreground image with green overflow and a corresponding channel transparency image from a green screen data set and synthesizing the foreground image with a green screen background image to obtain the training image.
3. The attention-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the feature encoder includes a plurality of convolutional layers, and the features output by each convolutional layer are retained as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling the features of the previous layer and concatenating them with the intermediate features;
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
4. The attention mechanism-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the training image generation process is to randomly select a foreground image F with green overflow and a corresponding channel transparency image α from a green screen matting data set containing green overflow, randomly select a background image B of a green screen background data set, unify resolutions of the three, and generate a composite image C according to a composite formula:
C=F×α+B(1-α)
performing down-sampling processing on the synthetic image C to obtain a low-resolution image C';
the feature extraction process of the feature encoder is shown in the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block;
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling of the cavity space convolution pooling pyramid module is as follows:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the cavity space convolution pooling pyramid module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention feature:
Feature_se = SE(Feature_aspp)
the feature decoder includes a plurality of convolutional layers, each convolutional layer except the last being followed by a BN layer and a ReLU activation function; before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process of the feature decoder is shown as follows, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder;
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss is:
L_lr = L_lr^α,l1 + L_lr^α,grad + L_lr^α,lap + L_lr^F,l1 + L_lr^F,lap + L_err
wherein the first loss comprises an L1 loss L_lr^α,l1, a gradient loss L_lr^α,grad and a Laplacian loss L_lr^α,lap on the channel transparency map, an L1 loss L_lr^F,l1 and a Laplacian loss L_lr^F,lap on the foreground map, and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-definition channel transparency map α_lr and the channel transparency map label (Ground Truth) α_lr^gt of the low-definition image, where i denotes the pixel position:
L_lr^α,l1 = Σ_i |α_lr,i - α_lr,i^gt|
the gradient loss is as follows, where ∇ denotes the gradient:
L_lr^α,grad = ||∇α_lr - ∇α_lr^gt||_1
the Laplacian loss is as follows, where L_pyr^s denotes the s-th level of the Laplacian pyramid and s denotes the pyramid level:
L_lr^α,lap = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α_lr^gt)||_1
the L1 loss and the Laplacian loss on the foreground map are as follows, where F_lr^gt denotes the foreground map label (Ground Truth) of the low-definition image and F_lr denotes the predicted low-definition foreground map:
L_lr^F,l1 = Σ_i |F_lr,i - F_lr,i^gt|
L_lr^F,lap = Σ_s ||L_pyr^s(F_lr) - L_pyr^s(F_lr^gt)||_1
a standard L2 loss is used for the error map, where err denotes the predicted error map and err^gt denotes the Ground Truth of the error map:
L_err = ||err - err^gt||_2
err^gt = |α_lr - α_lr^gt|
the high-resolution output of the second-stage model is F_hr, α_hr, as shown in the following formula, where DGF denotes the deep guided filter module employed and F_lr, α_lr, hidden denote the low-resolution intermediate results:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the following terms computed on the high-resolution output: an L1 loss L_hr^α,l1, a gradient loss L_hr^α,grad and a Laplacian loss L_hr^α,lap on the channel transparency map, and an L1 loss L_hr^F,l1 and a Laplacian loss L_hr^F,lap on the foreground map:
L_hr = L_lr + L_hr^α,l1 + L_hr^α,grad + L_hr^α,lap + L_hr^F,l1 + L_hr^F,lap
wherein:
L_hr^α,l1 = Σ_i |α_hr,i - α_hr,i^gt|
L_hr^α,grad = ||∇α_hr - ∇α_hr^gt||_1
L_hr^α,lap = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α_hr^gt)||_1
L_hr^F,l1 = Σ_i |F_hr,i - F_hr,i^gt|
L_hr^F,lap = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F_hr^gt)||_1
where α_hr^gt and F_hr^gt denote the channel transparency map label and the foreground map label (Ground Truth) of the training image.
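As an illustration of the decoding step defined by the formulas in claim 4 (bilinear upsampling, concatenation with Shortcuts_i, then a convolution block followed by BN and ReLU), a minimal sketch follows; the kernel size and channel counts are assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Conv + BN + ReLU block (ConvBlock_i in the claim's notation).
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, shortcut):
        # UpSample(Feature_se): bilinear upsampling to the skip-feature size.
        x = nnf.interpolate(x, size=shortcut.shape[-2:], mode="bilinear",
                            align_corners=False)
        # Concat(Feature_up, Shortcuts_i): channel-wise concatenation.
        x = torch.cat([x, shortcut], dim=1)
        # ConvBlock_i(Feature_cat).
        return self.conv(x)
```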
5. a high-resolution real-time automatic green screen matting system based on an attention mechanism is characterized by comprising:
the module 1 is used for constructing a neural network model comprising a feature encoder, a cavity space convolution pooling pyramid module, an attention module and a feature decoder;
a module 2, configured to enable the feature encoder to downsample the training image to obtain a low-definition image, extract the image features of the low-definition image and generate intermediate features during the extraction process;
a module 3, configured to enable the void space convolution pooling pyramid module to perform parallel sampling on the image features with dilated (atrous) convolutions at different sampling rates, and enable the attention module to perform feature extraction on the sampling result to obtain an attention feature;
a module 4, configured to enable the feature decoder to decode the attention feature according to the intermediate feature, so as to obtain an intermediate result that includes the foreground map of the low-definition image after green overflow is eliminated and the channel transparency map of the low-definition image;
a module 5, configured to construct a first loss based on the intermediate result and train the neural network model, with a foreground map label and a channel transparent map label of the low-definition image as training targets;
a module 6, configured to add a high-resolution processing module to the trained neural network model to obtain a matting model, wherein the high-resolution processing module restores the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground image of the training image with green overflow eliminated and a channel transparency image of the training image;
a module 7, configured to construct a second loss based on the matting result by using the foreground image label and the channel transparency image label of the training image as training targets, and train the matting model; and inputting the green screen image into the trained image matting model to obtain a foreground image and a channel transparent image as the image matting result of the green screen image.
6. The attention-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the training image is generated by selecting a foreground image with green overflow and a corresponding channel transparency image from a green screen data set and synthesizing the foreground image with a green screen background image to obtain the training image.
7. The attention-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the feature encoder comprises a plurality of convolutional layers, and the features output by each convolutional layer are retained as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling the features of the previous layer and concatenating them with the intermediate features;
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
8. The attention mechanism-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the training image is generated by randomly selecting a foreground image F with green overflow and a corresponding channel transparency image α from a green screen matting dataset containing green overflow, randomly selecting a background image B from a green screen background dataset, unifying the resolutions of the three, and generating a composite image C according to a composite formula:
C=F×α+B(1-α)
performing down-sampling processing on the synthetic image C to obtain a low-resolution image C';
the feature extraction process of the feature encoder is shown in the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block;
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling of the void space convolution pooling pyramid module is as follows:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the void space convolution pooling pyramid module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention feature:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, each convolutional layer except the last being followed by a BN layer and a ReLU activation function; before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process of the feature decoder is shown as follows, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder;
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss is:
L_lr = L_lr^α,l1 + L_lr^α,grad + L_lr^α,lap + L_lr^F,l1 + L_lr^F,lap + L_err
wherein the first loss comprises an L1 loss L_lr^α,l1, a gradient loss L_lr^α,grad and a Laplacian loss L_lr^α,lap on the channel transparency map, an L1 loss L_lr^F,l1 and a Laplacian loss L_lr^F,lap on the foreground map, and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-definition channel transparency map α_lr and the channel transparency map label (Ground Truth) α_lr^gt of the low-definition image, where i denotes the pixel position:
L_lr^α,l1 = Σ_i |α_lr,i - α_lr,i^gt|
the gradient loss is as follows, where ∇ denotes the gradient:
L_lr^α,grad = ||∇α_lr - ∇α_lr^gt||_1
the Laplacian loss is as follows, where L_pyr^s denotes the s-th level of the Laplacian pyramid and s denotes the pyramid level:
L_lr^α,lap = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α_lr^gt)||_1
the L1 loss and the Laplacian loss on the foreground map are as follows, where F_lr^gt denotes the foreground map label (Ground Truth) of the low-definition image and F_lr denotes the predicted low-definition foreground map:
L_lr^F,l1 = Σ_i |F_lr,i - F_lr,i^gt|
L_lr^F,lap = Σ_s ||L_pyr^s(F_lr) - L_pyr^s(F_lr^gt)||_1
a standard L2 loss is used for the error map, where err denotes the predicted error map and err^gt denotes the Ground Truth of the error map:
L_err = ||err - err^gt||_2
err^gt = |α_lr - α_lr^gt|
the high-resolution output of the second-stage model is F_hr, α_hr, as shown in the following formula, where DGF denotes the deep guided filter module employed and F_lr, α_lr, hidden denote the low-resolution intermediate results:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the following terms computed on the high-resolution output: an L1 loss L_hr^α,l1, a gradient loss L_hr^α,grad and a Laplacian loss L_hr^α,lap on the channel transparency map, and an L1 loss L_hr^F,l1 and a Laplacian loss L_hr^F,lap on the foreground map:
L_hr = L_lr + L_hr^α,l1 + L_hr^α,grad + L_hr^α,lap + L_hr^F,l1 + L_hr^F,lap
wherein:
L_hr^α,l1 = Σ_i |α_hr,i - α_hr,i^gt|
L_hr^α,grad = ||∇α_hr - ∇α_hr^gt||_1
L_hr^α,lap = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α_hr^gt)||_1
L_hr^F,l1 = Σ_i |F_hr,i - F_hr,i^gt|
L_hr^F,lap = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F_hr^gt)||_1
where α_hr^gt and F_hr^gt denote the channel transparency map label and the foreground map label (Ground Truth) of the training image.
9. a storage medium storing a program for executing the high resolution real-time automatic green screen matting method based on attention mechanism according to any one of claims 1 to 4.
10. A client for the high resolution real-time automatic green screen matting system based on the attention mechanism as claimed in any one of claims 5 to 8.
CN202211029515.5A 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism Pending CN115457266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211029515.5A CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211029515.5A CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Publications (1)

Publication Number Publication Date
CN115457266A true CN115457266A (en) 2022-12-09

Family

ID=84300809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211029515.5A Pending CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Country Status (1)

Country Link
CN (1) CN115457266A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746264A (en) * 2023-12-07 2024-03-22 河北翔拓航空科技有限公司 Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
CN118115734A (en) * 2024-01-25 2024-05-31 浪潮智能终端有限公司 Portrait matting method based on general attention mechanism ESP-CA


Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN109671023B (en) Face image super-resolution secondary reconstruction method
Guo et al. Dense scene information estimation network for dehazing
CN115457266A (en) High-resolution real-time automatic green screen image matting method and system based on attention mechanism
CN113408471B (en) Non-green-curtain portrait real-time matting algorithm based on multitask deep learning
CN112884776B (en) Deep learning matting method based on synthesis data set augmentation
CN111145290B (en) Image colorization method, system and computer readable storage medium
CN114549574A (en) Interactive video matting system based on mask propagation network
Guo et al. Dense123'color enhancement dehazing network
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
CN109191392A (en) A kind of image super-resolution reconstructing method of semantic segmentation driving
CN112489056A (en) Real-time human body matting method suitable for mobile terminal
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN112288630A (en) Super-resolution image reconstruction method and system based on improved wide-depth neural network
CN116664435A (en) Face restoration method based on multi-scale face analysis map integration
Xiao et al. Image hazing algorithm based on generative adversarial networks
CN113240701A (en) Real-time high-resolution opera character matting method under non-green curtain
CN117097853A (en) Real-time image matting method and system based on deep learning
CN114463189A (en) Image information analysis modeling method based on dense residual UNet
WO2023010981A1 (en) Encoding and decoding methods and apparatus
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN113627342B (en) Method, system, equipment and storage medium for video depth feature extraction optimization
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN110111254B (en) Depth map super-resolution method based on multi-stage recursive guidance and progressive supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination