CN115457266A - High-resolution real-time automatic green screen image matting method and system based on attention mechanism - Google Patents

High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Info

Publication number
CN115457266A
Authority
CN
China
Prior art keywords
image
feature
loss
green
matting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211029515.5A
Other languages
Chinese (zh)
Inventor
李兆歆
靳悦
朱登明
石敏
王兆其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Institute of Computing Technology of CAS
Priority to CN202211029515.5A
Publication of CN115457266A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a high-resolution real-time automatic green-screen matting method and system based on an attention mechanism. A deep learning approach that combines a feature extraction model with an attention mechanism removes the need for manual parameter adjustment and enables fully automatic processing. Real-time performance is achieved by using a lightweight encoder, performing green-screen matting at low resolution, and then restoring the original resolution with a high-resolution processing module. To obtain a more refined matting result, skip connections are used to mitigate the loss of low-level features. The network model is trained on a dedicated green-screen matting dataset, so the feature extraction module also learns the features needed to remove green spill.

Description

High-resolution real-time automatic green screen image matting method and system based on attention mechanism
Technical Field
The invention relates to the technical field of image processing, in particular to a high-resolution real-time automatic green screen image matting method based on an attention mechanism.
Background
Green-screen keying refers to shooting pictures or videos against a green screen so that the foreground can later be extracted and composited. Green-screen matting is used in film and television production, including Hollywood blockbusters, and is widely applied in everyday scenarios such as virtual studios and online live streaming. Because of its many application scenarios and high commercial value, green-screen matting has been studied extensively by researchers and companies at home and abroad.
Current green-screen matting methods fall into two main categories: traditional methods and deep-learning-based methods. Traditional green-screen matting includes chroma keying, color-difference keying, luminance keying, and triangulation matting, and most commercial software is built on these techniques. Chroma keying estimates the opacity of each pixel from the chroma difference between the foreground object and a key color provided by the user. Color-difference keying performs green-screen matting from color differences and is mainly suited to scenes in which the background color channel dominates the other two. Luminance keying relies on differences in image brightness and is typically used to extract very bright or self-illuminating foregrounds such as smoke or sparks. Triangulation matting performs green-screen matting by shooting the same foreground against different backgrounds. Deep-learning-based methods are mostly designed for natural-image matting and include trimap-based, background-based, and automatic methods. CF is an affinity-based method that assumes local image regions are smooth, so neighboring pixels with similar colors have similar alpha values; the alpha value of an unknown pixel can therefore be estimated by sampling its known foreground and background neighbors. KNN is also an affinity-based method that exploits local smoothness to propagate alpha values from known regions to unknown regions. FBA proposes low-cost modifications to alpha-matting networks to additionally predict foreground and background colors. LFP learns long-range context features beyond the receptive field. BGMV2 (mobile) requires a background image, aligned with the input image, as an additional input. MODNet is an automatic real-time method for human portraits, but the resolution at which it runs in real time is low. AIM designs a unified semantic representation module to guide matting and produce more accurate results.
Most commercial software for green-screen matting uses traditional methods, but these involve a large number of parameters that must be tuned by professionals, making automatic processing difficult; moreover, the higher the resolution, the longer the processing time, so real-time speed is hard to achieve. Among deep-learning-based methods, both trimap-based and background-based methods require the user to supply a trimap or background image as a prior; they are accurate but hard to automate. Automatic deep-learning methods can achieve automatic, high-resolution, real-time processing, but they are designed for natural-image matting and, when applied to green-screen footage, cannot remove the green cast (green spill) produced by light reflected from the green screen onto the subject.
Disclosure of Invention
To solve these problems, the invention provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism. A lightweight MobileNet model is combined with an attention module to extract features effectively, removing the need for manual adjustment and enabling automatic processing. Real-time processing at high resolution is achieved with a lightweight feature encoder and a high-resolution processing module. The network is trained on a dedicated green-screen matting dataset so that the model learns green-spill-related features and effectively removes green spill.
While studying high-resolution real-time automatic green-screen matting, the inventors found that traditional methods in the prior art generally require professionals to tune a large number of parameters. Research on natural-image matting showed that this drawback can be overcome with deep learning: combining a feature extraction model with an attention mechanism removes manual adjustment and enables automatic processing. Both traditional methods and existing deep-learning methods are generally slow on high-resolution material and struggle to run in real time; this drawback is overcome by using a lightweight encoder, performing green-screen matting at low resolution, and then restoring the original resolution with a high-resolution processing module. To achieve a more refined matting result, skip connections are used to mitigate the loss of low-level features. Existing deep-learning methods were also found to have difficulty removing green spill; this drawback is overcome by training the network model on a dedicated green-screen matting dataset, so the feature extraction module also learns the features needed to remove green spill.
The invention specifically provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism, comprising the following steps:
step 1, constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
step 2, the feature encoder downsampling a training image to obtain a low-resolution image, extracting image features from the low-resolution image, and generating intermediate features during extraction;
step 3, the ASPP module sampling the image features in parallel with atrous convolutions of different sampling rates, and the attention module performing feature extraction on the sampling result to obtain attention features;
step 4, the feature decoder decoding the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map (alpha map) of the low-resolution image;
step 5, taking the foreground map label and channel transparency map label of the low-resolution image as training targets, constructing a first loss based on the intermediate result, and training the neural network model;
step 6, adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
step 7, taking the foreground map label and channel transparency map label of the training image as training targets, constructing a second loss based on the matting result, and training the matting model; and inputting a green-screen image into the trained matting model to obtain a foreground map and a channel transparency map as the matting result of the green-screen image.
According to the attention-based high-resolution real-time automatic green-screen matting method, the training image is generated by selecting, from a green-screen dataset, a foreground image with green spill and the corresponding channel transparency map, and compositing them with a green-screen background image to obtain the training image.
In the attention-based high-resolution real-time automatic green-screen matting method, the feature encoder comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as the intermediate features; the feature decoder comprises a plurality of convolutional layers, each of which upsamples the features of the previous layer and concatenates them with the intermediate features;
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
In the attention-based high-resolution real-time automatic green-screen matting method, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a high-resolution real-time automatic green screen image matting system based on an attention mechanism, which comprises the following components:
the module 1 is used for constructing a neural network model comprising a feature encoder, a cavity space convolution pooling pyramid module, an attention module and a feature decoder;
a module 2, configured to sample the training image by the feature encoder to obtain a low-definition image, extract image features of the low-definition image, and generate an intermediate feature in an extraction process;
the module 3 is used for enabling the void space convolution pooling pyramid module to carry out parallel sampling on the image features by void convolution with different sampling rates, and the attention module carries out feature extraction on sampling results to obtain attention features;
a module 4, configured to enable the feature decoder to decode the attention feature according to the intermediate feature, so as to obtain an intermediate result including the foreground map of the low-definition image after eliminating green overflow and the channel transparency map of the low-definition image;
a module 5, configured to construct a first loss based on the intermediate result and train the neural network model, with foreground map labels and channel transparent map labels of the low-definition image as training targets;
a module 6, configured to add a high resolution processing module to the neural network model output after the training is completed to obtain a matting model, where the high resolution processing module restores the intermediate result to a resolution that is the same as that of the training image to obtain a matting result that includes a foreground image of the training image after the green overflow is eliminated and a channel transparency image of the training;
a module 7, configured to construct a second loss based on the matting result by using foreground image labels and channel transparency image labels of the training image as training targets, and train the matting model; and inputting the green screen image into the trained image matting model to obtain a foreground image and a channel transparent image as the image matting result of the green screen image.
The high-resolution real-time automatic green curtain image matting system based on the attention mechanism is characterized in that the generation process of the training image is to select a foreground image with green overflow and a corresponding channel transparent image in a green curtain data set and synthesize the foreground image with a green curtain background image to obtain the training image.
The high-resolution real-time automatic green screen matting system based on the attention mechanism is characterized in that the feature encoder comprises a plurality of convolution layers, and features behind each convolution layer are reserved as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling a feature of a previous layer and connecting with the intermediate feature;
the intermediate results also include an error map and implicit feature, hidden, of the low-definition image, the first penalties including L1 penalty, gradient penalty, and laplacian penalty for the channel transparency map, L1 penalty and laplacian penalty for the foreground map, and L2 penalty for the error map.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a storage medium for storing a program for executing any one of the high-resolution real-time automatic green curtain image matting methods based on the attention mechanism.
The invention also provides a client used for the high-resolution real-time automatic green screen image matting system based on the attention mechanism.
Compared with the prior art, the invention has the following advantages:
The method is particularly effective for real-time automatic green-screen matting at high resolution. A qualitative comparison with recent matting methods CF, KNN, FBA, MODNet, LFP, BGMV2 (mobile) and AIM is shown in FIG. 2, and a quantitative comparison is given in Table 1. Extensive comparative experiments demonstrate the effectiveness of the proposed solution.
The proposed scheme is compared with these seven matting methods. CF and KNN are traditional, trimap-based matting methods; the remaining five are deep-learning-based, of which FBA and LFP are trimap-based, BGMV2 (mobile) is background-based, and MODNet and AIM are automatic matting methods. As shown in FIG. 2, the CF method performs poorly on holes; KNN performs better but cannot remove green spill. FBA and LFP produce finer results but also have difficulty removing green spill, and because they require a trimap as auxiliary input they cannot be used for video matting or run in real time. BGMV2 (mobile) depends on the quality of the input background and is therefore more limited. AIM and MODNet are automatic, but their results are not of high quality, and MODNet only handles portraits, which is a significant limitation. The proposed method achieves a realistic green-screen matting result. As shown in Table 1, the comparison uses several metrics. The proposed method performs well on every metric; although FBA and LFP score better on some metrics, both are trimap-based, require additional manual input, cannot run in real time, and cannot remove green spill. In addition, as shown in Table 2, these two models have many parameters and are slow. In terms of speed, as shown in Table 1, the proposed method has fewer model parameters than other deep-learning-based methods and achieves a matting speed of about 80 fps at 2K resolution (running on a server with an Nvidia GeForce RTX 3090). In summary, the proposed attention-based high-resolution real-time automatic green-screen matting method is highly effective.
Table 1: Quantitative comparison with other matting methods (table provided as an image in the original document)
Table 2: Comparison of model parameters and sizes (table provided as an image in the original document)
Drawings
FIG. 1 is a diagram of a network model architecture;
FIG. 2 is a qualitative comparison of the matting method of the present invention with other matting methods.
Detailed Description
The invention aims to solve the problems that most existing methods require human participation and are therefore difficult to automate, and that processing speed is low at high resolution. It provides a high-resolution real-time automatic green-screen matting method based on an attention mechanism, comprising the following steps:
step 1, reading an input green-screen image or green-screen video frame and downsampling it;
step 2, building a feature encoder and extracting features from the downsampled green-screen image;
step 3, adding an atrous spatial pyramid pooling (ASPP) module and an attention (SE) module after the encoder to further process the extracted features;
step 4, building a feature decoder to process the acquired feature information, obtain high-level features rich in semantic information and relevant to matting the object, and restore the feature map by upsampling to the downsampled resolution of step 1;
step 5, forming a first-stage training model from the feature encoder, the ASPP module, the attention module and the feature decoder, the model outputting an intermediate result and being trained with a first-stage loss function;
step 6, in the second stage, performing high-resolution processing on the intermediate result output by the first stage so as to restore the original resolution;
step 7, training the second-stage model, i.e. training the DGF module while also updating the first-stage training model, to obtain the final green-screen matting result.
in the step 1, the green screen image or the green screen video frame is synthesized into the green screen image in the training stage, specifically, a corresponding foreground image and an Alpha image with green overflow are randomly selected from the special green screen data set, and a green screen background image is randomly selected for synthesis. In the test phase, the green screen image or video frame actually shot by the camera is referred to. The down-sampling adopts a bilinear method to reduce the resolution to half of the original resolution.
Building the feature encoder in step 2 means building an encoder based on the MobileNetV2 model provided officially by PyTorch. The encoder also performs downsampling; it comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as intermediate features for the subsequent skip connections.
The ASPP module in step 3 is an atrous spatial pyramid pooling module, which enlarges the network's receptive field without changing the feature resolution. The attention module uses SENet, which allows the model to focus on the relationships between channels and automatically learn the importance of the different channel features.
Building the feature decoder in step 4 means building four convolutional layers; each convolutional layer upsamples the features of the previous layer and is connected (skip connection) with the corresponding intermediate features from step 2. Upsampling uses bilinear interpolation to double the resolution.
The intermediate result in step 5 comprises the foreground map with green spill removed, the Alpha map, the error map, and the hidden feature, all at low resolution. The loss function includes an L1 loss, a gradient loss and a Laplacian loss on the Alpha map, an L1 loss and a Laplacian loss on the foreground map, and a loss on the error map; the error map helps improve the fineness of the matting. The foreground map is the target image; the Alpha map corresponds to the black-and-white maps in FIG. 2; the error map is the difference between the predicted Alpha map and its ground truth; and the hidden feature is a redundant output feature used for subsequent high-resolution processing.
The high-resolution processing in step 6 adds a deep guided filtering (DGF) module after the first-stage model as the high-resolution processing module.
In the loss function used to train the second-stage model in step 7, each loss term is additionally computed at high resolution. The final green-screen matting result is the foreground map with green spill removed and the corresponding Alpha map at the original resolution.
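As an illustration only, the two-stage structure described in steps 1-7 above could be wired together as in the following PyTorch-style sketch; the module interfaces (an encoder returning shortcut features, a decoder returning four outputs, a DGF-style high-resolution module) are assumptions for illustration and not the patent's reference implementation.

import torch.nn as nn
import torch.nn.functional as F

class GreenScreenMattingNet(nn.Module):
    """Illustrative two-stage wiring: low-resolution matting followed by
    a high-resolution refinement module (interfaces are hypothetical)."""
    def __init__(self, encoder, aspp, se_block, decoder, hr_module):
        super().__init__()
        self.encoder = encoder      # e.g. MobileNetV2 feature extractor
        self.aspp = aspp            # atrous spatial pyramid pooling
        self.se = se_block          # squeeze-and-excitation attention
        self.decoder = decoder      # upsampling decoder with skip connections
        self.hr = hr_module         # e.g. deep guided filtering module

    def forward(self, image_hr):
        # Stage 1: matting at half resolution
        image_lr = F.interpolate(image_hr, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)
        feats, shortcuts = self.encoder(image_lr)
        feats = self.se(self.aspp(feats))
        fg_lr, alpha_lr, err_lr, hidden = self.decoder(feats, shortcuts)
        # Stage 2: restore the original resolution
        fg_hr, alpha_hr = self.hr(image_hr, image_lr, fg_lr, alpha_lr, hidden)
        return fg_hr, alpha_hr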
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment is a high-resolution real-time automatic green-screen matting method based on an attention mechanism. The main experimental environment is an Nvidia GeForce RTX 3090 GPU with PyTorch. The method is implemented in the following steps:
the method comprises the following steps:
A foreground image F with green spill and the corresponding Alpha map α are randomly selected from a green-screen matting dataset containing green spill, and a background image B is randomly selected from a green-screen background dataset. After the three are brought to a common resolution (H, W), a composite image C is generated according to the following compositing formula:
C = F × α + B × (1 - α) (1)
The composite image is then downsampled to half of its original resolution, i.e. to (0.5H, 0.5W), where DownSample() in formula (2) denotes the downsampling method and C′ denotes the downsampled image; C′ is used as the input of the network model.
C′=DownSample(C) (2)
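A minimal sketch of formulas (1) and (2), assuming the foreground, background and alpha are PyTorch tensors in [0, 1] with matching spatial size; the function name is illustrative.

import torch.nn.functional as Fn

def composite_and_downsample(fg, alpha, bg):
    """Compose a synthetic green-screen image C = F*alpha + B*(1-alpha)
    (formula (1)) and halve its resolution bilinearly (formula (2)).
    fg, bg: (N, 3, H, W) tensors in [0, 1]; alpha: (N, 1, H, W)."""
    comp = fg * alpha + bg * (1.0 - alpha)
    comp_lr = Fn.interpolate(comp, scale_factor=0.5,
                             mode="bilinear", align_corners=False)
    return comp, comp_lr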
Step two:
A first-stage network model is built, which performs green-screen matting at low resolution. The lightweight MobileNet is combined with an attention mechanism as the feature extraction encoder; a decoder is then built to match the encoder, and the decoder takes as input the features of the corresponding encoder layer, combined with the features of the previous decoder layer through skip connections. The decoder outputs the foreground map, the corresponding Alpha map, the error map, and the hidden feature for subsequent use.
a. Feature encoding
To achieve real-time processing speed, the feature encoder module uses the lightweight MobileNet as the feature extractor, which improves extraction speed while maintaining extraction quality. The MobileNetV2 model officially released with PyTorch is adopted and modified: dilation is used in the last module so that the output stride remains 16, and the classifier module originally used for classification is removed. This is expressed by the following formula, where Feature_m denotes the features extracted after the convolution modules of MobileNetV2 and Shortcuts denotes the intermediate features of each convolution block, retained for the subsequent skip connections.
Feature_m, Shortcuts = MobileNetV2(C′) (3)
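The following sketch shows one way to obtain MobileNetV2 features with retained intermediate features using torchvision; the block indices at which the backbone is split are illustrative assumptions, and the dilation modification that keeps the output stride at 16 is not reproduced here.

import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Encoder(nn.Module):
    """Wraps torchvision's MobileNetV2 feature stack and keeps intermediate
    features for skip connections. The split points below are assumptions."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v2(weights=None).features  # pretrained weights optional
        # Split the feature stack into stages; indices chosen for illustration.
        self.stages = nn.ModuleList([
            backbone[:2], backbone[2:4], backbone[4:7], backbone[7:14], backbone[14:],
        ])

    def forward(self, x):
        shortcuts = []
        for stage in self.stages:
            x = stage(x)
            shortcuts.append(x)
        # final features plus the earlier stages' outputs for skip connections
        return x, shortcuts[:-1]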
To enhance the network's ability to capture multi-scale context, an ASPP module is added after the MobileNet encoder. ASPP (atrous spatial pyramid pooling) samples the given input in parallel with atrous convolutions at different sampling rates, which is equivalent to capturing image context at multiple scales, i.e. it enlarges the network's receptive field without changing the resolution. This is expressed by the following formula, where Feature_aspp denotes the features after atrous spatial pyramid pooling.
Feature_aspp = ASPP(Feature_m) (4)
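A minimal ASPP sketch with parallel atrous convolutions; the dilation rates and channel widths are assumptions, not values specified by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: a 1x1 branch, several 3x3 branches with
    different dilation rates, and a global-pooling branch, concatenated and
    fused by a 1x1 projection. Rates (3, 6, 9) are an assumption."""
    def __init__(self, in_ch, out_ch, rates=(3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))] +
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
             for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = Fn.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                                align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))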
b. Attention mechanism
To achieve a more accurate matting result and extract the foreground object precisely, the proposed network adds an attention mechanism after the initial feature extraction. The basic idea of attention in computer vision is to let the model ignore irrelevant information and focus on key information. Among the various attention methods, this work adopts SENet (Squeeze-and-Excitation Network) to extract attention features. The SENet module focuses on the relationships between channels and lets the model automatically learn the importance of the different channel features. An SE module consists of two steps, squeeze and excitation: squeeze applies global average pooling on the feature map to obtain a global descriptor per channel; excitation passes it through a two-layer fully connected bottleneck to obtain a weight for each channel, and the reweighted feature map is used as the input of the next layer. This is expressed by the following formula, where Feature_se denotes the features after attention extraction.
Feature_se = SE(Feature_aspp) (5)
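A minimal squeeze-and-excitation block as described above; the reduction ratio of 16 is the common SENet default, assumed here for illustration.

import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling ("squeeze") followed by
    a two-layer bottleneck producing per-channel weights ("excitation")."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (N, C)
        w = self.fc(w).view(n, c, 1, 1)  # excitation: per-channel weights
        return x * w                     # reweight the feature map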
c. Feature decoding
The decoder network consists of four convolutional layers; every layer except the last is followed by a BN layer and a ReLU activation function. Before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the skip-connection features from the encoder. Taking the first convolutional layer of the decoder as an example, this is expressed by the following formulas, where UpSample() denotes the upsampling method used, Concat denotes feature concatenation, Shortcuts_i denotes the encoder features of the corresponding skip connection, and ConvBlock1 denotes the first convolution module of the decoder.
Feature_up = UpSample(Feature_se) (6)
Feature_cat = Concat(Feature_up, Shortcuts_i) (7)
Feature_conv1 = ConvBlock1(Feature_cat) (8)
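A sketch of one decoder step corresponding to formulas (6)-(8): bilinear upsampling, concatenation with the encoder skip feature, then a Conv-BN-ReLU block; channel sizes are left as parameters.

import torch
import torch.nn as nn
import torch.nn.functional as Fn

class DecoderBlock(nn.Module):
    """One decoder step: upsample, concatenate with the skip feature,
    then apply a Conv + BN + ReLU block."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = Fn.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                           align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)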
d. Model output and loss function
The decoder's output is divided into four items: the foreground map with green spill removed, the corresponding Alpha map, the error map, and the hidden feature for subsequent use. The output resolution is the low resolution (0.5H, 0.5W). The corresponding loss function is as follows:
The loss function L_lr is:
L_lr = L_l1^α + L_grad^α + L_lap^α + L_l1^F + L_lap^F + L_err (9)
wherein:
For the Alpha map, an L1 loss is first used to measure the difference between the predicted Alpha map α and its ground truth α*, where i denotes the pixel position.
L_l1^α = Σ_i |α_i - α*_i| (10)
The second loss is the gradient loss, where ∇ denotes the gradient operator.
L_grad^α = Σ_i |∇α_i - ∇α*_i| (11)
The third loss is the pyramidal Laplacian loss, where L_pyr^s denotes the Laplacian pyramid at level s.
L_lap^α = Σ_s ||L_pyr^s(α) - L_pyr^s(α*)||_1 (12)
The standard L1 loss and Laplacian loss are also used for the predicted foreground F, where F* denotes its ground truth. Only the visible foreground contributes to the loss, i.e. only pixels where the ground-truth alpha α* is greater than 0.
L_l1^F = Σ_i 1(α*_i > 0) |F_i - F*_i| (13)
L_lap^F = Σ_s ||L_pyr^s(F) - L_pyr^s(F*)||_1 (14)
A standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth.
L_err = ||err - err*||_2 (15)
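A sketch of the first-stage loss terms described above, assuming unit weights between terms and a simple average-pooling Laplacian pyramid; the patent does not specify these details.

import torch.nn.functional as Fn

def _laplacian_pyramid(x, levels=5):
    """Simple Laplacian pyramid built with average pooling; the pyramid
    construction is not fixed by the patent, so this is an assumption."""
    pyr, cur = [], x
    for _ in range(levels):
        down = Fn.avg_pool2d(cur, 2)
        up = Fn.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                            align_corners=False)
        pyr.append(cur - up)
        cur = down
    return pyr

def _grads(t):
    # horizontal and vertical finite differences
    return t[..., :, 1:] - t[..., :, :-1], t[..., 1:, :] - t[..., :-1, :]

def first_stage_loss(alpha, alpha_gt, fg, fg_gt, err, err_gt):
    """L_lr as described above: L1 + gradient + Laplacian losses on the alpha
    map, L1 + Laplacian losses on the foreground (restricted to pixels where
    the ground-truth alpha is positive, approximated here by zeroing the
    rest), and an L2 loss on the error map. Unit weights are an assumption."""
    l_alpha = Fn.l1_loss(alpha, alpha_gt)
    (gx, gy), (gx_gt, gy_gt) = _grads(alpha), _grads(alpha_gt)
    l_grad = Fn.l1_loss(gx, gx_gt) + Fn.l1_loss(gy, gy_gt)
    l_lap = sum(Fn.l1_loss(p, q) for p, q in
                zip(_laplacian_pyramid(alpha), _laplacian_pyramid(alpha_gt)))
    mask = (alpha_gt > 0).float()
    fg_m, fg_gt_m = fg * mask, fg_gt * mask
    l_fg = Fn.l1_loss(fg_m, fg_gt_m)
    l_fg_lap = sum(Fn.l1_loss(p, q) for p, q in
                   zip(_laplacian_pyramid(fg_m), _laplacian_pyramid(fg_gt_m)))
    l_err = Fn.mse_loss(err, err_gt)
    return l_alpha + l_grad + l_lap + l_fg + l_fg_lap + l_err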
Step three:
training is carried out on the basis of the first-stage model, after the model converges, a high-resolution processing module DGF is added on the basis of the first-stage model, and training is continued until the model converges.
a. First stage model training
Training uses the first-stage network model built in step two. The training data are a green-screen matting dataset and a green-screen background dataset; the ground truth of the green-screen matting dataset comprises the original green-screen image, the foreground map with green spill removed, the corresponding Alpha map, and the foreground map with green spill. The foreground map with green spill, the corresponding Alpha map, and a randomly selected green-screen background image are each given random data augmentation and then composited; the augmentation operations include flipping, rotation, translation, and hue, brightness and saturation changes. The composite image is downsampled as required and fed into the first-stage network model for training until convergence.
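A sketch of the paired augmentation step, applying the same geometric transform to the foreground and its Alpha map and photometric jitter to foreground and background separately; the parameter ranges are assumptions.

import random
import torchvision.transforms.functional as TF

def augment_pair(fg, alpha, bg):
    """Randomly augment a foreground/alpha pair and a green-screen background
    before compositing. The operations mirror those listed above (flip,
    rotation, translation, colour jitter); ranges are illustrative."""
    if random.random() < 0.5:                      # horizontal flip
        fg, alpha = TF.hflip(fg), TF.hflip(alpha)
    angle = random.uniform(-10, 10)                # small rotation
    dx, dy = random.randint(-20, 20), random.randint(-20, 20)
    fg = TF.affine(fg, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0])
    alpha = TF.affine(alpha, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0])
    # photometric jitter applied to foreground and background independently
    fg = TF.adjust_brightness(fg, random.uniform(0.9, 1.1))
    fg = TF.adjust_saturation(fg, random.uniform(0.9, 1.1))
    fg = TF.adjust_hue(fg, random.uniform(-0.02, 0.02))
    bg = TF.adjust_brightness(bg, random.uniform(0.9, 1.1))
    return fg, alpha, bg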
b. High resolution module optimization
The first-stage network model already realizes automatic attention-based green-screen matting, but it is difficult to reach real-time speed when processing high-resolution material. Model training is therefore divided into two stages: the first stage performs automatic green-screen matting at low resolution, and the second stage adds the lightweight high-resolution processing module DGF once the first stage has converged, restoring the low-resolution result to high resolution and thereby achieving real-time automatic green-screen matting at high resolution.
The high-resolution processing module adopts a deep guided filtering (DGF) module. The traditional guided filtering algorithm achieves the edge-preserving smoothing of bilateral filtering while behaving well near detected edges, and is applied in scenarios such as image enhancement, HDR compression, matting and dehazing. Deep guided filtering is a guided filtering method implemented with deep learning, which can efficiently generate a high-resolution output from the corresponding low-resolution output and a high-resolution guide image. This is expressed by the following formula, where DGF denotes the deep guided filtering module employed, F_lr, α_lr denote the first-stage model output at low resolution, and F_hr, α_hr denote the second-stage model output at high resolution.
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden) (16)
c. Model output and loss function
The output of the second-stage model is the high-resolution foreground map with green spill removed and the corresponding Alpha map. In addition to the first-stage loss terms, the loss function includes the corresponding terms at high resolution, as shown below.
L_l1^α,hr = Σ_i |α_hr,i - α*_hr,i| (17)
L_grad^α,hr = Σ_i |∇α_hr,i - ∇α*_hr,i| (18)
L_lap^α,hr = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α*_hr)||_1 (19)
L_l1^F,hr = Σ_i 1(α*_hr,i > 0) |F_hr,i - F*_hr,i| (20)
L_lap^F,hr = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F*_hr)||_1 (21)
The loss function L_hr is:
L_hr = L_lr + L_l1^α,hr + L_grad^α,hr + L_lap^α,hr + L_l1^F,hr + L_lap^F,hr (22)
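A sketch of the two-stage training schedule described in step three, assuming an Adam optimiser, fixed epoch counts in place of the convergence criterion, and a loader that yields high- and low-resolution images with their targets; all of these are illustrative assumptions.

import torch

def train_two_stage(stage1, dgf, loader, loss_lr, loss_hr,
                    epochs1, epochs2, lr=1e-4):
    """Two-stage schedule: train the low-resolution model first, then attach
    the DGF module and keep updating both. Optimiser, learning rate and
    loader format are assumptions, not values from the patent."""
    opt1 = torch.optim.Adam(stage1.parameters(), lr=lr)
    for _ in range(epochs1):                       # stage 1: low resolution only
        for image_hr, image_lr, targets_lr, _ in loader:
            out_lr = stage1(image_lr)              # (fg_lr, alpha_lr, err, hidden)
            loss = loss_lr(out_lr, targets_lr)
            opt1.zero_grad(); loss.backward(); opt1.step()

    opt2 = torch.optim.Adam(list(stage1.parameters()) + list(dgf.parameters()), lr=lr)
    for _ in range(epochs2):                       # stage 2: add high-res refinement
        for image_hr, image_lr, targets_lr, targets_hr in loader:
            fg_lr, alpha_lr, err, hidden = stage1(image_lr)
            fg_hr, alpha_hr = dgf(image_hr, image_lr, fg_lr, alpha_lr, hidden)
            loss = (loss_lr((fg_lr, alpha_lr, err, hidden), targets_lr)
                    + loss_hr((fg_hr, alpha_hr), targets_hr))
            opt2.zero_grad(); loss.backward(); opt2.step()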
the following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above embodiments.
The invention also provides a high-resolution real-time automatic green-screen matting system based on an attention mechanism, comprising:
module 1, for constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
module 2, for having the feature encoder downsample a training image to obtain a low-resolution image, extract image features from the low-resolution image, and generate intermediate features during extraction;
module 3, for having the ASPP module sample the image features in parallel with atrous convolutions of different sampling rates, the attention module performing feature extraction on the sampling result to obtain attention features;
module 4, for having the feature decoder decode the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map of the low-resolution image;
module 5, for constructing a first loss based on the intermediate result, with the foreground map label and channel transparency map label of the low-resolution image as training targets, and training the neural network model;
module 6, for adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
module 7, for constructing a second loss based on the matting result, with the foreground map label and channel transparency map label of the training image as training targets, and training the matting model; a green-screen image is then input into the trained matting model, and the resulting foreground map and channel transparency map are taken as the matting result of the green-screen image.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by selecting, from a green-screen dataset, a foreground image with green spill and the corresponding channel transparency map, and compositing them with a green-screen background image to obtain the training image.
In the attention-based high-resolution real-time automatic green-screen matting system, the feature encoder comprises a plurality of convolutional layers, and the features after each convolutional layer are retained as the intermediate features; the feature decoder comprises a plurality of convolutional layers, each of which upsamples the features of the previous layer and concatenates them with the intermediate features;
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
In the attention-based high-resolution real-time automatic green-screen matting system, the training image is generated by randomly selecting a foreground image F with green spill and the corresponding channel transparency map α from a green-screen matting dataset containing green spill, randomly selecting a background image B from a green-screen background dataset, bringing the three to a common resolution, and generating a composite image C according to the compositing formula:
C = F × α + B × (1 - α)
the composite image C is downsampled to obtain a low-resolution image C′;
the feature extraction process of the feature encoder is given by the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block:
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling performed by the atrous spatial pyramid pooling module is:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the ASPP module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention features:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, every convolutional layer except the last being followed by a BN layer and a ReLU activation function, and before each convolutional layer bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process is given by the following formulas, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder:
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and a hidden feature of the low-resolution image, and the first loss L_lr combines a loss on the channel transparency map (an L1 loss L_l1^α, a gradient loss L_grad^α and a Laplacian loss L_lap^α), a loss on the foreground map (an L1 loss L_l1^F and a Laplacian loss L_lap^F), and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-resolution channel transparency map α_lr and its ground truth α*_lr, where i denotes the pixel position:
L_l1^α = Σ_i |α_lr,i - α*_lr,i|
the gradient loss is defined analogously on the gradients, where ∇ denotes the gradient operator:
L_grad^α = Σ_i |∇α_lr,i - ∇α*_lr,i|
the Laplacian loss is computed over a Laplacian pyramid, where L_pyr^s denotes the Laplacian pyramid operator and s the pyramid level:
L_lap^α = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α*_lr)||_1
for the foreground, F*_lr denotes the ground-truth foreground map of the low-resolution image and F_lr the predicted low-resolution foreground map; an L1 loss L_l1^F and a Laplacian loss L_lap^F of the same form are applied to F_lr and F*_lr; a standard L2 loss is used for the error map, where err denotes the predicted error map and err* its ground truth:
L_err = ||err - err*||_2
the output of the second-stage model at high resolution is F_hr, α_hr, as given by the following formula, where DGF denotes the deep guided filtering module employed and F_lr, α_lr, hidden denote the low-resolution intermediate result:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the corresponding losses on the high-resolution output: an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map α_hr, and an L1 loss and a Laplacian loss on the foreground map F_hr; these terms have the same form as the corresponding low-resolution losses, with α_hr, F_hr and their ground truths α*_hr, F*_hr in place of the low-resolution quantities.
the invention also provides a storage medium for storing a program for executing any one of the high-resolution real-time automatic green curtain image matting methods based on the attention mechanism.
The invention also provides a client used for the high-resolution real-time automatic green screen image matting system based on the attention mechanism.

Claims (10)

1. A high-resolution real-time automatic green-screen matting method based on an attention mechanism, characterized by comprising the following steps:
step 1, constructing a neural network model comprising a feature encoder, an atrous spatial pyramid pooling (ASPP) module, an attention module and a feature decoder;
step 2, the feature encoder downsampling a training image to obtain a low-resolution image, extracting image features from the low-resolution image, and generating intermediate features during extraction;
step 3, the ASPP module sampling the image features in parallel with atrous convolutions of different sampling rates, and the attention module performing feature extraction on the sampling result to obtain attention features;
step 4, the feature decoder decoding the attention features according to the intermediate features to obtain an intermediate result comprising a foreground map of the low-resolution image with green spill removed and a channel transparency map of the low-resolution image;
step 5, taking the foreground map label and channel transparency map label of the low-resolution image as training targets, constructing a first loss based on the intermediate result, and training the neural network model;
step 6, adding a high-resolution processing module to the trained neural network model to obtain a matting model, the high-resolution processing module restoring the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground map of the training image with green spill removed and a channel transparency map of the training image;
step 7, taking the foreground map label and channel transparency map label of the training image as training targets, constructing a second loss based on the matting result, and training the matting model; and inputting a green-screen image into the trained matting model to obtain a foreground map and a channel transparency map as the matting result of the green-screen image.
2. The attention-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the training image is generated by selecting a foreground image with green overflow and a corresponding channel transparency image from a green screen data set and synthesizing the foreground image with a green screen background image to obtain the training image.
3. The attention-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the feature encoder includes a plurality of convolutional layers, and the features output by each convolutional layer are retained as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling the features of the previous layer and concatenating them with the intermediate features;
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
4. The attention mechanism-based high-resolution real-time automatic green screen matting method according to claim 1, wherein the training image generation process is to randomly select a foreground image F with green overflow and a corresponding channel transparency image α from a green screen matting data set containing green overflow, randomly select a background image B of a green screen background data set, unify resolutions of the three, and generate a composite image C according to a composite formula:
C=F×α+B(1-α)
performing down-sampling processing on the synthetic image C to obtain a low-resolution image C';
the feature extraction process of the feature encoder is shown in the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block;
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling of the cavity space convolution pooling pyramid module is as follows:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the cavity space convolution pooling pyramid module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention feature:
Feature_se = SE(Feature_aspp)
the feature decoder includes a plurality of convolutional layers, each convolutional layer except the last being followed by a BN layer and a ReLU activation function; before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process of the feature decoder is shown as follows, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder;
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss is:
L_lr = L_lr^α,l1 + L_lr^α,grad + L_lr^α,lap + L_lr^F,l1 + L_lr^F,lap + L_err
wherein the first loss comprises an L1 loss L_lr^α,l1, a gradient loss L_lr^α,grad and a Laplacian loss L_lr^α,lap on the channel transparency map, an L1 loss L_lr^F,l1 and a Laplacian loss L_lr^F,lap on the foreground map, and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-definition channel transparency map α_lr and the channel transparency map label (Ground Truth) α_lr^gt of the low-definition image, where i denotes the pixel position:
L_lr^α,l1 = Σ_i |α_lr,i - α_lr,i^gt|
the gradient loss is as follows, where ∇ denotes the gradient:
L_lr^α,grad = ||∇α_lr - ∇α_lr^gt||_1
the Laplacian loss is as follows, where L_pyr^s denotes the s-th level of the Laplacian pyramid and s denotes the pyramid level:
L_lr^α,lap = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α_lr^gt)||_1
the L1 loss and the Laplacian loss on the foreground map are as follows, where F_lr^gt denotes the foreground map label (Ground Truth) of the low-definition image and F_lr denotes the predicted low-definition foreground map:
L_lr^F,l1 = Σ_i |F_lr,i - F_lr,i^gt|
L_lr^F,lap = Σ_s ||L_pyr^s(F_lr) - L_pyr^s(F_lr^gt)||_1
a standard L2 loss is used for the error map, where err denotes the predicted error map and err^gt denotes the Ground Truth of the error map:
L_err = ||err - err^gt||_2
err^gt = |α_lr - α_lr^gt|
the high-resolution output of the second-stage model is F_hr, α_hr, as shown in the following formula, where DGF denotes the deep guided filter module employed and F_lr, α_lr, hidden denote the low-resolution intermediate results:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the following terms computed on the high-resolution output: an L1 loss L_hr^α,l1, a gradient loss L_hr^α,grad and a Laplacian loss L_hr^α,lap on the channel transparency map, and an L1 loss L_hr^F,l1 and a Laplacian loss L_hr^F,lap on the foreground map:
L_hr = L_lr + L_hr^α,l1 + L_hr^α,grad + L_hr^α,lap + L_hr^F,l1 + L_hr^F,lap
wherein:
L_hr^α,l1 = Σ_i |α_hr,i - α_hr,i^gt|
L_hr^α,grad = ||∇α_hr - ∇α_hr^gt||_1
L_hr^α,lap = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α_hr^gt)||_1
L_hr^F,l1 = Σ_i |F_hr,i - F_hr,i^gt|
L_hr^F,lap = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F_hr^gt)||_1
where α_hr^gt and F_hr^gt denote the channel transparency map label and the foreground map label (Ground Truth) of the training image.
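As an illustration of the decoding step defined by the formulas in claim 4 (bilinear upsampling, concatenation with Shortcuts_i, then a convolution block followed by BN and ReLU), a minimal sketch follows; the kernel size and channel counts are assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Conv + BN + ReLU block (ConvBlock_i in the claim's notation).
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, shortcut):
        # UpSample(Feature_se): bilinear upsampling to the skip-feature size.
        x = nnf.interpolate(x, size=shortcut.shape[-2:], mode="bilinear",
                            align_corners=False)
        # Concat(Feature_up, Shortcuts_i): channel-wise concatenation.
        x = torch.cat([x, shortcut], dim=1)
        # ConvBlock_i(Feature_cat).
        return self.conv(x)
```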
5. a high-resolution real-time automatic green screen matting system based on an attention mechanism is characterized by comprising:
the module 1 is used for constructing a neural network model comprising a feature encoder, a cavity space convolution pooling pyramid module, an attention module and a feature decoder;
a module 2, configured to enable the feature encoder to downsample the training image to obtain a low-definition image, extract the image features of the low-definition image and generate intermediate features during the extraction process;
a module 3, configured to enable the void space convolution pooling pyramid module to perform parallel sampling on the image features with dilated (atrous) convolutions at different sampling rates, and enable the attention module to perform feature extraction on the sampling result to obtain an attention feature;
a module 4, configured to enable the feature decoder to decode the attention feature according to the intermediate feature, so as to obtain an intermediate result that includes the foreground map of the low-definition image after green overflow is eliminated and the channel transparency map of the low-definition image;
a module 5, configured to construct a first loss based on the intermediate result and train the neural network model, with a foreground map label and a channel transparent map label of the low-definition image as training targets;
a module 6, configured to add a high-resolution processing module to the trained neural network model to obtain a matting model, wherein the high-resolution processing module restores the intermediate result to the same resolution as the training image to obtain a matting result comprising a foreground image of the training image with green overflow eliminated and a channel transparency image of the training image;
a module 7, configured to construct a second loss based on the matting result by using the foreground image label and the channel transparency image label of the training image as training targets, and train the matting model; and inputting the green screen image into the trained image matting model to obtain a foreground image and a channel transparent image as the image matting result of the green screen image.
6. The attention-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the training image is generated by selecting a foreground image with green overflow and a corresponding channel transparency image from a green screen data set and synthesizing the foreground image with a green screen background image to obtain the training image.
7. The attention-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the feature encoder comprises a plurality of convolutional layers, and the features output by each convolutional layer are retained as the intermediate features; the feature decoder includes a plurality of convolutional layers, each convolutional layer upsampling the features of the previous layer and concatenating them with the intermediate features;
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss includes an L1 loss, a gradient loss and a Laplacian loss on the channel transparency map, an L1 loss and a Laplacian loss on the foreground map, and an L2 loss on the error map.
8. The attention mechanism-based high-resolution real-time automatic green screen matting system according to claim 5, wherein the training image is generated by randomly selecting a foreground image F with green overflow and a corresponding channel transparency image α from a green screen matting dataset containing green overflow, randomly selecting a background image B from a green screen background dataset, unifying the resolutions of the three, and generating a composite image C according to a composite formula:
C=F×α+B(1-α)
performing down-sampling processing on the synthetic image C to obtain a low-resolution image C';
the feature extraction process of the feature encoder is shown in the following formula, where Feature_m denotes the image features extracted after passing through the convolution modules of the feature encoder and Shortcuts denotes the intermediate features of each convolution block;
Feature_m, Shortcuts = MobileNetV2(C′)
the sampling of the void space convolution pooling pyramid module is as follows:
Feature_aspp = ASPP(Feature_m)
where Feature_aspp denotes the sampling result output by the void space convolution pooling pyramid module;
the attention module performs feature extraction according to the following formula, where Feature_se denotes the attention feature:
Feature_se = SE(Feature_aspp)
the feature decoder comprises a plurality of convolutional layers, each convolutional layer except the last being followed by a BN layer and a ReLU activation function; before each convolutional layer, bilinear upsampling is applied and the result is concatenated with the intermediate features from the feature encoder; the decoding process of the feature decoder is shown as follows, where UpSample() denotes upsampling, Concat denotes feature concatenation, Shortcuts_i denotes the corresponding i-th layer intermediate features, and ConvBlock_i denotes the i-th convolution module of the decoder;
Feature_up = UpSample(Feature_se)
Feature_cat = Concat(Feature_up, Shortcuts_i)
Feature_conv_i = ConvBlock_i(Feature_cat)
the intermediate result further includes an error map and an implicit feature hidden of the low-definition image, and the first loss is:
L_lr = L_lr^α,l1 + L_lr^α,grad + L_lr^α,lap + L_lr^F,l1 + L_lr^F,lap + L_err
wherein the first loss comprises an L1 loss L_lr^α,l1, a gradient loss L_lr^α,grad and a Laplacian loss L_lr^α,lap on the channel transparency map, an L1 loss L_lr^F,l1 and a Laplacian loss L_lr^F,lap on the foreground map, and a loss L_err on the error map;
the L1 loss measures the difference between the predicted low-definition channel transparency map α_lr and the channel transparency map label (Ground Truth) α_lr^gt of the low-definition image, where i denotes the pixel position:
L_lr^α,l1 = Σ_i |α_lr,i - α_lr,i^gt|
the gradient loss is as follows, where ∇ denotes the gradient:
L_lr^α,grad = ||∇α_lr - ∇α_lr^gt||_1
the Laplacian loss is as follows, where L_pyr^s denotes the s-th level of the Laplacian pyramid and s denotes the pyramid level:
L_lr^α,lap = Σ_s ||L_pyr^s(α_lr) - L_pyr^s(α_lr^gt)||_1
the L1 loss and the Laplacian loss on the foreground map are as follows, where F_lr^gt denotes the foreground map label (Ground Truth) of the low-definition image and F_lr denotes the predicted low-definition foreground map:
L_lr^F,l1 = Σ_i |F_lr,i - F_lr,i^gt|
L_lr^F,lap = Σ_s ||L_pyr^s(F_lr) - L_pyr^s(F_lr^gt)||_1
a standard L2 loss is used for the error map, where err denotes the predicted error map and err^gt denotes the Ground Truth of the error map:
L_err = ||err - err^gt||_2
err^gt = |α_lr - α_lr^gt|
the high-resolution output of the second-stage model is F_hr, α_hr, as shown in the following formula, where DGF denotes the deep guided filter module employed and F_lr, α_lr, hidden denote the low-resolution intermediate results:
F_hr, α_hr = DGF(C, C′, F_lr, α_lr, hidden)
the second loss L_hr includes, in addition to the first loss L_lr, the following terms computed on the high-resolution output: an L1 loss L_hr^α,l1, a gradient loss L_hr^α,grad and a Laplacian loss L_hr^α,lap on the channel transparency map, and an L1 loss L_hr^F,l1 and a Laplacian loss L_hr^F,lap on the foreground map:
L_hr = L_lr + L_hr^α,l1 + L_hr^α,grad + L_hr^α,lap + L_hr^F,l1 + L_hr^F,lap
wherein:
L_hr^α,l1 = Σ_i |α_hr,i - α_hr,i^gt|
L_hr^α,grad = ||∇α_hr - ∇α_hr^gt||_1
L_hr^α,lap = Σ_s ||L_pyr^s(α_hr) - L_pyr^s(α_hr^gt)||_1
L_hr^F,l1 = Σ_i |F_hr,i - F_hr,i^gt|
L_hr^F,lap = Σ_s ||L_pyr^s(F_hr) - L_pyr^s(F_hr^gt)||_1
where α_hr^gt and F_hr^gt denote the channel transparency map label and the foreground map label (Ground Truth) of the training image.
9. a storage medium storing a program for executing the high resolution real-time automatic green screen matting method based on attention mechanism according to any one of claims 1 to 4.
10. A client for the high resolution real-time automatic green screen matting system based on the attention mechanism as claimed in any one of claims 5 to 8.
CN202211029515.5A 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism Pending CN115457266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211029515.5A CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211029515.5A CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Publications (1)

Publication Number Publication Date
CN115457266A true CN115457266A (en) 2022-12-09

Family

ID=84300809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211029515.5A Pending CN115457266A (en) 2022-08-25 2022-08-25 High-resolution real-time automatic green screen image matting method and system based on attention mechanism

Country Status (1)

Country Link
CN (1) CN115457266A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746264A (en) * 2023-12-07 2024-03-22 河北翔拓航空科技有限公司 Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
CN118115734A (en) * 2024-01-25 2024-05-31 浪潮智能终端有限公司 Portrait matting method based on general attention mechanism ESP-CA


Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN109671023B (en) Face image super-resolution secondary reconstruction method
Guo et al. Dense scene information estimation network for dehazing
CN115457266A (en) High-resolution real-time automatic green screen image matting method and system based on attention mechanism
CN113408471B (en) Non-green-curtain portrait real-time matting algorithm based on multitask deep learning
CN112884776B (en) Deep learning matting method based on synthesis data set augmentation
CN111145290B (en) Image colorization method, system and computer readable storage medium
CN114549574A (en) Interactive video matting system based on mask propagation network
Guo et al. Dense123'color enhancement dehazing network
WO2023066173A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
CN109191392A (en) A kind of image super-resolution reconstructing method of semantic segmentation driving
CN112489056A (en) Real-time human body matting method suitable for mobile terminal
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN112288630A (en) Super-resolution image reconstruction method and system based on improved wide-depth neural network
CN116664435A (en) Face restoration method based on multi-scale face analysis map integration
Xiao et al. Image hazing algorithm based on generative adversarial networks
CN113240701A (en) Real-time high-resolution opera character matting method under non-green curtain
CN117097853A (en) Real-time image matting method and system based on deep learning
CN114463189A (en) Image information analysis modeling method based on dense residual UNet
WO2023010981A1 (en) Encoding and decoding methods and apparatus
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN113627342B (en) Method, system, equipment and storage medium for video depth feature extraction optimization
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN110111254B (en) Depth map super-resolution method based on multi-stage recursive guidance and progressive supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination