CN117097853A - Real-time image matting method and system based on deep learning - Google Patents
Real-time image matting method and system based on deep learning
- Publication number
- CN117097853A (application number CN202311031197.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- matting
- sub
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/2628—Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a real-time image matting method and system based on deep learning. The method comprises the following steps: S1: acquiring a matting data set; S2: constructing a matting network model based on a ViT and CNN mixed structure; S3: training the model with the data set and correcting it with a loss function to obtain a trained model; S4: inputting the image file to be matted, or a video frame captured by a camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time. Aiming at the problem of unstable matting results against complex backgrounds, the invention integrates a self-attention mechanism to strengthen the global information extraction capability, reduces the possibility of semantic misjudgment of foreground and background pixels, and ensures the precision of the matting results. Meanwhile, the invention can process video data in real time without additional constraints, has low usage cost, and can be used in a variety of non-professional scenarios.
Description
Technical Field
The invention belongs to the technical field of image processing, in particular computer vision, and specifically relates to a real-time image matting method and system based on deep learning.
Background
Image matting is a popular technique in the field of computer vision. It can effectively separate foreground objects of interest from pictures or videos and is widely applied in commercial fields such as live television broadcasting, film special effects and advertising. The mathematical model of the technique is shown in formula (1):
I=αF+(1-α)B (1)
where I is a given picture or video frame, F is the foreground image, B is the background image, and α is the alpha map, i.e., the opacity of the foreground pixels. With only I known, the other three unknowns cannot be derived from this formula, so the problem is under-constrained.
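As a concrete illustration, formula (1) can be applied directly to tensors; the following minimal PyTorch sketch (the tensor shapes and value ranges are illustrative assumptions, not part of the original disclosure) composites a foreground over a background given an alpha map.

```python
import torch

def composite(foreground: torch.Tensor, background: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Apply formula (1): I = alpha * F + (1 - alpha) * B.

    foreground, background: float tensors of shape (3, H, W) with values in [0, 1].
    alpha: float tensor of shape (1, H, W) in [0, 1], the opacity of the foreground pixels.
    """
    return alpha * foreground + (1.0 - alpha) * background

# Example: a half-transparent foreground over a background.
F = torch.rand(3, 256, 256)
B = torch.rand(3, 256, 256)
alpha = torch.full((1, 256, 256), 0.5)
I = composite(F, B, alpha)
```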
Traditional matting methods are based on sampling and propagation, artificially adding constraints to formula (1) by assuming that the colors of different pixels in an image obey a certain functional relationship. Such methods do not fully utilize the contextual information of the image: when the colors of foreground and background pixels are similar, misjudgment easily occurs, matting precision is low, and the results are unstable.
Because of the various shortcomings of traditional methods, two deep-learning approaches are now mainly used for natural-background matting: image matting based on convolutional neural networks (Convolutional Neural Network, CNN) and image matting based on the Vision Transformer (ViT). CNN-based methods construct a convolutional neural network model from convolution, pooling and activation layers and are the more conventional deep-learning approach; ViT-based methods use ViT modules to construct the neural network model, which may be a pure ViT structure or a mixed structure of CNN and ViT. ViT possesses a self-attention mechanism, can capture the relevance of long-range pixels in an image and model the image's global information, achieves higher precision than CNN, and is an emerging technique in the field of computer vision.
A search of the prior literature found the following related work:
Robust Video Matting (RVM) (Lin Shanchuan, Yang Linjie, Saleemi I, et al. "Robust High-Resolution Video Matting with Temporal Guidance", Proc. of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, HI, USA: IEEE Press, 2022: 3132-3141) uses MobileNet as the backbone network to achieve real-time matting against natural backgrounds. The method has high real-time performance and good matting precision against simple backgrounds. However, because it still uses a conventional CNN structure, its ability to process global information is weak, so foreground and background pixels are easily confused against complex backgrounds.
VMFormer (Video Matting with Transformer) (Li Jiachen, Goel V, Ohanyan M, et al. "VMFormer: End-to-End Video Matting with Transformer", https://arxiv.org/abs/2208.12801) overcomes the shortcomings of CNN structures in processing images by introducing the Vision Transformer for image feature extraction and feature-map decoding. The method uses a large number of plain Vision Transformer structures in both the encoder and the decoder, so the constructed network model has many parameters, roughly twice as many as the RVM method. Experiments show that the network model proposed by VMFormer can only process 1080p images at about 3 frames per second on an Nvidia GeForce RTX 4060 GPU, making real-time processing difficult to achieve. At present, the method is only suitable for processing existing video files, cannot be applied to fields such as live broadcasting, and its usage scenarios are limited.
Disclosure of Invention
Aiming at the poor precision of existing CNN models on complex backgrounds in real-time natural-background matting tasks, the invention provides a technical scheme that uses the self-attention mechanism of the ViT model to strengthen global relationship modeling, reduces the frequency of semantic recognition errors on image pixels, and thereby realizes high-resolution, high-precision video matting while ensuring real-time performance.
In order to solve the problems, the invention adopts the following technical scheme:
a real-time image matting method based on deep learning comprises the following steps:
s1: acquiring a matting data set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model by using the data set, and correcting the model by using the loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Preferably, in step S1, the matting data set specifically comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set, all consisting of 360p, 720p or 1080p images. The matting foreground data sets comprise foreground images and their corresponding alpha maps.
Preferably, in step S2, when constructing the image matting network model based on the ViT and CNN hybrid structure, the following method is adopted:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
Preferably, in step S2.1, an original image resampling sub-network is constructed, specifically comprising the steps of:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 to the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter (Deep Guided Filter, DGF) to recover the original-resolution alpha map F1.4.
Preferably, in step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically:
the 3 Mobile ViT V3 modules are embedded into 17 inverted residual blocks of the Mobile Net V3 Larget to form an encoder sub-network, and 3 jump connection feature graphs F2.1, F2.2 and F2.3 are led out from the sub-network. The end of the encoder network outputs a profile F2.4.
Preferably, in step S2.3, a bottleneck block sub-network is constructed, specifically:
the bottleneck block sub-networks are formed by sequentially connecting a convolution block attention module (Convolutional Block Attention Module, CBAM), LR-ASPP, conv-GRU and a Content-aware feature reconstruction (Content-Aware Reassembly of FEatures, CARAFE) up-sampling operator. The subnetwork accepts as input the profile F2.4 and outputs the profile F3.
Preferably, in step S2.4, a cyclic decoder sub-network based on attention and content awareness is built, and the specific module structure is:
3 decoder modules D1, D2 and D3 are constructed, each by the following method: a decoder module is formed by connecting a convolution layer, a normalization layer, an activation layer, a Conv-GRU layer and a CARAFE up-sampling operator in sequence.
Preferably, in step S2.4, a cyclic decoder sub-network based on attention and content perception is built, specifically the steps are:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
Preferably, in step S3, the loss function used is specifically the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
Preferably, in step S4, the input to the model is specifically: the original image F1.1 and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment, where the cyclic feature maps are optional inputs and are not required when processing a single picture.
Preferably, in step S4, the output of the model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
The invention also discloses a real-time image matting system based on deep learning, which is used for executing the method and comprises the following modules:
a data set acquisition module: acquiring a matting data set, wherein the matting data set comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Aiming at the problem of unstable image matting results under a complex background, the invention integrates a self-attention mechanism to strengthen the global information extraction capability, reduces the possibility of semantic misjudgment of foreground and background pixels, and improves the precision of the image matting results; meanwhile, the invention can process video data in real time without additional constraint, has low use cost and can be used for various non-professional scenes.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
fig. 1 is a flowchart of a real-time matting method based on deep learning according to a preferred embodiment of the present invention;
FIG. 2 is a network framework of the present invention;
FIG. 3 is a block diagram of an encoder sub-network according to the present invention;
FIG. 4 is a diagram of a decoder subnetwork configuration of the present invention;
fig. 5 is a block diagram of a real-time matting system based on deep learning according to a preferred embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Aiming at the poor precision of common CNN models on complex backgrounds in real-time natural-background matting tasks, the invention provides a technical scheme that uses the self-attention mechanism of the ViT model to strengthen global relationship modeling, reduces the frequency of semantic recognition errors on image pixels, and thereby realizes high-resolution, high-precision video matting while ensuring real-time performance.
As shown in fig. 1-4, the real-time image matting method based on deep learning in this embodiment specifically includes the following steps:
s1: acquiring a matting data set, wherein the matting data set comprises foregrounds, the corresponding alpha maps of the foregrounds, and backgrounds, and dividing the data set into a training set, a validation set and a test set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model in the step S2 by utilizing a data set, and correcting by a loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
The steps are specifically described below.
In step S1, the foreground data sets for picture and video matting consist of foreground images and their corresponding ground-truth alpha maps. The video matting data set adopted in this embodiment is VideoMatte240K; the picture matting data sets are AIM-500, Adobe Image Matting Dataset, Distinctions-646, PPM-100 and P3M-10K; the background data sets are DVM and Indoor CVPR 09; the portrait segmentation data sets are COCO, Supervisely Person Dataset and YouTubeVIS 2021.
Since the ViT structure relies more heavily on data augmentation of the data set to overcome its lack of inductive bias, this embodiment mainly applies the following data augmentation operations to the images (a simple sketch of several of them is given after the list):
(1) Rotation: rotating the image by 90, 180 or 270 degrees about its centre;
(2) Translation: translating the image foreground away from its original position;
(3) Stretching: stretching the image obliquely at a certain angle;
(4) Scaling: reducing or enlarging the image foreground by a random factor;
(5) Cropping: cropping out a part of the image foreground;
(6) Colour change: converting the image from its original colours to grey scale;
(7) Noise: randomly adding noise points of different densities to the image;
(8) Trimming: for the video matting data set, cutting a segment of random duration out of the complete video clip;
(9) Reversal: for the video matting data set, training with the order of the video frames reversed;
(10) Frame extraction: for the video matting data set, deleting a video frame at certain intervals within a video clip.
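A minimal sketch of several of the augmentation operations listed above (parameter values such as the noise strength and the frame-drop interval are illustrative assumptions, not values specified by the embodiment):

```python
import random
import torch

def rotate_90(img: torch.Tensor) -> torch.Tensor:
    """(1) Rotation: rotate a (C, H, W) image by 90, 180 or 270 degrees about its centre."""
    k = random.choice([1, 2, 3])
    return torch.rot90(img, k, dims=(1, 2))

def to_grayscale(img: torch.Tensor) -> torch.Tensor:
    """(6) Colour change: convert an RGB (3, H, W) image to grey scale, replicated to 3 channels."""
    grey = (0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]).unsqueeze(0)
    return grey.repeat(3, 1, 1)

def add_noise(img: torch.Tensor, strength: float = 0.05) -> torch.Tensor:
    """(7) Noise: add random noise of a given strength and clamp back to [0, 1]."""
    return (img + strength * torch.randn_like(img)).clamp(0.0, 1.0)

def reverse_clip(frames: list) -> list:
    """(9) Reversal: train on a video clip with the frame order reversed."""
    return frames[::-1]

def drop_frames(frames: list, interval: int = 3) -> list:
    """(10) Frame extraction: delete one frame every `interval` frames of a clip."""
    return [f for i, f in enumerate(frames) if (i + 1) % interval != 0]
```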
In step S2, a matting network model based on a ViT and CNN hybrid structure is constructed, and the method specifically further includes the following sub-steps:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
In step S2.1, the role of constructing the original image resampling sub-network is to process the high resolution image faster, which is not necessary if there is no requirement for real-time or the resolution of the processed image is low.
In step S2.1, an original image resampling sub-network is constructed, specifically comprising the following sub-steps:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 to the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter (Deep Guided Filter, DGF) to recover the original-resolution alpha map F1.4 (a simplified sketch of guided-filter upsampling follows).
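The deep guided filter is a learnable variant of the guided filter. The sketch below shows plain, non-learned guided-filter upsampling only to illustrate how a low-resolution alpha map such as F1.3 can be restored to full resolution using the original image as guidance; it is a simplified stand-in for the DGF, and it assumes a single-channel guidance image.

```python
import torch
import torch.nn.functional as F

def box_filter(x: torch.Tensor, r: int) -> torch.Tensor:
    """Mean filter over a (2r+1) x (2r+1) window."""
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r, count_include_pad=False)

def guided_filter_upsample(guide_lr: torch.Tensor, alpha_lr: torch.Tensor,
                           guide_hr: torch.Tensor, r: int = 1, eps: float = 1e-4) -> torch.Tensor:
    """Fit a local linear model alpha ~= a * guide + b at low resolution,
    upsample the coefficients, and apply them to the high-resolution guide.

    guide_lr: (N, 1, h, w) single-channel (e.g. grey-scale) version of the downsampled image
    alpha_lr: (N, 1, h, w) low-resolution alpha map (the role of F1.3)
    guide_hr: (N, 1, H, W) single-channel version of the full-resolution image (the role of F1.1)
    """
    mean_i = box_filter(guide_lr, r)
    mean_a = box_filter(alpha_lr, r)
    cov_ia = box_filter(guide_lr * alpha_lr, r) - mean_i * mean_a
    var_i = box_filter(guide_lr * guide_lr, r) - mean_i * mean_i
    a = cov_ia / (var_i + eps)
    b = mean_a - a * mean_i
    size = guide_hr.shape[-2:]
    a_hr = F.interpolate(a, size=size, mode="bilinear", align_corners=False)
    b_hr = F.interpolate(b, size=size, mode="bilinear", align_corners=False)
    return (a_hr * guide_hr + b_hr).clamp(0.0, 1.0)  # full-resolution alpha map (the role of F1.4)
```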
In step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically: 3 Mobile ViT V3 modules are embedded among the 17 inverted residual blocks of MobileNetV3-Large, constituting the encoder sub-network.
Specifically, in step S2.2, a structure composed of the 17 inverted residual blocks of MobileNetV3-Large is adopted, and a Mobile ViT V3 module is embedded behind the 4th, 6th and 9th inverted residual blocks to form the hybrid structure. In addition, 3 skip-connection feature maps F2.1, F2.2 and F2.3 are led out at the positions of the 2nd inverted residual block, the 1st Mobile ViT block and the 2nd Mobile ViT block. The encoder accepts the downsampled original image F1.2 as input and outputs feature map F2.4.
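A structural sketch of this hybrid encoder wiring follows. The placeholder `Stub` module stands in for both the MobileNetV3-Large inverted residual blocks and the Mobile ViT V3 blocks (an assumption for brevity; real channel widths and strides follow MobileNetV3-Large), so the sketch shows only where the ViT blocks and the skip connections are placed.

```python
import torch
import torch.nn as nn

class Stub(nn.Module):
    """Placeholder for an inverted residual block or a Mobile ViT V3 block."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Conv2d(in_ch, out_ch, 3, padding=1)
    def forward(self, x):
        return self.body(x)

class HybridEncoder(nn.Module):
    """17 inverted residual blocks with ViT blocks after blocks 4, 6 and 9;
    skip features taken after block 2, after the 1st ViT block and after the 2nd ViT block."""
    def __init__(self):
        super().__init__()
        channels = [16] * 18  # illustrative only
        self.blocks = nn.ModuleList()
        self.skip_points = []
        for i in range(1, 18):                                  # inverted residual blocks 1..17
            self.blocks.append(Stub(channels[i - 1], channels[i]))
            if i == 2:
                self.skip_points.append(len(self.blocks) - 1)   # F2.1
            if i in (4, 6, 9):
                self.blocks.append(Stub(channels[i], channels[i]))  # Mobile ViT V3 block
                if i in (4, 6):
                    self.skip_points.append(len(self.blocks) - 1)   # F2.2, F2.3
    def forward(self, x):
        skips = []
        for idx, block in enumerate(self.blocks):
            x = block(x)
            if idx in self.skip_points:
                skips.append(x)
        return skips, x  # (F2.1, F2.2, F2.3), F2.4
```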
In step S2.3, a bottleneck block sub-network is constructed by sequentially connecting CBAM, LR-ASPP, Conv-GRU and the CARAFE up-sampling operator.
Specifically, in step S2.3, the bottleneck block sub-network accepts the feature map F2.4 as input and outputs the feature map F3. In addition, the Conv-GRU layer in this structure accepts the cyclic feature map T1.1 from the previous moment as a constraint input and outputs the cyclic feature map T2.1 for the current moment, which serves as its own constraint input at the next moment.
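The Conv-GRU layer that carries the cyclic feature map (T1.1 in, T2.1 out) can be realised as a convolutional GRU cell. The cell equations below are one standard formulation assumed here for illustration; the patent does not spell them out.

```python
from typing import Optional

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: the hidden state plays the role of the cyclic feature map."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor] = None) -> torch.Tensor:
        if h is None:  # first frame: no cyclic feature map from a previous moment
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state, i.e. the cyclic feature map T2.1

# One step on one frame's bottleneck features; `h` would be T1.1 from the previous moment.
cell = ConvGRUCell(channels=64)
features = torch.randn(1, 64, 18, 32)
t2_1 = cell(features, h=None)
```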
In step S2.4, a cyclic decoder sub-network based on attention and content awareness is built, specifically: 3 decoder modules D1, D2 and D3 are constructed and connected in sequence, and a module consisting of two groups of convolution, normalization and activation layers is added at the end to form the decoder sub-network.
Specifically, in step S2.4, a cyclic decoder sub-network based on attention and content perception is constructed as follows:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
Specifically, in step S2.4, for each Conv-GRU layer in the decoder modules D1, D2, D3, the cyclic feature maps T1.2, T1.3, T1.4 from the previous time are received as constraint inputs, and the cyclic feature maps T2.2, T2.3, T2.4 from the present time are output for constraint inputs of the Conv-GRU layer at the corresponding position at the next time.
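A wiring sketch of the decoder data flow in steps S2.4.1 to S2.4.4 follows. The CBAM layers, the decoder modules D1 to D3 and the final convolution group are passed in as ready-made callables; their internals are omitted, and the recurrent inputs of the Conv-GRU layers are assumed to be handled inside the decoder modules.

```python
import torch
import torch.nn.functional as F

def down(x: torch.Tensor, factor: int) -> torch.Tensor:
    """Downsample the low-resolution original image F1.2 by the given factor (steps S2.4.1-S2.4.3)."""
    return F.avg_pool2d(x, kernel_size=factor)

def decode(f1_2, skips, f3, cbams, decoders, out_block):
    """Data flow of steps S2.4.1 to S2.4.4.

    skips     = (F2.1, F2.2, F2.3): skip-connection feature maps from the encoder
    cbams     = three CBAM layers, one per skip connection
    decoders  = decoder modules (D1, D2, D3)
    out_block = the final module of two groups of convolution, normalization and activation layers
    """
    x = f3
    for factor, skip, cbam, dec in zip((8, 4, 2), skips, cbams, decoders):
        x = dec(down(f1_2, factor), cbam(skip), x)  # yields F4.1.3, F4.2.3, F4.3.3 in turn
    return out_block(torch.cat([f1_2, x], dim=1))   # low-resolution alpha map F1.3
```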
Table 1 Ablation experiments
Three ablation experiments were performed in total. Ablation model 1 removes the Mobile ViT modules from the encoder, and ablation model 2 removes not only the Mobile ViT modules but also the CBAM and CARAFE operators in the decoder. The results show that removing either the Mobile ViT modules in the encoder or the CBAM and CARAFE operators in the decoder noticeably degrades the precision of the network model's predicted alpha maps; the more complete the model, the higher the matting precision.
In step S3, training the model with the data set specifically means: first training on the 720p video matting data set, then training on the 1080p video matting data set, and finally training on the picture matting data set. Portrait segmentation training is interspersed among these steps and is performed once after every few rounds of matting training.
Specifically, in step S3, the training is corrected using the loss function. The loss function used is the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
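A PyTorch sketch of this loss is given below. The decomposition into L1, Laplacian pyramid and temporal terms follows the description above; the pyramid construction, the number of pyramid levels and the equal weighting of the three terms are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 5) -> list:
    """Build a Laplacian pyramid by repeated 2x downsampling (the exact construction is assumed;
    the text only states that L^i_pyr denotes the i-th pyramid level)."""
    pyramid = []
    current = x
    for _ in range(levels):
        low = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(low, size=current.shape[-2:], mode="bilinear", align_corners=False)
        pyramid.append(current - up)
        current = low
    return pyramid

def matting_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of L1 loss, Laplacian pyramid loss and temporal continuity loss.

    pred, target: alpha maps of shape (T, 1, H, W), a short sequence of frames.
    """
    l1 = F.l1_loss(pred, target)
    lap = sum(F.l1_loss(p, t) for p, t in zip(laplacian_pyramid(pred), laplacian_pyramid(target)))
    tc = F.mse_loss(pred[1:] - pred[:-1], target[1:] - target[:-1])  # frame-to-frame continuity
    return l1 + lap + tc
```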
In step S4, the input to the model is specifically: the original image F1.1 and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment, where the cyclic feature maps are optional inputs and are not required when processing a single picture.
In step S4, the output of the model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
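This real-time use can be sketched as a simple capture loop that feeds each frame together with the cyclic feature maps from the previous moment back into the model. The exported model file name and its exact call signature are hypothetical; only the input/output contract described above (frame plus previous cyclic feature maps in, alpha map plus new cyclic feature maps out) is taken from the text.

```python
import cv2
import torch

model = torch.jit.load("matting_model.pt").eval()  # hypothetical exported model
rec = [None, None, None, None]                     # cyclic feature maps T1.1-T1.4 (none at the start)

cap = cv2.VideoCapture(0)                          # camera input; a video file path also works
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # HWC uint8 BGR -> NCHW float RGB in [0, 1]
        src = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).permute(2, 0, 1)
        src = src.unsqueeze(0).float() / 255.0
        alpha, *rec = model(src, *rec)              # F1.4 and T2.1-T2.4, fed back on the next frame
        cv2.imshow("alpha", (alpha[0, 0] * 255).byte().numpy())
        if cv2.waitKey(1) == 27:                    # Esc to quit
            break
cap.release()
```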
The model makes full use of the global extraction of image features by the self-attention mechanism and of the improvement in overall model accuracy brought by the content-aware mechanism; it overcomes the insensitivity of conventional CNN networks to relationships between long-range pixels and can accurately distinguish the semantics of foreground and background pixels against complex backgrounds. The decoder's use of the attention and content-aware mechanisms improves the accuracy of the reconstructed image and gives clearer detail, and the deep guided filter increases the speed of processing high-resolution images. The invention therefore has good application value in matting tasks with requirements on both real-time performance and precision.
As shown in fig. 5, this embodiment discloses a real-time matting system based on deep learning, which is configured to execute the above method embodiment, and includes the following modules:
a data set acquisition module: acquiring a matting data set, wherein the matting data set comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
For other content in this embodiment, reference may be made to the above-described method embodiments.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.
Claims (10)
1. A real-time image matting method based on deep learning, characterized by comprising the following steps:
s1: acquiring a matting data set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model in the step S2 by utilizing a data set, and correcting by a loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
2. The method of claim 1 wherein the data set in step S1 comprises a video matting foreground data set, a video background data set, a picture matting data set, and a portrait segmentation data set.
3. The method of claim 1, wherein in step S2, a matting network model based on a ViT and CNN hybrid structure is constructed, specifically as follows:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
4. A method according to claim 3, characterized in that in step S2.1, an original image resampling sub-network is built, comprising in particular the steps of:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 into the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter, so that the alpha map F1.4 at the original resolution is restored.
5. A method according to claim 3, characterized in that in step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically as follows: 3 Mobile ViT V3 modules are respectively embedded behind the 4th, 6th and 9th inverted residual blocks of MobileNetV3-Large to form the encoder sub-network, and 3 skip-connection feature maps F2.1, F2.2 and F2.3 are led out at the positions of the 2nd inverted residual block, the 1st Mobile ViT block and the 2nd Mobile ViT block of the sub-network; the end of the encoder network outputs feature map F2.4.
6. A method according to claim 3, characterized in that in step S2.3, a bottleneck block sub-network is built, specifically: the convolutional block attention module (CBAM), LR-ASPP, Conv-GRU and the content-aware reassembly of features (CARAFE) up-sampling operator are sequentially connected to form the bottleneck block sub-network; the sub-network accepts the feature map F2.4 as input and outputs the feature map F3.
7. Method according to any of claims 3-6, characterized in that in step S2.4, a cyclic decoder sub-network based on attention and content perception is built, specifically: a convolution layer, a normalization layer, an activation layer, a Conv-GRU layer and a CARAFE up-sampling operator are connected in sequence to form a decoder module, and 3 such decoder modules D1, D2 and D3 are constructed;
and/or, in step S2.4, constructing a cyclic decoder sub-network based on attention and content perception, specifically as follows:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
8. The method according to any one of claims 1-6, wherein in step S3, the loss function is specifically the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
9. The method according to any one of claims 1-6, wherein in step S4, the content input to the model is specifically: the image file to be matted or the video frame F1.1 obtained by the camera, and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment; if the current moment is the starting moment, the cyclic feature maps need not be input;
and/or, in step S4, the content output by the trained model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
10. A deep learning based real-time matting system for performing a method as claimed in any one of claims 1 to 9, comprising the following modules:
a data set acquisition module: acquiring a matting data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311031197.0A CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311031197.0A CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117097853A true CN117097853A (en) | 2023-11-21 |
Family
ID=88778039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311031197.0A Pending CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117097853A (en) |
-
2023
- 2023-08-16 CN CN202311031197.0A patent/CN117097853A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351118A (en) * | 2023-12-04 | 2024-01-05 | 江西师范大学 | Lightweight fixed background matting method and system combined with depth information |
CN117351118B (en) * | 2023-12-04 | 2024-02-23 | 江西师范大学 | Lightweight fixed background matting method and system combined with depth information |
Similar Documents
Publication | Title
---|---
CN113362223B (en) | Image super-resolution reconstruction method based on attention mechanism and two-channel network
Zhang et al. | DCSR: Dilated convolutions for single image super-resolution
Wang et al. | ESRGAN: Enhanced super-resolution generative adversarial networks
US10614574B2 (en) | Generating image segmentation data using a multi-branch neural network
CN110415172B (en) | Super-resolution reconstruction method for face area in mixed resolution code stream
WO2023010831A1 (en) | Method, system and apparatus for improving image resolution, and storage medium
CN113139551A (en) | Improved semantic segmentation method based on DeepLabv3+
CN110717921B (en) | Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN112258436A (en) | Training method and device of image processing model, image processing method and model
CN113724136B (en) | Video restoration method, device and medium
CN117097853A (en) | Real-time image matting method and system based on deep learning
CN113902925A (en) | Semantic segmentation method and system based on deep convolutional neural network
CN114723760A (en) | Portrait segmentation model training method and device and portrait segmentation method and device
CN111524060B (en) | System, method, storage medium and device for blurring portrait background in real time
CN112489056A (en) | Real-time human body matting method suitable for mobile terminal
CN113379606A (en) | Face super-resolution method based on pre-training generation model
CN109615576A (en) | Single-frame image super-resolution reconstruction method based on cascaded regression base learning
CN115457266A (en) | High-resolution real-time automatic green screen image matting method and system based on attention mechanism
CN112200817A (en) | Sky region segmentation and special effect processing method, device and equipment based on image
Hua et al. | Dynamic scene deblurring with continuous cross-layer attention transmission
Wan et al. | Progressive convolutional transformer for image restoration
CN111950496B (en) | Mask person identity recognition method
CN116362995A (en) | Tooth image restoration method and system based on standard prior
CN110853040B (en) | Image collaborative segmentation method based on super-resolution reconstruction
CN114219738A (en) | Single-image multi-scale super-resolution reconstruction network structure and method
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |