CN117097853A - Real-time image matting method and system based on deep learning - Google Patents
Real-time image matting method and system based on deep learning
- Publication number
- CN117097853A (application number CN202311031197.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- matting
- sub
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/2628—Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a real-time image matting method and system based on deep learning. The method comprises the following steps: S1: acquiring a matting data set; S2: constructing a matting network model based on a ViT and CNN mixed structure; S3: training the model with the data set and correcting it with a loss function to obtain a trained model; S4: inputting the image file to be matted, or a video frame captured by a camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time. Aiming at the problem of unstable matting results against complex backgrounds, the invention integrates a self-attention mechanism to strengthen the global information extraction capability, reduces the possibility of semantic misjudgment of foreground and background pixels, and ensures the precision of the matting results. Meanwhile, the invention can process video data in real time without additional constraints, has low usage cost, and can be used in a variety of non-professional scenarios.
Description
Technical Field
The invention belongs to the technical field of image processing, in particular computer vision, and specifically relates to a real-time image matting method and system based on deep learning.
Background
Image matting is a popular technique in the field of computer vision. It can effectively separate foreground objects of interest from pictures or videos and is widely applied in commercial fields such as live television broadcasting, film special effects and advertising. The mathematical model of the technique is shown in formula (1):
I=αF+(1-α)B (1)
where I is a given picture or video frame, F is the foreground image, B is the background image, and α is the alpha map, i.e., the opacity of the foreground pixels. With only I known, the other three unknowns cannot be derived from this formula, so the problem is under-constrained.
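As a concrete illustration, formula (1) can be applied directly to tensors; the following minimal PyTorch sketch (the tensor shapes and value ranges are illustrative assumptions, not part of the original disclosure) composites a foreground over a background given an alpha map.

```python
import torch

def composite(foreground: torch.Tensor, background: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Apply formula (1): I = alpha * F + (1 - alpha) * B.

    foreground, background: float tensors of shape (3, H, W) with values in [0, 1].
    alpha: float tensor of shape (1, H, W) in [0, 1], the opacity of the foreground pixels.
    """
    return alpha * foreground + (1.0 - alpha) * background

# Example: a half-transparent foreground over a background.
F = torch.rand(3, 256, 256)
B = torch.rand(3, 256, 256)
alpha = torch.full((1, 256, 256), 0.5)
I = composite(F, B, alpha)
```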
Traditional matting methods are based on sampling and propagation, artificially adding constraints to formula (1) by assuming that the colors of different pixels in an image obey a certain functional relationship. Such methods do not fully utilize the contextual information of the image: when the colors of foreground and background pixels are similar, misjudgment easily occurs, matting precision is low, and the results are unstable.
Because of the various shortcomings of traditional methods, two deep-learning approaches are now mainly used for natural-background matting: image matting based on convolutional neural networks (Convolutional Neural Network, CNN) and image matting based on the Vision Transformer (ViT). CNN-based methods construct a convolutional neural network model from convolution, pooling and activation layers and are the more conventional deep-learning approach; ViT-based methods use ViT modules to construct the neural network model, which may be a pure ViT structure or a mixed structure of CNN and ViT. ViT possesses a self-attention mechanism, can capture the relevance of long-range pixels in an image and model the image's global information, achieves higher precision than CNN, and is an emerging technique in the field of computer vision.
A search of the prior literature found the following related work:
Robust Video Matting (RVM) (Lin Shanchuan, Yang Linjie, Saleemi I, et al. "Robust High-Resolution Video Matting with Temporal Guidance", Proc. of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, HI, USA: IEEE Press, 2022: 3132-3141) uses MobileNet as the backbone network to achieve real-time matting against natural backgrounds. The method has high real-time performance and good matting precision against simple backgrounds. However, because it still uses a conventional CNN structure, its ability to process global information is weak, so foreground and background pixels are easily confused against complex backgrounds.
VMFormer (Video Matting with Transformer) (Li Jiachen, Goel V, Ohanyan M, et al. "VMFormer: End-to-End Video Matting with Transformer", https://arxiv.org/abs/2208.12801) overcomes the shortcomings of CNN structures in processing images by introducing the Vision Transformer for image feature extraction and feature-map decoding. The method uses a large number of plain Vision Transformer structures in both the encoder and the decoder, so the constructed network model has many parameters, roughly twice as many as the RVM method. Experiments show that the network model proposed by VMFormer can only process 1080p images at about 3 frames per second on an Nvidia GeForce RTX 4060 GPU, making real-time processing difficult to achieve. At present, the method is only suitable for processing existing video files, cannot be applied to fields such as live broadcasting, and its usage scenarios are limited.
Disclosure of Invention
Aiming at the poor precision of existing CNN models on complex backgrounds in real-time natural-background matting tasks, the invention provides a technical scheme that uses the self-attention mechanism of the ViT model to strengthen global relationship modeling, reduces the frequency of semantic recognition errors on image pixels, and thereby realizes high-resolution, high-precision video matting while ensuring real-time performance.
In order to solve the problems, the invention adopts the following technical scheme:
a real-time image matting method based on deep learning comprises the following steps:
s1: acquiring a matting data set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model by using the data set, and correcting the model by using the loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Preferably, in step S1, the matting data set specifically comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set, all consisting of 360p, 720p or 1080p images. The matting foreground data sets comprise foreground images and their corresponding alpha maps.
Preferably, in step S2, when constructing the image matting network model based on the ViT and CNN hybrid structure, the following method is adopted:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
Preferably, in step S2.1, an original image resampling sub-network is constructed, specifically comprising the steps of:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 to the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter (Deep Guided Filter, DGF) to recover the original-resolution alpha map F1.4.
Preferably, in step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically:
the 3 Mobile ViT V3 modules are embedded into 17 inverted residual blocks of the Mobile Net V3 Larget to form an encoder sub-network, and 3 jump connection feature graphs F2.1, F2.2 and F2.3 are led out from the sub-network. The end of the encoder network outputs a profile F2.4.
Preferably, in step S2.3, a bottleneck block sub-network is constructed, specifically:
the bottleneck block sub-networks are formed by sequentially connecting a convolution block attention module (Convolutional Block Attention Module, CBAM), LR-ASPP, conv-GRU and a Content-aware feature reconstruction (Content-Aware Reassembly of FEatures, CARAFE) up-sampling operator. The subnetwork accepts as input the profile F2.4 and outputs the profile F3.
Preferably, in step S2.4, a cyclic decoder sub-network based on attention and content awareness is built, and the specific module structure is:
3 decoder modules D1, D2 and D3 are constructed, each by the following method: a decoder module is formed by connecting a convolution layer, a normalization layer, an activation layer, a Conv-GRU layer and a CARAFE up-sampling operator in sequence.
Preferably, in step S2.4, a cyclic decoder sub-network based on attention and content perception is built, specifically the steps are:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
Preferably, in step S3, the loss function used is specifically the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
Preferably, in step S4, the input to the model is specifically: the original image F1.1 and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment, where the cyclic feature maps are optional inputs and are not required when processing a single picture.
Preferably, in step S4, the output of the model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
The invention also discloses a real-time image matting system based on deep learning, which is used for executing the method and comprises the following modules:
a data set acquisition module: acquiring a matting data set, wherein the matting data set comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Aiming at the problem of unstable image matting results under a complex background, the invention integrates a self-attention mechanism to strengthen the global information extraction capability, reduces the possibility of semantic misjudgment of foreground and background pixels, and improves the precision of the image matting results; meanwhile, the invention can process video data in real time without additional constraint, has low use cost and can be used for various non-professional scenes.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
fig. 1 is a flowchart of a real-time matting method based on deep learning according to a preferred embodiment of the present invention;
FIG. 2 is a network framework of the present invention;
FIG. 3 is a block diagram of an encoder sub-network according to the present invention;
FIG. 4 is a diagram of a decoder subnetwork configuration of the present invention;
fig. 5 is a block diagram of a real-time matting system based on deep learning according to a preferred embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Aiming at the poor precision of common CNN models on complex backgrounds in real-time natural-background matting tasks, the invention provides a technical scheme that uses the self-attention mechanism of the ViT model to strengthen global relationship modeling, reduces the frequency of semantic recognition errors on image pixels, and thereby realizes high-resolution, high-precision video matting while ensuring real-time performance.
As shown in fig. 1-4, the real-time image matting method based on deep learning in this embodiment specifically includes the following steps:
s1: acquiring a matting data set, wherein the matting data set comprises foregrounds, the corresponding alpha maps of the foregrounds, and backgrounds, and dividing the data set into a training set, a validation set and a test set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model in the step S2 by utilizing a data set, and correcting by a loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
The steps are specifically described below.
In step S1, the foreground data sets for picture and video matting consist of foreground images and their corresponding ground-truth alpha maps. The video matting data set adopted in this embodiment is VideoMatte240K; the picture matting data sets are AIM-500, Adobe Image Matting Dataset, Distinctions-646, PPM-100 and P3M-10K; the background data sets are DVM and Indoor CVPR 09; the portrait segmentation data sets are COCO, Supervisely Person Dataset and YouTubeVIS 2021.
Since the ViT structure relies more heavily on data augmentation of the data set to overcome its lack of inductive bias, this embodiment mainly applies the following data augmentation operations to the images (a simple sketch of several of them is given after the list):
(1) Rotation: rotating the image by 90, 180 or 270 degrees about its centre;
(2) Translation: translating the image foreground away from its original position;
(3) Stretching: stretching the image obliquely at a certain angle;
(4) Scaling: reducing or enlarging the image foreground by a random factor;
(5) Cropping: cropping out a part of the image foreground;
(6) Colour change: converting the image from its original colours to grey scale;
(7) Noise: randomly adding noise points of different densities to the image;
(8) Trimming: for the video matting data set, cutting a segment of random duration out of the complete video clip;
(9) Reversal: for the video matting data set, training with the order of the video frames reversed;
(10) Frame extraction: for the video matting data set, deleting a video frame at certain intervals within a video clip.
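A minimal sketch of several of the augmentation operations listed above (parameter values such as the noise strength and the frame-drop interval are illustrative assumptions, not values specified by the embodiment):

```python
import random
import torch

def rotate_90(img: torch.Tensor) -> torch.Tensor:
    """(1) Rotation: rotate a (C, H, W) image by 90, 180 or 270 degrees about its centre."""
    k = random.choice([1, 2, 3])
    return torch.rot90(img, k, dims=(1, 2))

def to_grayscale(img: torch.Tensor) -> torch.Tensor:
    """(6) Colour change: convert an RGB (3, H, W) image to grey scale, replicated to 3 channels."""
    grey = (0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]).unsqueeze(0)
    return grey.repeat(3, 1, 1)

def add_noise(img: torch.Tensor, strength: float = 0.05) -> torch.Tensor:
    """(7) Noise: add random noise of a given strength and clamp back to [0, 1]."""
    return (img + strength * torch.randn_like(img)).clamp(0.0, 1.0)

def reverse_clip(frames: list) -> list:
    """(9) Reversal: train on a video clip with the frame order reversed."""
    return frames[::-1]

def drop_frames(frames: list, interval: int = 3) -> list:
    """(10) Frame extraction: delete one frame every `interval` frames of a clip."""
    return [f for i, f in enumerate(frames) if (i + 1) % interval != 0]
```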
In step S2, a matting network model based on a ViT and CNN hybrid structure is constructed, and the method specifically further includes the following sub-steps:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
In step S2.1, the role of constructing the original image resampling sub-network is to process the high resolution image faster, which is not necessary if there is no requirement for real-time or the resolution of the processed image is low.
In step S2.1, an original image resampling sub-network is constructed, specifically comprising the following sub-steps:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 to the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter (Deep Guided Filter, DGF) to recover the original-resolution alpha map F1.4 (a simplified sketch of guided-filter upsampling follows).
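The deep guided filter is a learnable variant of the guided filter. The sketch below shows plain, non-learned guided-filter upsampling only to illustrate how a low-resolution alpha map such as F1.3 can be restored to full resolution using the original image as guidance; it is a simplified stand-in for the DGF, and it assumes a single-channel guidance image.

```python
import torch
import torch.nn.functional as F

def box_filter(x: torch.Tensor, r: int) -> torch.Tensor:
    """Mean filter over a (2r+1) x (2r+1) window."""
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r, count_include_pad=False)

def guided_filter_upsample(guide_lr: torch.Tensor, alpha_lr: torch.Tensor,
                           guide_hr: torch.Tensor, r: int = 1, eps: float = 1e-4) -> torch.Tensor:
    """Fit a local linear model alpha ~= a * guide + b at low resolution,
    upsample the coefficients, and apply them to the high-resolution guide.

    guide_lr: (N, 1, h, w) single-channel (e.g. grey-scale) version of the downsampled image
    alpha_lr: (N, 1, h, w) low-resolution alpha map (the role of F1.3)
    guide_hr: (N, 1, H, W) single-channel version of the full-resolution image (the role of F1.1)
    """
    mean_i = box_filter(guide_lr, r)
    mean_a = box_filter(alpha_lr, r)
    cov_ia = box_filter(guide_lr * alpha_lr, r) - mean_i * mean_a
    var_i = box_filter(guide_lr * guide_lr, r) - mean_i * mean_i
    a = cov_ia / (var_i + eps)
    b = mean_a - a * mean_i
    size = guide_hr.shape[-2:]
    a_hr = F.interpolate(a, size=size, mode="bilinear", align_corners=False)
    b_hr = F.interpolate(b, size=size, mode="bilinear", align_corners=False)
    return (a_hr * guide_hr + b_hr).clamp(0.0, 1.0)  # full-resolution alpha map (the role of F1.4)
```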
In step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically: 3 Mobile ViT V3 modules are embedded among the 17 inverted residual blocks of MobileNetV3-Large, constituting the encoder sub-network.
Specifically, in step S2.2, a structure composed of the 17 inverted residual blocks of MobileNetV3-Large is adopted, and a Mobile ViT V3 module is embedded behind the 4th, 6th and 9th inverted residual blocks to form the hybrid structure. In addition, 3 skip-connection feature maps F2.1, F2.2 and F2.3 are led out at the positions of the 2nd inverted residual block, the 1st Mobile ViT block and the 2nd Mobile ViT block. The encoder accepts the downsampled original image F1.2 as input and outputs feature map F2.4.
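A structural sketch of this hybrid encoder wiring follows. The placeholder `Stub` module stands in for both the MobileNetV3-Large inverted residual blocks and the Mobile ViT V3 blocks (an assumption for brevity; real channel widths and strides follow MobileNetV3-Large), so the sketch shows only where the ViT blocks and the skip connections are placed.

```python
import torch
import torch.nn as nn

class Stub(nn.Module):
    """Placeholder for an inverted residual block or a Mobile ViT V3 block."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Conv2d(in_ch, out_ch, 3, padding=1)
    def forward(self, x):
        return self.body(x)

class HybridEncoder(nn.Module):
    """17 inverted residual blocks with ViT blocks after blocks 4, 6 and 9;
    skip features taken after block 2, after the 1st ViT block and after the 2nd ViT block."""
    def __init__(self):
        super().__init__()
        channels = [16] * 18  # illustrative only
        self.blocks = nn.ModuleList()
        self.skip_points = []
        for i in range(1, 18):                                  # inverted residual blocks 1..17
            self.blocks.append(Stub(channels[i - 1], channels[i]))
            if i == 2:
                self.skip_points.append(len(self.blocks) - 1)   # F2.1
            if i in (4, 6, 9):
                self.blocks.append(Stub(channels[i], channels[i]))  # Mobile ViT V3 block
                if i in (4, 6):
                    self.skip_points.append(len(self.blocks) - 1)   # F2.2, F2.3
    def forward(self, x):
        skips = []
        for idx, block in enumerate(self.blocks):
            x = block(x)
            if idx in self.skip_points:
                skips.append(x)
        return skips, x  # (F2.1, F2.2, F2.3), F2.4
```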
In step S2.3, a bottleneck block sub-network is constructed by sequentially connecting CBAM, LR-ASPP, Conv-GRU and the CARAFE up-sampling operator.
Specifically, in step S2.3, the bottleneck block sub-network accepts the feature map F2.4 as input and outputs the feature map F3. In addition, the Conv-GRU layer in this structure accepts the cyclic feature map T1.1 from the previous moment as a constraint input and outputs the cyclic feature map T2.1 for the current moment, which serves as its own constraint input at the next moment.
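The Conv-GRU layer that carries the cyclic feature map (T1.1 in, T2.1 out) can be realised as a convolutional GRU cell. The cell equations below are one standard formulation assumed here for illustration; the patent does not spell them out.

```python
from typing import Optional

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: the hidden state plays the role of the cyclic feature map."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor] = None) -> torch.Tensor:
        if h is None:  # first frame: no cyclic feature map from a previous moment
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state, i.e. the cyclic feature map T2.1

# One step on one frame's bottleneck features; `h` would be T1.1 from the previous moment.
cell = ConvGRUCell(channels=64)
features = torch.randn(1, 64, 18, 32)
t2_1 = cell(features, h=None)
```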
In step S2.4, a cyclic decoder sub-network based on attention and content awareness is built, specifically: 3 decoder modules D1, D2 and D3 are constructed and connected in sequence, and a module consisting of two groups of convolution, normalization and activation layers is added at the end to form the decoder sub-network.
Specifically, in step S2.4, a cyclic decoder sub-network based on attention and content perception is constructed as follows:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
Specifically, in step S2.4, for each Conv-GRU layer in the decoder modules D1, D2, D3, the cyclic feature maps T1.2, T1.3, T1.4 from the previous time are received as constraint inputs, and the cyclic feature maps T2.2, T2.3, T2.4 from the present time are output for constraint inputs of the Conv-GRU layer at the corresponding position at the next time.
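A wiring sketch of the decoder data flow in steps S2.4.1 to S2.4.4 follows. The CBAM layers, the decoder modules D1 to D3 and the final convolution group are passed in as ready-made callables; their internals are omitted, and the recurrent inputs of the Conv-GRU layers are assumed to be handled inside the decoder modules.

```python
import torch
import torch.nn.functional as F

def down(x: torch.Tensor, factor: int) -> torch.Tensor:
    """Downsample the low-resolution original image F1.2 by the given factor (steps S2.4.1-S2.4.3)."""
    return F.avg_pool2d(x, kernel_size=factor)

def decode(f1_2, skips, f3, cbams, decoders, out_block):
    """Data flow of steps S2.4.1 to S2.4.4.

    skips     = (F2.1, F2.2, F2.3): skip-connection feature maps from the encoder
    cbams     = three CBAM layers, one per skip connection
    decoders  = decoder modules (D1, D2, D3)
    out_block = the final module of two groups of convolution, normalization and activation layers
    """
    x = f3
    for factor, skip, cbam, dec in zip((8, 4, 2), skips, cbams, decoders):
        x = dec(down(f1_2, factor), cbam(skip), x)  # yields F4.1.3, F4.2.3, F4.3.3 in turn
    return out_block(torch.cat([f1_2, x], dim=1))   # low-resolution alpha map F1.3
```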
Table 1 Ablation experiments
Three ablation experiments were performed in total. Ablation model 1 removes the Mobile ViT modules from the encoder, and ablation model 2 removes not only the Mobile ViT modules but also the CBAM and CARAFE operators in the decoder. The results show that removing either the Mobile ViT modules in the encoder or the CBAM and CARAFE operators in the decoder noticeably degrades the precision of the network model's predicted alpha maps; the more complete the model, the higher the matting precision.
In step S3, training the model with the data set specifically means: first training on the 720p video matting data set, then training on the 1080p video matting data set, and finally training on the picture matting data set. Portrait segmentation training is interspersed among these steps and is performed once after every few rounds of matting training.
Specifically, in step S3, the training is corrected using the loss function. The loss function used is the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
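A PyTorch sketch of this loss is given below. The decomposition into L1, Laplacian pyramid and temporal terms follows the description above; the pyramid construction, the number of pyramid levels and the equal weighting of the three terms are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 5) -> list:
    """Build a Laplacian pyramid by repeated 2x downsampling (the exact construction is assumed;
    the text only states that L^i_pyr denotes the i-th pyramid level)."""
    pyramid = []
    current = x
    for _ in range(levels):
        low = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(low, size=current.shape[-2:], mode="bilinear", align_corners=False)
        pyramid.append(current - up)
        current = low
    return pyramid

def matting_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of L1 loss, Laplacian pyramid loss and temporal continuity loss.

    pred, target: alpha maps of shape (T, 1, H, W), a short sequence of frames.
    """
    l1 = F.l1_loss(pred, target)
    lap = sum(F.l1_loss(p, t) for p, t in zip(laplacian_pyramid(pred), laplacian_pyramid(target)))
    tc = F.mse_loss(pred[1:] - pred[:-1], target[1:] - target[:-1])  # frame-to-frame continuity
    return l1 + lap + tc
```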
In step S4, the input to the model is specifically: the original image F1.1 and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment, where the cyclic feature maps are optional inputs and are not required when processing a single picture.
In step S4, the output of the model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
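This real-time use can be sketched as a simple capture loop that feeds each frame together with the cyclic feature maps from the previous moment back into the model. The exported model file name and its exact call signature are hypothetical; only the input/output contract described above (frame plus previous cyclic feature maps in, alpha map plus new cyclic feature maps out) is taken from the text.

```python
import cv2
import torch

model = torch.jit.load("matting_model.pt").eval()  # hypothetical exported model
rec = [None, None, None, None]                     # cyclic feature maps T1.1-T1.4 (none at the start)

cap = cv2.VideoCapture(0)                          # camera input; a video file path also works
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # HWC uint8 BGR -> NCHW float RGB in [0, 1]
        src = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).permute(2, 0, 1)
        src = src.unsqueeze(0).float() / 255.0
        alpha, *rec = model(src, *rec)              # F1.4 and T2.1-T2.4, fed back on the next frame
        cv2.imshow("alpha", (alpha[0, 0] * 255).byte().numpy())
        if cv2.waitKey(1) == 27:                    # Esc to quit
            break
cap.release()
```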
The model makes full use of the global extraction of image features by the self-attention mechanism and of the improvement in overall model accuracy brought by the content-aware mechanism; it overcomes the insensitivity of conventional CNN networks to relationships between long-range pixels and can accurately distinguish the semantics of foreground and background pixels against complex backgrounds. The decoder's use of the attention and content-aware mechanisms improves the accuracy of the reconstructed image and gives clearer detail, and the deep guided filter increases the speed of processing high-resolution images. The invention therefore has good application value in matting tasks with requirements on both real-time performance and precision.
As shown in fig. 5, this embodiment discloses a real-time matting system based on deep learning, which is configured to execute the above method embodiment, and includes the following modules:
a data set acquisition module: acquiring a matting data set, wherein the matting data set comprises a video matting foreground data set, a video background data set, a picture matting foreground data set, a picture background data set and a portrait segmentation data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
For other content in this embodiment, reference may be made to the above-described method embodiments.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.
Claims (10)
1. A real-time image matting method based on deep learning, characterized by comprising the following steps:
s1: acquiring a matting data set;
s2: constructing a matting network model based on a ViT and CNN mixed structure;
s3: training the model in the step S2 by utilizing a data set, and correcting by a loss function to obtain a trained model;
s4: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
2. The method of claim 1 wherein the data set in step S1 comprises a video matting foreground data set, a video background data set, a picture matting data set, and a portrait segmentation data set.
3. The method of claim 1, wherein in step S2, a matting network model based on a ViT and CNN hybrid structure is constructed, specifically as follows:
s2.1: an original image resampling sub-network is constructed, the original image with higher resolution is subjected to downsampling and then sent to an encoder sub-network for processing, and a low resolution alpha image generated by a decoder network is restored to an original resolution alpha image;
s2.2: constructing a characteristic extraction encoder sub-network based on a ViT and CNN mixed structure, and extracting multi-level characteristics from the original image after downsampling;
s2.3: constructing a bottleneck block sub-network, and connecting an encoder sub-network and a decoder sub-network;
s2.4: a cyclic decoder sub-network based on attention and content perception is constructed, the feature map is subjected to space-time modeling, and a low-resolution alpha map is generated.
4. A method according to claim 3, characterized in that in step S2.1, an original image resampling sub-network is built, comprising in particular the steps of:
s2.1.1: downsampling the high-resolution original image F1.1 through an average pooling operation to obtain a low-resolution original image F1.2, and sending the low-resolution original image F1.2 into the encoder subnetwork in the step S2.2;
s2.1.2: the low-resolution alpha map F1.3 generated by the decoder sub-network in step S2.4 is spliced with the high-resolution original image F1.1 from step S2.1.1 and input into a deep guided filter, so that the alpha map F1.4 at the original resolution is restored.
5. A method according to claim 3, characterized in that in step S2.2, a feature extraction encoder sub-network based on a ViT and CNN hybrid structure is constructed, specifically as follows: 3 Mobile ViT V3 modules are respectively embedded behind the 4th, 6th and 9th inverted residual blocks of MobileNetV3-Large to form the encoder sub-network, and 3 skip-connection feature maps F2.1, F2.2 and F2.3 are led out at the positions of the 2nd inverted residual block, the 1st Mobile ViT block and the 2nd Mobile ViT block of the sub-network; the end of the encoder network outputs feature map F2.4.
6. A method according to claim 3, characterized in that in step S2.3, a bottleneck block sub-network is built, specifically: the convolutional block attention module (CBAM), LR-ASPP, Conv-GRU and the content-aware reassembly of features (CARAFE) up-sampling operator are sequentially connected to form the bottleneck block sub-network; the sub-network accepts the feature map F2.4 as input and outputs the feature map F3.
7. Method according to any of claims 3-6, characterized in that in step S2.4, a cyclic decoder sub-network based on attention and content perception is built, specifically: a convolution layer, a normalization layer, an activation layer, a Conv-GRU layer and a CARAFE up-sampling operator are connected in sequence to form a decoder module, and 3 such decoder modules D1, D2 and D3 are constructed;
and/or, in step S2.4, constructing a cyclic decoder sub-network based on attention and content perception, specifically as follows:
s2.4.1: downsampling the low-resolution original image F1.2 by a factor of 8 to obtain map F4.1.1, passing the skip-connection feature map F2.1 through a CBAM layer to obtain feature map F4.1.2, and then feeding F4.1.1, F4.1.2 and F3 into decoder module D1 to obtain the output feature map F4.1.3;
s2.4.2: downsampling the low-resolution original image F1.2 by a factor of 4 to obtain map F4.2.1, passing the skip-connection feature map F2.2 through a CBAM layer to obtain feature map F4.2.2, and then feeding F4.2.1, F4.2.2 and F4.1.3 into decoder module D2 to obtain the output feature map F4.2.3;
s2.4.3: downsampling the low-resolution original image F1.2 by a factor of 2 to obtain map F4.3.1, passing the skip-connection feature map F2.3 through a CBAM layer to obtain feature map F4.3.2, and then feeding F4.3.1, F4.3.2 and F4.2.3 into decoder module D3 to obtain the output feature map F4.3.3;
s2.4.4: splicing the low-resolution original image F1.2 with the feature map F4.3.3 and feeding the result into a module consisting of two groups of convolution, normalization and activation layers to obtain the low-resolution alpha map F1.3.
8. The method according to any one of claims 1-6, wherein in step S3, the loss function is specifically the sum of the L1 loss, the Laplacian pyramid loss and the temporal continuity loss:
L = L_L1 + L_lap + L_tc
where α_t is the ground-truth alpha map at time t, α̂_t is the alpha map predicted at time t, and L^i_pyr(·) denotes the value of the alpha map at the i-th level of the Laplacian pyramid.
9. The method according to any one of claims 1-6, wherein in step S4, the content input to the model is specifically: the image file to be matted or the video frame F1.1 obtained by the camera, and the cyclic feature maps T1.1, T1.2, T1.3 and T1.4 output by the Conv-GRU layers at the previous moment; if the current moment is the starting moment, the cyclic feature maps need not be input;
and/or, in step S4, the content output by the trained model is specifically: the predicted alpha map F1.4 and the cyclic feature maps T2.1, T2.2, T2.3 and T2.4 output by the Conv-GRU layers at the current moment.
10. A deep learning based real-time matting system for performing a method as claimed in any one of claims 1 to 9, comprising the following modules:
a data set acquisition module: acquiring a matting data set;
the network model building module: constructing a matting network model based on a ViT and CNN mixed structure;
model training module: training the image matting network model by utilizing the data set, and correcting by a loss function to obtain a trained model;
an image matting alpha map acquisition module: inputting the image file to be matted, or a video frame captured by the camera, together with the cyclic feature maps from the previous moment, into the trained model, and acquiring the matted alpha map and the cyclic feature maps for the current moment in real time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311031197.0A CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311031197.0A CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117097853A true CN117097853A (en) | 2023-11-21 |
Family
ID=88778039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311031197.0A Pending CN117097853A (en) | 2023-08-16 | 2023-08-16 | Real-time image matting method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117097853A (en) |
-
2023
- 2023-08-16 CN CN202311031197.0A patent/CN117097853A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351118A (en) * | 2023-12-04 | 2024-01-05 | 江西师范大学 | Lightweight fixed background matting method and system combined with depth information |
CN117351118B (en) * | 2023-12-04 | 2024-02-23 | 江西师范大学 | Lightweight fixed background matting method and system combined with depth information |
Similar Documents
Publication | Title
---|---
CN113362223B (en) | Image super-resolution reconstruction method based on attention mechanism and two-channel network
Zhang et al. | DCSR: Dilated convolutions for single image super-resolution
Wang et al. | ESRGAN: Enhanced super-resolution generative adversarial networks
US10614574B2 (en) | Generating image segmentation data using a multi-branch neural network
CN110415172B (en) | Super-resolution reconstruction method for face area in mixed resolution code stream
WO2023010831A1 (en) | Method, system and apparatus for improving image resolution, and storage medium
CN113139551A (en) | Improved semantic segmentation method based on DeepLabv3+
CN110717921B (en) | Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN112258436A (en) | Training method and device of image processing model, image processing method and model
CN113724136B (en) | Video restoration method, device and medium
CN117097853A (en) | Real-time image matting method and system based on deep learning
CN113902925A (en) | Semantic segmentation method and system based on deep convolutional neural network
CN114723760A (en) | Portrait segmentation model training method and device and portrait segmentation method and device
CN111524060B (en) | System, method, storage medium and device for blurring portrait background in real time
CN112489056A (en) | Real-time human body matting method suitable for mobile terminal
CN113379606A (en) | Face super-resolution method based on pre-training generation model
CN109615576A (en) | Single-frame image super-resolution reconstruction method based on cascaded regression base learning
CN115457266A (en) | High-resolution real-time automatic green screen image matting method and system based on attention mechanism
CN112200817A (en) | Sky region segmentation and special effect processing method, device and equipment based on image
Hua et al. | Dynamic scene deblurring with continuous cross-layer attention transmission
Wan et al. | Progressive convolutional transformer for image restoration
CN111950496B (en) | Mask person identity recognition method
CN116362995A (en) | Tooth image restoration method and system based on standard prior
CN110853040B (en) | Image collaborative segmentation method based on super-resolution reconstruction
CN114219738A (en) | Single-image multi-scale super-resolution reconstruction network structure and method
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |