CN114549574A - Interactive video matting system based on mask propagation network - Google Patents

Interactive video matting system based on mask propagation network

Info

Publication number
CN114549574A
CN114549574A
Authority
CN
China
Prior art keywords
mask
layer
module
frame
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210193688.4A
Other languages
Chinese (zh)
Inventor
沈蓉豪
戴国骏
周文晖
项雷雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210193688.4A priority Critical patent/CN114549574A/en
Publication of CN114549574A publication Critical patent/CN114549574A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an interactive video matting system based on a mask propagation network and feature fusion, which comprises a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion. Compared with existing video matting methods, the system can matte the foreground target of an entire video from only a few clicks or doodles on the foreground target of any single frame, without requiring a trimap for every frame. This greatly reduces the user's workload while achieving the quality of state-of-the-art matting algorithms. The space-time feature fusion module effectively solves the problem of spatio-temporal consistency between video frames and suppresses the artifacts and flicker that moving object details may otherwise produce.

Description

Interactive video matting system based on mask propagation network
Technical Field
The invention relates to the technical field of image processing, in particular to a video matting system based on a mask propagation network and feature fusion.
Background
Image matting is a technology focused on extracting the foreground of an object. Its core idea is to model the image mathematically as a convex combination of a foreground part and a background part weighted by a transparency mask, and to separate the Foreground and Background through the determined transparency mask (alpha matte). The mathematical model is solved from the following formula:
I_z = α_z F_z + (1 - α_z) B_z (1)
where z denotes a pixel with coordinates (x, y) in the image, I_z is the RGB color value of pixel z, F_z is the color value of the foreground at z, B_z is the color value of the background at z, and α_z is the transparency mask value of z, with value range [0, 1]. Solving this formula requires additional supplementary constraints. Common supplementary inputs are trimaps, scribbles, background images, foreground coordinates, etc.
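As an illustration of equation (1), the following minimal NumPy sketch (not part of the patent; the array names are illustrative) composites a foreground over a background with a given alpha matte and notes why the inverse problem is under-constrained.

```python
import numpy as np

# Illustrative shapes: H x W x 3 color images, H x W x 1 alpha matte in [0, 1].
H, W = 4, 4
F = np.random.rand(H, W, 3)          # foreground color F_z
B = np.random.rand(H, W, 3)          # background color B_z
alpha = np.random.rand(H, W, 1)      # transparency mask alpha_z

# Equation (1): I_z = alpha_z * F_z + (1 - alpha_z) * B_z
I = alpha * F + (1.0 - alpha) * B

# Per pixel there are 3 observed values (RGB of I) but 7 unknowns
# (3 for F, 3 for B, 1 for alpha), which is why supplementary input
# such as a trimap or scribbles is needed to solve for alpha.
print(I.shape)  # (4, 4, 3)
```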
Video matting is the task of extracting moving foreground objects from a given video, built on image matting; compared with image matting it introduces two additional challenges.
First, video matting must matte every frame of the video, so each frame would need supplementary input such as a trimap or scribbles; as the amount of data grows, manual annotation becomes prohibitively expensive in time and labor. Researchers therefore strive to weaken the supplementary input and let the network learn the motion of the foreground object so as to predict the object's transparency mask throughout the video; alternatively, some methods take an additional background video or background image as input and compare it with the original video to obtain foreground prior information, avoiding a large amount of manual annotation.
Second, the transparency mask (alpha matte) obtained by video matting needs to be consistent in both time and space. If an image matting algorithm is applied directly to each frame and the results are stitched back into a video, artifacts and flicker inevitably appear on moving objects and fine details. The traditional solution is to estimate the motion of the foreground by finding local or non-local affinities among pixel colors, but the results are often unsatisfactory, especially in complex scenes such as cluttered backgrounds and fast-moving foregrounds. More recent methods use optical flow estimation to predict the motion of the foreground, but optical flow often fails in large semi-transparent regions.
Disclosure of Invention
The invention provides an interactive video matting system based on a mask propagation network and feature fusion. The user only needs to provide a small number of clicks or doodles on any single frame of the video, indicating whether a position belongs to the foreground or the background, to complete matting of all video frames; no trimap is required for each frame. This greatly reduces the workload of video matting, delivers performance comparable to state-of-the-art matting algorithms, and solves the problem of spatio-temporal consistency of foreground objects across frames.
An interactive video matting system based on a mask propagation network comprises a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion.
The cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module.
The interactive target rough segmentation module is used for interacting with an input image. Two interaction modes are supported, clicking and doodling, and the user selects either mode according to the actual situation. A single click or doodle yields foreground target information (an indication map) for the original input image; the indication map is input to an image segmentation network together with the original input image to obtain a preliminary Mask.
The user can optimize the mask by repeatedly clicking or doodling until a sufficiently accurate mask is obtained, and then the mask is sent to the mask time domain propagation module.
The mask time domain propagation module comprises a space-time memory frame reader (memory frame reader) based on an attention mechanism;
the attention-based spatiotemporal memory frame reader comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder).
After obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions. The principle is that the mask of a query frame is predicted from the memory frames already present in the Cache module; the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, and the next frame of the video is taken as the new query frame. This operation is repeated, and propagation stops when the next frame is already a memory frame or is the last frame of the video, at which point the masks of all frames have been obtained.
Specifically, the current interactively annotated frame is used as the memory frame and an adjacent frame as the query frame. The key feature maps of the memory frame and the query frame are matched, the value feature map of the memory frame is multiplied by the weights generated by the key feature matching, the result is concatenated with the value feature map of the query frame and sent to the mask decoder (query decoder) for decoding, and the mask of the query frame is finally predicted.
The subdivision module based on space-time feature fusion comprises a subdivision encoder, a subdivision decoder, an ASPP (atrous spatial pyramid pooling) atrous convolution pooling pyramid, a space-time feature fusion module and a gradual refinement module.
The fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask (alpha matte) according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
The subdivision module based on space-time feature fusion performs the following operations on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction. The encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. At the same time, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. In addition, the outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, so that the matting result is progressively refined during upward decoding and the accurate transparency mask (alpha matte) is finally obtained.
A use method of an interactive video matting system based on a mask propagation network comprises the following steps:
Step (1), caching the video to be processed as individual video frames through the cache module so as to obtain the original image of each frame;
step (2), carrying out coarse segmentation on a foreground target in an original input image through an interactive target coarse segmentation module, thereby extracting a foreground target Mask (Mask);
A user selects the original image of any frame from the cache module as the original input image and obtains the corresponding indication map by clicking or doodling on the foreground object. The indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network. The image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask.
The user can optimize the mask by repeatedly clicking or doodling until a sufficiently accurate mask is obtained, and then the mask is sent to the mask time domain propagation module.
Step (3), obtaining the masks of all frames in the video through the mask time domain propagation module.
After obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions. The principle is that the mask of a query frame is predicted from the memory frames already present in the Cache module; the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, and the next frame of the video is taken as the new query frame. This operation is repeated, and propagation stops when the next frame is already a memory frame or is the last frame of the video, at which point the masks of all frames have been obtained.
Step (4), the user judges whether the obtained masks of all frames are satisfactory;
If the user is not satisfied with the masks of some frames, the original image corresponding to an unsatisfactory mask is selected as the original input image, and steps (2) and (3) are repeated until satisfactory masks of all frames are obtained.
Step (5), once satisfactory masks of all frames are obtained, the accurate transparency mask is predicted through the fine segmentation module based on space-time feature fusion.
The fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask according to all video frame masks output by the mask temporal propagation module and the original input image of each frame of the video stored in the cache module.
Through the subdivision module based on space-time feature fusion, the following operations are performed on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction. The encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. At the same time, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. The outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, and the progressively refined matting result finally yields the accurate transparency mask (alpha matte).
The invention has the following beneficial effects:
Compared with existing video matting methods, the invention can matte the foreground target of an entire video from only a few clicks or doodles on the foreground target of any single frame, without requiring a trimap for every frame. This greatly reduces the user's workload while achieving the quality of state-of-the-art matting algorithms. The space-time feature fusion module effectively solves the problem of spatio-temporal consistency between video frames and suppresses the artifacts and flicker that moving object details may otherwise produce.
Drawings
FIG. 1 is a flow diagram of the overall interactive video matting method and system;
FIG. 2 is a flow chart of a spatiotemporal memory frame reader;
FIG. 3 is a flow diagram of a mask time domain propagation module;
FIG. 4 is a diagram of a feature fusion network module architecture;
FIG. 5 is a block diagram of a subdivided network based spatiotemporal feature aggregation module;
fig. 6 shows a matting result of an embodiment of the invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings so that the advantages and features of the invention can be more easily understood by those skilled in the art, and the scope of the invention will be more clearly defined.
As shown in fig. 1, an interactive video matting system based on a mask propagation network includes a cache module, an interactive image rough segmentation module, a mask time domain propagation module, and a fine segmentation module based on spatio-temporal feature fusion:
firstly, a cache module:
the cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module.
Secondly, an interactive target rough segmentation module:
as shown in the upper half of fig. 1, in this module a user selects the original input image of any frame from the cache module and obtains the corresponding indication map by clicking or doodling on the foreground object. The indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network. The image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask.
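For illustration only, the following sketch shows one way the single-channel indication map and the four-channel input described above could be assembled; the function name and the assumption that interactions arrive as pixel coordinates are not from the patent (the embodiment below additionally uses mask and positive/negative doodle channels).

```python
import torch

def build_four_channel_input(rgb, interaction_pixels):
    """rgb: (3, H, W) float tensor; interaction_pixels: list of (y, x) clicked/doodled points."""
    _, H, W = rgb.shape
    indication = torch.zeros(1, H, W)           # single-channel indication map
    for y, x in interaction_pixels:
        indication[0, y, x] = 1.0               # clicked/doodled pixels set to 1
    return torch.cat([rgb, indication], dim=0)  # (4, H, W) input to the segmentation network

# Example: a frame with two clicks on the foreground object.
frame = torch.rand(3, 512, 512)
x = build_four_channel_input(frame, [(100, 200), (250, 260)])
print(x.shape)  # torch.Size([4, 512, 512])
```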
In the embodiment of the invention, a graphical user interface (GUI) is established through which the user interacts with the input image. The user can choose either clicking or doodling to generate the single-channel indication map, and can click or doodle multiple times to adjust the generated foreground target mask.
In the embodiment of the invention, a DeepLabV3+ network is adopted as the backbone of the image segmentation network. The network receives a six-channel input: three channels are the RGB image, one channel is a mask, and two channels are positive and negative doodle maps. The mask has two cases: it is empty during the initial interaction, and it is a single-channel image containing the erroneous region when the generated foreground target mask is being adjusted.
In the embodiment of the invention, the DeepLabV3+ network is trained on the public PASCAL VOC 2012 segmentation dataset. To let the DeepLabV3+ network learn the interaction pattern of user doodling, training data of user doodle interaction would have to be collected, which brings an enormous workload. Therefore the probability that the mask is empty is randomly set to 0.5; when the mask is not empty, the real transparency mask (ground truth alpha matte) provided by the public dataset is eroded and dilated to obtain the training mask, and a thinning or random Bezier curve strategy is then used to generate the corresponding input doodles for the erroneous regions of the mask, simulating user doodling.
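A hedged sketch of this coarse-mask simulation using OpenCV erosion and dilation on a ground-truth alpha matte; the kernel-size range, the 0.5 binarization threshold and the surrounding structure are illustrative assumptions rather than the patent's exact recipe.

```python
import cv2
import numpy as np

def simulate_coarse_mask(gt_alpha, p_empty=0.5):
    """gt_alpha: (H, W) float array in [0, 1] from the public dataset."""
    if np.random.rand() < p_empty:
        return np.zeros_like(gt_alpha, dtype=np.uint8)    # empty mask (initial interaction case)
    binary = (gt_alpha > 0.5).astype(np.uint8)
    k = int(np.random.randint(5, 30))                     # random structuring-element size (assumed range)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    # Randomly erode or dilate so the training mask contains erroneous regions.
    if np.random.rand() < 0.5:
        return cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(binary, kernel, iterations=1)

# The error region (mask XOR ground truth) would then be covered with
# thinned skeletons or random Bezier curves to imitate user doodles.
```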
Thirdly, a mask time domain propagation module:
the mask time domain propagation module comprises a space-time memory frame reader (memory frame reader) based on an attention mechanism;
the attention-based spatiotemporal memory frame reader comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder).
From the original input image F_i selected in the interactive target rough segmentation module and the generated mask M_i, the corresponding masks of all remaining video frames are predicted.
In the spatio-temporal memory frame reader, the video is typically processed frame by frame starting from the second frame; video frames that already have a target mask are regarded as memory frames, and the current frame without a mask is regarded as the query frame.
The space-time memory frame reader based on the attention mechanism comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder). The memory encoder and the query encoder both adopt ResNet50 as a backbone network, and use the characteristic diagram of stage-4(res4) of ResNet50 as a basic characteristic diagram of a calculated key value characteristic diagram.
For the input part, the memory encoder adds an extra input channel in the first convolution layer; its inputs are the image and the Mask, while the query encoder takes only the image as input.
As shown in fig. 2, two convolution layers are added at the ends of the memory encoder and the query encoder to generate two feature maps, a Key Map and a Value Map, which are used to compute the similarity of key features between the query frame and the memory frame. The key map and the value map are denoted by K ∈ R^(C_k×HW) and V ∈ R^(C_v×HW), respectively, where HW represents the original spatial size and C_k and C_v are set to 128 and 512, respectively.
As can be seen from FIG. 2, for each of the T memory frames the spatio-temporal memory frame reader computes its key and value feature maps by convolution and concatenates the outputs into a memory key map K^M and a memory value map V^M. The query key map K^Q is then matched against the memory key map K^M by dot product, according to the following formula:
F = (K^M)^T K^Q (2)
where F ∈ R^(THW×HW) represents the affinity between query and memory locations.
The spatio-temporal memory read operation first measures the similarity of all pixels between the query key map and the memory key map to compute the weights of V^M; V^M is multiplied by these weights, combined with V^Q, and the result is input to the mask decoder.
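The following PyTorch sketch illustrates the read operation around equation (2): dot-product matching of memory and query key maps, weighting of the memory values, and combination with the query values before decoding. The softmax normalization, the tensor layout and the use of concatenation (the description mentions both adding and concatenating with V^Q) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def memory_read(k_m, v_m, k_q, v_q):
    """
    k_m: (B, Ck, T*H*W) memory key map,  v_m: (B, Cv, T*H*W) memory value map,
    k_q: (B, Ck, H*W) query key map,     v_q: (B, Cv, H*W) query value map.
    """
    # Equation (2): affinity F = (K^M)^T K^Q, shape (B, T*H*W, H*W)
    affinity = torch.bmm(k_m.transpose(1, 2), k_q)
    weights = F.softmax(affinity, dim=1)           # weights over memory locations (assumed softmax)
    read = torch.bmm(v_m, weights)                 # (B, Cv, H*W): weighted memory values
    return torch.cat([read, v_q], dim=1)           # combined with query values, fed to the mask decoder

B, Ck, Cv, T, HW = 1, 128, 512, 2, 24 * 24
out = memory_read(torch.rand(B, Ck, T * HW), torch.rand(B, Cv, T * HW),
                  torch.rand(B, Ck, HW), torch.rand(B, Cv, HW))
print(out.shape)  # torch.Size([1, 1024, 576])
```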
After the mask decoder receives the output of the space-time memory read operation, it reconstructs the target mask of the query frame. The mask refinement network proposed by Facebook is used as the building block: the output of the space-time memory read operation is first compressed to 256 channels by a convolution layer and a residual block, and the compressed output is then gradually upscaled by three mask refinement modules, each doubling the resolution; the mask refinement module of each stage receives, via a skip connection to the query encoder, the output of the previous stage and the corresponding feature map. The output of the last mask refinement module is fed into convolution layers for reconstructing the target mask; each convolution layer of the decoder uses 3 × 3 convolution filters producing 256-channel outputs, and the last convolution layer outputs a predicted mask at 1/4 of the original image resolution.
The steps of the embodiment of the invention mainly comprise:
1) The mask extracted by interactive segmentation and the single-frame original image are taken as the memory frame, and the adjacent frame to be predicted is taken as the query frame.
2) A convolution operation is performed on the query frame to obtain its key feature map K^Q and value feature map V^Q.
3) The similarity between the key feature map K^Q of the query frame and the key feature map K^M of the memory frame is computed and then multiplied with the value feature map V^M of the memory frame to obtain an aligned value feature map.
4) The aligned value feature map is added to the value feature map V^Q of the query frame and decoded by the decoder to obtain the Mask of the query frame.
5) The query frame is put into the Cache module as a memory frame, and prediction continues with the next frame.
6) The above operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops.
The space-time memory frame reader implements the derivation of a query-frame mask from the memory frames; to derive the masks of all video frames from the mask of a single frame, a corresponding mask propagation strategy must be specified.
As shown in FIG. 3, the original input image F_i and the mask M_i obtained by the interactive target rough segmentation module are taken as the reference, and are propagated to the other frames in both the forward and backward directions along the time domain dimension. In each direction, the following strategy is followed:
Each time the current frame propagates to the next frame in that direction, the query frame whose mask has been predicted is marked as a memory frame and stored in the Cache module, and propagation continues until the next frame is already a memory frame or is the last frame of the video.
Through the mask time domain propagation module, a user can obtain masks corresponding to all video frames.
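A schematic sketch of this bidirectional propagation strategy; predict_mask stands in for the attention-based space-time memory frame reader and is a hypothetical callable, not an API defined by the patent.

```python
def propagate_masks(frames, start_idx, start_mask, predict_mask):
    """
    frames: list of video frames; start_idx: index of the interactively annotated frame;
    start_mask: its mask; predict_mask(memory, query_frame) -> mask is a user-supplied model call.
    """
    masks = {start_idx: start_mask}            # cache of memory frames (index -> mask)
    for step in (+1, -1):                      # forward and backward time-domain directions
        i = start_idx
        while 0 <= i + step < len(frames):
            nxt = i + step
            if nxt in masks:                   # next frame is already a memory frame: stop
                break
            memory = [(frames[j], masks[j]) for j in masks]
            masks[nxt] = predict_mask(memory, frames[nxt])   # query frame gets a predicted mask
            i = nxt                            # the query frame now joins the memory
    return [masks[i] for i in range(len(frames))]
```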
Fourthly, a fine segmentation module based on space-time feature fusion:
the fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask (alpha matte) according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
As can be seen from fig. 5, the subdivision module based on spatio-temporal feature fusion includes a subdivision encoder, a subdivision decoder, an ASPP atrous convolution pooling pyramid, a spatio-temporal feature fusion module, and a gradual refinement module.
Three consecutive frame images F_{r-1}, F_r, F_{r+1} in the cache module and their corresponding masks M_{r-1}, M_r, M_{r+1} are concatenated into three groups of 4-channel inputs and fed separately into the subdivision encoder to extract three groups of depth feature maps Fea_{r-1}, Fea_r, Fea_{r+1} at different scales. The features at the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the result is then output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. The three groups of feature maps Fea_{r-1}, Fea_r, Fea_{r+1} are also passed through skip connections to the space-time feature fusion module for feature alignment and feature fusion; the aligned and fused features of different scales are then input to the corresponding levels of the subdivision decoder and added to the feature map decoded by the level above for decoding of the current level, providing guidance information for eliminating erroneous predictions. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. Gradual refinement modules are introduced at the second, third and fourth layers of the subdivision decoder respectively, so that the transparency mask prediction is progressively refined during upward decoding.
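A minimal sketch of how the three 4-channel groups described above could be formed; the tensor shapes and the function name are illustrative, and the encoder / ASPP / fusion / decoder modules themselves are only referenced in comments.

```python
import torch

def build_groups(frames, masks, r):
    """frames, masks: lists of (3, H, W) and (1, H, W) tensors; r: index of the center frame."""
    groups = []
    for i in (r - 1, r, r + 1):
        groups.append(torch.cat([frames[i], masks[i]], dim=0))   # 4-channel input F_i ++ M_i
    return torch.stack(groups)   # (3, 4, H, W): fed separately into the subdivision encoder

frames = [torch.rand(3, 512, 512) for _ in range(3)]
masks = [torch.rand(1, 512, 512) for _ in range(3)]
x = build_groups(frames, masks, 1)
print(x.shape)  # torch.Size([3, 4, 512, 512])
# Each group is encoded to multi-scale features Fea_{r-1}, Fea_r, Fea_{r+1};
# the bottom features go through the ASPP pyramid, while the per-level features
# go through skip connections into the space-time feature fusion module.
```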
The ASPP atrous convolution pooling pyramid module in the embodiment of the invention mainly follows the practice adopted in DeepLabV3+: atrous convolutions with different sampling rates capture semantic information at different scales and fuse it. The specific principle is not repeated here.
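A compact ASPP sketch in the spirit of DeepLabV3+, shown for reference; the dilation rates and the channel width are common defaults and are not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),       # image-level pooling branch
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))  # multi-scale fusion

y = ASPP(512)(torch.rand(1, 512, 16, 16))
print(y.shape)  # torch.Size([1, 256, 16, 16])
```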
Next, the subdivision encoder, the subdivision decoder, the spatio-temporal feature fusion module and the gradual refinement module are introduced.
(1) A subdivision encoder and a subdivision decoder:
as shown in FIG. 5, the subdivision encoder and decoder network uses a custom U-Net structure. At the input of the subdivision encoder, the four-channel input S0 ∈ R^(4×512×512) is composed of the RGB image and the guide map; the number of channels is 4 and the size is set to 512 × 512 according to the input size, because data loading for the network usually crops the input image. The input is passed through two convolution layers to obtain a 2× downsampled feature map S1 ∈ R^(32×256×256); a spectral normalization operation and batch normalization are applied after each convolution layer, the purpose being to impose a Lipschitz constant constraint on the network so that training is more stable. The feature S2 ∈ R^(64×128×128) is then obtained through the convolution of the second layer and the first residual block Res1, the feature S3 ∈ R^(128×64×64) through the second residual block Res2 of the third layer, and the 16× downsampled feature map S4 ∈ R^(256×32×32) and the 32× downsampled map S5 ∈ R^(512×16×16) through the third residual block Res3 of the fourth layer and the fourth residual block Res4 of the fifth layer, respectively.
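A sketch of one encoder convolution stage with spectral normalization followed by batch normalization, as described above; the channel counts match the S0→S1 stage, but the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

def sn_conv_block(in_ch, out_ch, stride=1):
    """Conv with a Lipschitz-constraining spectral norm, then batch norm and ReLU."""
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

# S0 (4 x 512 x 512) -> two conv layers -> S1 (32 x 256 x 256), a 2x downsampling.
stage1 = nn.Sequential(sn_conv_block(4, 32, stride=2), sn_conv_block(32, 32))
s1 = stage1(torch.rand(1, 4, 512, 512))
print(s1.shape)  # torch.Size([1, 32, 256, 256])
```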
In the sub-segmentation decoder part, as shown in the right side of fig. 5, the feature map decoded by each layer is combined with the feature output by the space-time feature fusion module of the corresponding layer, and then is up-sampled and decoded. In addition, the second layer, the third layer and the fifth layer can predict transparency masks with different scales through convolution. These predicted transparency masks are used together with the prediction of the next level as input to the step-by-step refinement module to derive the transparency mask of the next level.
(2) Spatio-temporal feature fusion module
As shown in fig. 4, the spatio-temporal feature fusion module includes a spatio-temporal feature alignment module and a spatio-temporal feature aggregation module. The spatio-temporal feature alignment module consists of two 3 × 3 ordinary convolutions and one 3 × 3 deformable convolution, and the spatio-temporal feature aggregation module consists of a channel attention network, a spatial attention network and a global convolution network connected in series.
In this embodiment, the features Fea_{r-1} and Fea_r of two adjacent frames are input into the spatio-temporal feature alignment module to obtain the feature of Fea_{r-1} aligned to Fea_r; after the same operation is applied to Fea_r and Fea_{r+1}, the feature of Fea_{r+1} aligned to Fea_r is obtained. Feature alignment is thus performed with the features of 3 adjacent frames as one group, yielding two features aligned to Fea_r.
The two aligned features are concatenated and input into the spatio-temporal feature aggregation module: the input features are first weighted per channel by the channel attention network, their spatial information is then weighted by the spatial attention network, and the features are finally output after a global convolution.
The feature alignment described above relies on the principle of deformable convolution, which applies learned offsets to the input features so that the features can be aligned. For the image frame I_t at time t, if the offset of a pixel p is denoted Δp, the aligned feature F* can be expressed as:
F*(p) = Σ_k w_k F_t(p + p_k + Δp_k) (3)
where k indexes the positions p_k of the deformable convolution kernel, w_k is the weight at that position, and Δp_k is the learned offset of the feature between times t and t + Δt. Aligning features by learning such offsets lets the model automatically match the same or similar regions and pixels, and also encodes temporal information into the aligned features.
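A sketch of offset-based alignment using torchvision's DeformConv2d; predicting the offsets Δp_k from the concatenated neighboring and reference features is a common design and is assumed here, since the text only states that two ordinary convolutions precede the deformable convolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignBlock(nn.Module):
    """Aligns the feature of a neighboring frame to the reference frame via learned offsets."""
    def __init__(self, ch, k=3):
        super().__init__()
        # Offsets delta_p_k: 2 values (x, y) per kernel position, predicted from both features.
        self.offset_pred = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * k * k, 3, padding=1))
        self.deform = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, fea_neighbor, fea_ref):
        offset = self.offset_pred(torch.cat([fea_neighbor, fea_ref], dim=1))
        return self.deform(fea_neighbor, offset)   # equation (3): weighted sum at offset positions

aligned = AlignBlock(64)(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))
print(aligned.shape)  # torch.Size([1, 64, 64, 64])
```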
The spatio-temporal feature aggregation module processes the aligned features with an attention mechanism, guiding the model to exploit the importance of different channels and the regions of interest on a given channel, i.e. along both the channel dimension and the spatial dimension. The channel attention network applies a global average pooling layer to the input features followed by a fully connected layer to compute a channel attention weight map, which is multiplied with the aligned features. The spatial attention network obtains two 1 × H × W features from the input features through a global average pooling layer and a global max pooling layer; after concatenation, these are passed through a convolution and a sigmoid activation layer to reduce the channels and output a 1 × H × W spatial attention weight map, which is multiplied with the original input features to obtain the spatially weighted feature map. A 1 × 1 convolution layer then reduces the number of channels, and the global convolution network enlarges the receptive field.
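A sketch of the aggregation path (channel attention, then spatial attention, then a channel-reducing 1 × 1 convolution and a global convolution); representing the global convolution network by a single large-kernel convolution is an assumption about its form.

```python
import torch
import torch.nn as nn

class Aggregate(nn.Module):
    def __init__(self, ch, out_ch):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(inplace=True),
                                nn.Linear(ch // 4, ch), nn.Sigmoid())   # channel attention weights
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())  # 1xHxW weight map
        self.reduce = nn.Conv2d(ch, out_ch, 1)                  # 1x1 conv reduces channels
        self.global_conv = nn.Conv2d(out_ch, out_ch, 7, padding=3)  # stands in for the global conv network

    def forward(self, x):
        b, c, h, w = x.shape
        w_ch = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)     # global average pooling + FC
        x = x * w_ch
        stats = torch.cat([x.mean(dim=1, keepdim=True),         # 1xHxW average map
                           x.amax(dim=1, keepdim=True)], dim=1) # 1xHxW max map
        x = x * self.spatial(stats)                             # apply spatial attention
        return self.global_conv(self.reduce(x))

y = Aggregate(128, 64)(torch.rand(1, 128, 64, 64))
print(y.shape)  # torch.Size([1, 64, 64, 64])
```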
(3) Gradual refining module
The purpose of this module is to use the high-level features to identify the internal region of the object while using foreground information near the boundary depicted by the low-level features to improve the final matting result.
The principle is as follows:
Assume the current decoder layer is l. The gradual refinement module first upsamples the transparency mask α_{l-1} output by decoder layer l-1 to the size of the current layer's prediction α_l. A self-guidance map g_l(x, y) is then obtained using the following formula:
g_l(x, y) = 1 if 0 < α_{l-1}(x, y) < 1, and g_l(x, y) = 0 otherwise (4)
As shown in equation (4), the self-guidance map represents the unknown region of the previous layer's predicted mask; it is a single-channel map whose pixel values are 0 or 1, where 1 denotes the unknown region and 0 denotes the determined region. The mask α_l of the current level is blended with the mask α_{l-1} of the previous level through the self-guidance map, so that both high-level and low-level information is used. The formula is as follows:
α_l = α_l g_l + α_{l-1} (1 - g_l) (5)
after progressive refinement modules are applied to the 2 nd layer, the 3 rd layer and the 5 th layer of the decoder respectively, the unknown region represented by the bootstrap map is also reduced along with upward decoding of the features, and the predicted transparency mask is also progressively refined.
The above process predicts the transparency mask of a single video frame; applying the space-time feature fusion subdivision module to every original frame and its corresponding mask yields the fine matting result for the whole video.
Fig. 6 shows a matting result of the embodiment of the invention.
The subdivision module based on spatio-temporal feature fusion in the embodiment of the invention is trained on the public video dataset provided by Deep Video Matting, which contains 408 foreground pictures, 87 foreground videos and 6659 background videos in total. We select 48 video foregrounds and 231 picture foregrounds, randomly select 15 backgrounds from the 6659 background videos, and synthesize 4185 training samples as the training set. 248 video samples are likewise selected as the validation set.

Claims (9)

1. An interactive video matting system based on a mask propagation network is characterized by comprising a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion;
the cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module;
the interactive target rough segmentation module is used for interacting with an input image; the interaction comprises two interaction modes, clicking and doodling, and the user selects either mode according to the actual situation; foreground target information of the original input image, namely an indication map, is obtained through a single click or doodle, and the indication map is input to an image segmentation network together with the original input image to obtain a preliminary mask;
the user can optimize the mask by repeatedly clicking or doodling until the mask which is accurate enough is obtained, and then the mask is sent to the mask time domain propagation module;
the mask time domain propagation module comprises a space-time memory frame reader based on an attention mechanism; the attention-based spatiotemporal memory frame reader comprises a memory encoder, a query encoder and a mask decoder;
after obtaining the mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions; specifically, the mask of a query frame is predicted from the memory frames already present in the cache module, the query frame with its predicted mask is then marked as a memory frame and stored in the cache module, the next frame of the video is taken as the new query frame, and this operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops and the masks of all frames have been obtained;
the subdivision module based on space-time feature fusion comprises a subdivision encoder, a subdivision decoder, an ASPP atrous convolution pooling pyramid, a space-time feature fusion module and a gradual refinement module;
the fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
2. The interactive video matting system based on the mask propagation network as recited in claim 1, wherein the specific propagation mode is to use the current interactive frame as a memory frame and the adjacent frame as a query frame, match the memory frame with the key feature map of the query frame, multiply the value feature map of the memory frame by the weight generated by the key feature matching, finally connect the value feature map of the query frame and send it to the mask decoder for decoding, and finally predict the mask of the query frame.
3. The interactive video matting system based on the mask propagation network as claimed in claim 2, wherein the subdivision module based on spatio-temporal feature fusion performs the following operations on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction; the encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding; meanwhile, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level; the feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward; in addition, the outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, so that the matting result is progressively refined during upward decoding and the accurate transparency mask is finally obtained.
4. The interactive video matting system based on the mask propagation network according to claim 1, 2 or 3, wherein the image segmentation network of the interactive target rough segmentation module adopts a DeepLabV3+ network as backbone; the network accepts a six-channel input, in which three channels are the RGB image, one channel is a mask, and two channels are positive and negative doodle maps; the mask has two cases: it is empty at the initial interaction, and it is a single-channel image containing the erroneous region when the generated foreground target mask is adjusted.
5. The interactive video matting system based on mask propagation network as claimed in claim 4, wherein the memory encoder and the query encoder both adopt ResNet50 as backbone network, and use the stage-4 feature map of ResNet50 as a basic feature map of the calculated key-value feature map; for the input part, the memory encoder adds an additional input channel in the first convolution layer, the input of the memory encoder is an image and a mask, and the input of the query encoder is only an image;
two convolution layers are added at the ends of the memory encoder and the query encoder to generate a key map and a value map, respectively, for computing the similarity of key features between the query frame and the memory frame; the key map and the value map are denoted by K ∈ R^(C_k×HW) and V ∈ R^(C_v×HW), respectively, where HW represents the original spatial size and C_k and C_v are set to 128 and 512, respectively;
for each of the T memory frames, the spatio-temporal memory frame reader computes its key and value feature maps by convolution operations and concatenates the outputs into a memory key map K^M and a memory value map V^M; the query key map K^Q is then matched against the memory key map K^M by dot product, according to the following formula:
F = (K^M)^T K^Q (2)
where F ∈ R^(THW×HW) represents the affinity between query and memory locations;
the spatio-temporal memory read operation first measures the similarity of all pixels between the query key map and the memory key map to compute the weights of V^M; V^M is multiplied by these weights, added to V^Q, and the result is input to the mask decoder;
after the mask decoder obtains the output of the space-time memory read operation, it reconstructs the target mask of the query frame; the mask refinement network proposed by Facebook is used as the building block: the output of the space-time memory read operation is compressed to 256 channels by a convolution layer and a residual block, the compressed output is then gradually upscaled by three mask refinement modules, each doubling the resolution, and the mask refinement module of each stage receives, via a skip connection to the query encoder, the output of the previous stage and the corresponding feature map; the output of the last mask refinement module is fed into convolution layers for reconstructing the target mask, each convolution layer of the decoder uses 3 × 3 convolution filters producing 256-channel outputs, and the last convolution layer outputs a predicted mask at 1/4 of the original image resolution.
6. The interactive video matting system based on the mask propagation network as claimed in claim 5, wherein the subdivision encoder and decoder network uses a custom U-Net structure; at the input of the subdivision encoder, the four-channel input S0 ∈ R^(4×512×512) is composed of the RGB image and the guide map, the number of channels is 4, and the size is set to 512 × 512 according to the input size; the input is passed through two convolution layers to obtain a 2× downsampled feature map S1 ∈ R^(32×256×256); a spectral normalization operation and batch normalization are applied after each convolution layer, the purpose being to impose a Lipschitz constant constraint on the network so that training is more stable; the feature S2 ∈ R^(64×128×128) is then obtained through the convolution of the second layer and the first residual block Res1, the feature S3 ∈ R^(128×64×64) through the second residual block Res2 of the third layer, and the 16× downsampled feature map S4 ∈ R^(256×32×32) and the 32× downsampled map S5 ∈ R^(512×16×16) through the third residual block Res3 of the fourth layer and the fourth residual block Res4 of the fifth layer, respectively;
In the subdivision decoder part, the feature graph decoded by each layer is combined with the feature output by the space-time feature fusion module of the corresponding layer, and then the feature graph is up-sampled and decoded; in addition, the second layer, the third layer and the fifth layer can predict transparency masks with different scales through convolution; these predicted transparency masks are used together with the prediction of the next level as input to the step-by-step refinement module to derive the transparency mask of the next level.
7. The interactive video matting system based on the mask propagation network as claimed in claim 6, wherein the spatio-temporal feature fusion module includes a spatio-temporal feature alignment module and a spatio-temporal feature aggregation module, wherein the spatio-temporal feature alignment module is composed of two 3 × 3 ordinary convolutions and one 3 × 3 deformable convolution, and the spatio-temporal feature aggregation module is composed of a channel attention network, a spatial attention network and a global convolution network connected in series;
the features Fea_{r-1} and Fea_r of two adjacent frames are input into the spatio-temporal feature alignment module to obtain the feature of Fea_{r-1} aligned to Fea_r; after the same operation is applied to Fea_r and Fea_{r+1}, the feature of Fea_{r+1} aligned to Fea_r is obtained; feature alignment is performed with the features of 3 adjacent frames as one group, yielding two features aligned to Fea_r;
the two aligned features are concatenated and input into the spatio-temporal feature aggregation module; the input features are weighted per channel by the channel attention network, their spatial information is then weighted by the spatial attention network, and the features are finally output after a global convolution;
the feature alignment described above relies on the principle of deformable convolution, which applies learned offsets to the input features so that the features can be aligned; for the image frame I_t at time t, if the offset of a pixel p is denoted Δp, the aligned feature F* is expressed as:
F*(p) = Σ_k w_k F_t(p + p_k + Δp_k) (3)
where k indexes the positions p_k of the deformable convolution kernel, w_k is the weight at that position, and Δp_k is the learned offset of the feature between times t and t + Δt;
the space-time feature aggregation module processes the aligned features with an attention mechanism, guiding the model to exploit the importance of different channels and the regions of interest on a given channel, i.e. along both the channel dimension and the spatial dimension; the channel attention network applies a global average pooling layer to the input features followed by a fully connected layer to compute a channel attention weight map, which is multiplied with the aligned features; the spatial attention network obtains two 1 × H × W features from the input features through a global average pooling layer and a global max pooling layer; after concatenation these are passed through a convolution and a sigmoid activation layer to reduce the channels and output a 1 × H × W spatial attention weight map, which is multiplied with the original input features to obtain the spatially weighted feature map; a 1 × 1 convolution layer then reduces the number of channels, and the global convolution network enlarges the receptive field.
8. The interactive video matting system based on the mask propagation network as claimed in claim 7, wherein the gradual refinement module uses the high-level features to identify the internal region of the object while using foreground information near the boundary depicted by the low-level features to improve the final matting result;
the principle is as follows:
assume the current decoder layer is l; the gradual refinement module first upsamples the transparency mask α_{l-1} output by decoder layer l-1 to the size of the current layer's prediction α_l; a self-guidance map g_l(x, y) is then obtained using the following formula:
g_l(x, y) = 1 if 0 < α_{l-1}(x, y) < 1, and g_l(x, y) = 0 otherwise (4)
as shown in equation (4), the self-guidance map represents the unknown region of the previous layer's predicted mask; it is a single-channel map whose pixel values are 0 or 1, where 1 denotes the unknown region and 0 denotes the determined region; the mask α_l of the current level is blended with the mask α_{l-1} of the previous level through the self-guidance map, so that both high-level and low-level information is used; the formula is as follows:
α_l = α_l g_l + α_{l-1} (1 - g_l) (5)
after gradual refinement modules are applied to the 2nd, 3rd and 5th layers of the decoder respectively, the unknown region represented by the self-guidance map shrinks as the features are decoded upward, and the predicted transparency mask is progressively refined.
9. A use method of an interactive video matting system based on a mask propagation network is characterized by comprising the following steps:
step (1), caching the video to be processed as individual video frames through the cache module so as to obtain the original input image of each frame;
step (2), carrying out coarse segmentation on a foreground target in an original input image through an interactive target coarse segmentation module, thereby extracting a foreground target Mask (Mask);
a user selects the original input image of any frame from the cache module and obtains the corresponding indication map by clicking or doodling on the foreground object; the indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network; the image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask;
the user can optimize the mask by repeatedly clicking or doodling until the mask which is accurate enough is obtained, and then the mask is sent to the mask time domain propagation module;
step (3), obtaining masks of all frames in the video through a mask time domain propagation module;
after obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions; specifically, the mask of a query frame is predicted from the memory frames already present in the Cache module, the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, the next frame of the video is taken as the new query frame, and this operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops and the masks of all frames have been obtained;
step (4), judging whether the user is satisfied with the obtained masks of all frames;
if the user is not satisfied with the obtained masks of all frames, the original image corresponding to an unsatisfactory mask is selected as the original input image, and the masks of all frames are obtained again through steps (2) and (3), until the user obtains satisfactory masks for all frames;
step (5), when the user has obtained satisfactory masks for all frames, predicting an accurate transparency mask through the fine segmentation module based on space-time feature fusion;
the fine segmentation module based on space-time feature fusion predicts an accurate transparency mask according to the masks of all video frames output by the mask time domain propagation module and the original input image of each frame of the video stored in the cache module;
through the fine segmentation module based on space-time feature fusion, the following operations are performed on each original frame F_i of the video: F_i and the two adjacent original frames F_{i-1}, F_{i+1}, together with the corresponding masks M_i, M_{i-1}, M_{i+1}, form three groups of four-channel input data, which are passed into the fine segmentation encoder for multi-level feature extraction; the bottom-layer encoding features of the fine segmentation encoder are input to the ASPP (atrous spatial pyramid pooling) module for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the fine segmentation decoder for layer-by-layer upward decoding; meanwhile, each layer of the fine segmentation encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module outputs the aligned and fused feature map through a skip connection to the corresponding level of the fine segmentation decoder, where it is added to the feature map decoded by the previous level before the current level is decoded; the features of the last decoding layer of the fine segmentation decoder are those obtained by outputting the ASPP features to the bottom layer of the decoder and then decoding upward layer by layer; the outputs of the 2nd, 3rd and 5th layers of the fine segmentation decoder are each connected to a progressive refinement module, and the progressively refined matting result finally yields the accurate transparency mask (alpha matte); a sketch of this forward pass is given after this claim.
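As a purely illustrative aid to step (2), the sketch below shows how a single-channel indication map built from user clicks or scribbles could be concatenated with an RGB frame to form the four-channel input described above; the function name, the NumPy representation, and the fixed click radius are assumptions, not part of the claimed system.

```python
from typing import Iterable, Tuple
import numpy as np

def build_interaction_input(frame_rgb: np.ndarray,
                            clicks: Iterable[Tuple[int, int]],
                            radius: int = 5) -> np.ndarray:
    """Hypothetical helper: frame_rgb is H x W x 3 (float in [0, 1]);
    clicks is a sequence of (row, col) positions marked by the user."""
    h, w, _ = frame_rgb.shape
    indication = np.zeros((h, w), dtype=np.float32)
    for r, c in clicks:
        # Set a small neighbourhood around each click/scribble point to 1.
        r0, r1 = max(0, r - radius), min(h, r + radius + 1)
        c0, c1 = max(0, c - radius), min(w, c + radius + 1)
        indication[r0:r1, c0:c1] = 1.0
    # Concatenate the three RGB channels with the single-channel indication map.
    return np.concatenate([frame_rgb, indication[..., None]], axis=-1)  # H x W x 4
```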
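The following is a minimal sketch of the bidirectional propagation loop in step (3), with propagate_mask standing in for the mask prediction performed by the mask time domain propagation module; the data structures and the stopping test are assumptions based only on the wording of the claim.

```python
def propagate_bidirectionally(frames, start_index, start_mask, propagate_mask):
    """frames: list of original frames; start_mask: user-approved mask for frames[start_index].
    propagate_mask(memory, query_frame) -> predicted mask (placeholder for the network)."""
    memory = {start_index: start_mask}          # memory frames with known masks
    for step in (+1, -1):                       # forward and backward temporal directions
        i = start_index + step
        while 0 <= i < len(frames) and i not in memory:
            # Predict the query frame's mask from the existing memory frames,
            # then store the query frame as a new memory frame.
            memory[i] = propagate_mask(memory, frames[i])
            i += step
    return [memory[i] for i in range(len(frames))]  # masks of all frames
```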
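Finally, a highly simplified sketch of how the three four-channel inputs of step (5) could be assembled and passed through an encoder-decoder with fused skip connections; the callables encoder, aspp, fusion_modules and decoder are placeholders for the claimed modules, and the choice of feeding the centre frame's bottom-layer features into ASPP is an assumption.

```python
import torch

def fine_segmentation_forward(frames, masks, i, encoder, aspp, fusion_modules, decoder):
    """frames/masks: lists of N x 3 x H x W and N x 1 x H x W tensors indexed by frame number.
    encoder returns a list of per-layer feature maps (shallow to deep);
    fusion_modules and decoder are lists of per-level placeholder blocks."""
    # Three groups of four-channel inputs: (F_{i-1}, M_{i-1}), (F_i, M_i), (F_{i+1}, M_{i+1}).
    groups = [torch.cat([frames[j], masks[j]], dim=1) for j in (i - 1, i, i + 1)]
    # Multi-level features for each group.
    feats = [encoder(g) for g in groups]
    # Bottom-layer features go through ASPP for multi-scale extraction and fusion (assumption:
    # the centre frame's deepest features are used).
    x = aspp(feats[1][-1])
    # Decode upward layer by layer, adding the aligned-and-fused skip features of each level
    # to the feature map decoded by the previous level (shapes assumed to match).
    for level in reversed(range(len(fusion_modules))):
        skip = fusion_modules[level](feats[0][level], feats[1][level], feats[2][level])
        x = decoder[level](x + skip)
    return x  # refined further by the progressive refinement modules of claim 8
```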
CN202210193688.4A 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network Pending CN114549574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210193688.4A CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210193688.4A CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Publications (1)

Publication Number Publication Date
CN114549574A true CN114549574A (en) 2022-05-27

Family

ID=81662002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193688.4A Pending CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Country Status (1)

Country Link
CN (1) CN114549574A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359088A (en) * 2022-10-18 2022-11-18 腾讯科技(深圳)有限公司 Image processing method and device
CN115359088B (en) * 2022-10-18 2023-01-20 腾讯科技(深圳)有限公司 Image processing method and device
CN116363150A (en) * 2023-03-10 2023-06-30 北京长木谷医疗科技有限公司 Hip joint segmentation method, device, electronic equipment and computer readable storage medium
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion
CN117237397B (en) * 2023-07-13 2024-05-28 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN116630869B (en) * 2023-07-26 2023-11-07 北京航空航天大学 Video target segmentation method
CN117061711A (en) * 2023-10-11 2023-11-14 深圳市爱为物联科技有限公司 Video monitoring safety management method and system based on Internet of things
CN117252892A (en) * 2023-11-14 2023-12-19 江西师范大学 Automatic double-branch portrait matting model based on light visual self-attention network
CN117252892B (en) * 2023-11-14 2024-03-08 江西师范大学 Automatic double-branch portrait matting device based on light visual self-attention network

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN114549574A (en) Interactive video matting system based on mask propagation network
CN106960206B (en) Character recognition method and character recognition system
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN112396645B (en) Monocular image depth estimation method and system based on convolution residual learning
CN110909594A (en) Video significance detection method based on depth fusion
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN111582316A (en) RGB-D significance target detection method
CN109978021B (en) Double-flow video generation method based on different feature spaces of text
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN111724400A (en) Automatic video matting method and system
CN112258436A (en) Training method and device of image processing model, image processing method and model
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN115331024A (en) Intestinal polyp detection method based on deep supervision and gradual learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN112784831A (en) Character recognition method for enhancing attention mechanism by fusing multilayer features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination