CN114549574A - Interactive video matting system based on mask propagation network - Google Patents

Interactive video matting system based on mask propagation network

Info

Publication number
CN114549574A
CN114549574A
Authority
CN
China
Prior art keywords
mask
layer
module
frame
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210193688.4A
Other languages
Chinese (zh)
Inventor
沈蓉豪
戴国骏
周文晖
项雷雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210193688.4A priority Critical patent/CN114549574A/en
Publication of CN114549574A publication Critical patent/CN114549574A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an interactive video matting system based on a mask propagation network and feature fusion, which comprises a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion. Compared with existing video matting methods, the system can matte the foreground target of an entire video from only a few clicks or doodles on the foreground target of any single frame, without requiring a trimap for every frame. This greatly reduces the user's workload while achieving the quality of state-of-the-art matting algorithms. The space-time feature fusion module effectively solves the problem of spatio-temporal consistency between video frames and suppresses the artifacts and flicker that moving object details may otherwise produce.

Description

Interactive video matting system based on mask propagation network
Technical Field
The invention relates to the technical field of image processing, in particular to a video matting system based on a mask propagation network and feature fusion.
Background
Image matting is a technology focused on extracting the foreground of an object. Its core idea is to model the image mathematically as a convex combination of a foreground part and a background part weighted by a transparency mask, and to separate the Foreground and Background through the determined transparency mask (alpha matte). The mathematical model is solved from the following formula:
I_z = α_z F_z + (1 - α_z) B_z (1)
where z denotes a pixel with coordinates (x, y) in the image, I_z is the RGB color value of pixel z, F_z is the color value of the foreground at z, B_z is the color value of the background at z, and α_z is the transparency mask value of z, with value range [0, 1]. Solving this formula requires additional supplementary constraints. Common supplementary inputs are trimaps, scribbles, background images, foreground coordinates, etc.
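As an illustration of equation (1), the following minimal NumPy sketch (not part of the patent; the array names are illustrative) composites a foreground over a background with a given alpha matte and notes why the inverse problem is under-constrained.

```python
import numpy as np

# Illustrative shapes: H x W x 3 color images, H x W x 1 alpha matte in [0, 1].
H, W = 4, 4
F = np.random.rand(H, W, 3)          # foreground color F_z
B = np.random.rand(H, W, 3)          # background color B_z
alpha = np.random.rand(H, W, 1)      # transparency mask alpha_z

# Equation (1): I_z = alpha_z * F_z + (1 - alpha_z) * B_z
I = alpha * F + (1.0 - alpha) * B

# Per pixel there are 3 observed values (RGB of I) but 7 unknowns
# (3 for F, 3 for B, 1 for alpha), which is why supplementary input
# such as a trimap or scribbles is needed to solve for alpha.
print(I.shape)  # (4, 4, 3)
```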
Video matting is the task of extracting moving foreground objects from a given video, built on image matting; compared with image matting it introduces two additional challenges.
First, video matting must matte every frame of the video, so each frame would need supplementary input such as a trimap or scribbles; as the amount of data grows, manual annotation becomes prohibitively expensive in time and labor. Researchers therefore strive to weaken the supplementary input and let the network learn the motion of the foreground object so as to predict the object's transparency mask throughout the video; alternatively, some methods take an additional background video or background image as input and compare it with the original video to obtain foreground prior information, avoiding a large amount of manual annotation.
Second, the transparency mask (alpha matte) obtained by video matting needs to be consistent in both time and space. If an image matting algorithm is applied directly to each frame and the results are stitched back into a video, artifacts and flicker inevitably appear on moving objects and fine details. The traditional solution is to estimate the motion of the foreground by finding local or non-local affinities among pixel colors, but the results are often unsatisfactory, especially in complex scenes such as cluttered backgrounds and fast-moving foregrounds. More recent methods use optical flow estimation to predict the motion of the foreground, but optical flow often fails in large semi-transparent regions.
Disclosure of Invention
The invention provides an interactive video matting system based on a mask propagation network and feature fusion. The user only needs to provide a small number of clicks or doodles on any single frame of the video, indicating whether a position belongs to the foreground or the background, to complete matting of all video frames; no trimap is required for each frame. This greatly reduces the workload of video matting, delivers performance comparable to state-of-the-art matting algorithms, and solves the problem of spatio-temporal consistency of foreground objects across frames.
An interactive video matting system based on a mask propagation network comprises a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion.
The cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module.
The interactive target rough segmentation module is used for interacting with an input image. Two interaction modes are supported, clicking and doodling, and the user selects either mode according to the actual situation. A single click or doodle yields foreground target information (an indication map) for the original input image; the indication map is input to an image segmentation network together with the original input image to obtain a preliminary Mask.
The user can optimize the mask by repeatedly clicking or doodling until a sufficiently accurate mask is obtained, and then the mask is sent to the mask time domain propagation module.
The mask time domain propagation module comprises a space-time memory frame reader (memory frame reader) based on an attention mechanism;
the attention-based spatiotemporal memory frame reader comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder).
After obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions. The principle is that the mask of a query frame is predicted from the memory frames already present in the Cache module; the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, and the next frame of the video is taken as the new query frame. This operation is repeated, and propagation stops when the next frame is already a memory frame or is the last frame of the video, at which point the masks of all frames have been obtained.
Specifically, the current interactively annotated frame is used as the memory frame and an adjacent frame as the query frame. The key feature maps of the memory frame and the query frame are matched, the value feature map of the memory frame is multiplied by the weights generated by the key feature matching, the result is concatenated with the value feature map of the query frame and sent to the mask decoder (query decoder) for decoding, and the mask of the query frame is finally predicted.
The subdivision module based on space-time feature fusion comprises a subdivision encoder, a subdivision decoder, an ASPP (atrous spatial pyramid pooling) atrous convolution pooling pyramid, a space-time feature fusion module and a gradual refinement module.
The fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask (alpha matte) according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
The subdivision module based on space-time feature fusion performs the following operations on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction. The encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. At the same time, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. In addition, the outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, so that the matting result is progressively refined during upward decoding and the accurate transparency mask (alpha matte) is finally obtained.
A use method of an interactive video matting system based on a mask propagation network comprises the following steps:
Step (1), caching the video to be processed as individual video frames through the cache module so as to obtain the original image of each frame;
step (2), carrying out coarse segmentation on a foreground target in an original input image through an interactive target coarse segmentation module, thereby extracting a foreground target Mask (Mask);
A user selects the original image of any frame from the cache module as the original input image and obtains the corresponding indication map by clicking or doodling on the foreground object. The indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network. The image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask.
The user can optimize the mask by repeatedly clicking or doodling until a sufficiently accurate mask is obtained, and then the mask is sent to the mask time domain propagation module.
Step (3), obtaining the masks of all frames in the video through the mask time domain propagation module.
After obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions. The principle is that the mask of a query frame is predicted from the memory frames already present in the Cache module; the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, and the next frame of the video is taken as the new query frame. This operation is repeated, and propagation stops when the next frame is already a memory frame or is the last frame of the video, at which point the masks of all frames have been obtained.
Step (4), the user judges whether the obtained masks of all frames are satisfactory;
If the user is not satisfied with the masks of some frames, the original image corresponding to an unsatisfactory mask is selected as the original input image, and steps (2) and (3) are repeated until satisfactory masks of all frames are obtained.
Step (5), once satisfactory masks of all frames are obtained, the accurate transparency mask is predicted through the fine segmentation module based on space-time feature fusion.
The fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask according to all video frame masks output by the mask temporal propagation module and the original input image of each frame of the video stored in the cache module.
Through the subdivision module based on space-time feature fusion, the following operations are performed on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction. The encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. At the same time, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. The outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, and the progressively refined matting result finally yields the accurate transparency mask (alpha matte).
The invention has the following beneficial effects:
Compared with existing video matting methods, the invention can matte the foreground target of an entire video from only a few clicks or doodles on the foreground target of any single frame, without requiring a trimap for every frame. This greatly reduces the user's workload while achieving the quality of state-of-the-art matting algorithms. The space-time feature fusion module effectively solves the problem of spatio-temporal consistency between video frames and suppresses the artifacts and flicker that moving object details may otherwise produce.
Drawings
FIG. 1 is a flow diagram of the overall interactive video matting method and system;
FIG. 2 is a flow chart of a spatiotemporal memory frame reader;
FIG. 3 is a flow diagram of a mask time domain propagation module;
FIG. 4 is a diagram of a feature fusion network module architecture;
FIG. 5 is a block diagram of a subdivided network based spatiotemporal feature aggregation module;
fig. 6 shows a matting result of an embodiment of the invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings so that the advantages and features of the invention can be more easily understood by those skilled in the art, and the scope of the invention will be more clearly defined.
As shown in fig. 1, an interactive video matting system based on a mask propagation network includes a cache module, an interactive image rough segmentation module, a mask time domain propagation module, and a fine segmentation module based on spatio-temporal feature fusion:
firstly, a cache module:
the cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module.
Secondly, an interactive target rough segmentation module:
as shown in the upper half of fig. 1, in this module a user selects the original input image of any frame from the cache module and obtains the corresponding indication map by clicking or doodling on the foreground object. The indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network. The image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask.
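For illustration only, the following sketch shows one way the single-channel indication map and the four-channel input described above could be assembled; the function name and the assumption that interactions arrive as pixel coordinates are not from the patent (the embodiment below additionally uses mask and positive/negative doodle channels).

```python
import torch

def build_four_channel_input(rgb, interaction_pixels):
    """rgb: (3, H, W) float tensor; interaction_pixels: list of (y, x) clicked/doodled points."""
    _, H, W = rgb.shape
    indication = torch.zeros(1, H, W)           # single-channel indication map
    for y, x in interaction_pixels:
        indication[0, y, x] = 1.0               # clicked/doodled pixels set to 1
    return torch.cat([rgb, indication], dim=0)  # (4, H, W) input to the segmentation network

# Example: a frame with two clicks on the foreground object.
frame = torch.rand(3, 512, 512)
x = build_four_channel_input(frame, [(100, 200), (250, 260)])
print(x.shape)  # torch.Size([4, 512, 512])
```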
In the embodiment of the invention, a graphical user interface (GUI) is established through which the user interacts with the input image. The user can choose either clicking or doodling to generate the single-channel indication map, and can click or doodle multiple times to adjust the generated foreground target mask.
In the embodiment of the invention, a DeepLabV3+ network is adopted as the backbone of the image segmentation network. The network receives a six-channel input: three channels are the RGB image, one channel is a mask, and two channels are positive and negative doodle maps. The mask has two cases: it is empty during the initial interaction, and it is a single-channel image containing the erroneous region when the generated foreground target mask is being adjusted.
In the embodiment of the invention, the DeepLabV3+ network is trained on the public PASCAL VOC 2012 segmentation dataset. To let the DeepLabV3+ network learn the interaction pattern of user doodling, training data of user doodle interaction would have to be collected, which brings an enormous workload. Therefore the probability that the mask is empty is randomly set to 0.5; when the mask is not empty, the real transparency mask (ground truth alpha matte) provided by the public dataset is eroded and dilated to obtain the training mask, and a thinning or random Bezier curve strategy is then used to generate the corresponding input doodles for the erroneous regions of the mask, simulating user doodling.
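A hedged sketch of this coarse-mask simulation using OpenCV erosion and dilation on a ground-truth alpha matte; the kernel-size range, the 0.5 binarization threshold and the surrounding structure are illustrative assumptions rather than the patent's exact recipe.

```python
import cv2
import numpy as np

def simulate_coarse_mask(gt_alpha, p_empty=0.5):
    """gt_alpha: (H, W) float array in [0, 1] from the public dataset."""
    if np.random.rand() < p_empty:
        return np.zeros_like(gt_alpha, dtype=np.uint8)    # empty mask (initial interaction case)
    binary = (gt_alpha > 0.5).astype(np.uint8)
    k = int(np.random.randint(5, 30))                     # random structuring-element size (assumed range)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    # Randomly erode or dilate so the training mask contains erroneous regions.
    if np.random.rand() < 0.5:
        return cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(binary, kernel, iterations=1)

# The error region (mask XOR ground truth) would then be covered with
# thinned skeletons or random Bezier curves to imitate user doodles.
```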
Thirdly, a mask time domain propagation module:
the mask time domain propagation module comprises a space-time memory frame reader (memory frame reader) based on an attention mechanism;
the attention-based spatiotemporal memory frame reader comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder).
From the original input image F_i selected in the interactive target rough segmentation module and the generated mask M_i, the corresponding masks of all remaining video frames are predicted.
In the spatio-temporal memory frame reader, the video is typically processed frame by frame starting from the second frame; video frames that already have a target mask are regarded as memory frames, and the current frame without a mask is regarded as the query frame.
The space-time memory frame reader based on the attention mechanism comprises a memory encoder (memory encoder), a query encoder (query encoder) and a mask decoder (query decoder). The memory encoder and the query encoder both adopt ResNet50 as a backbone network, and use the characteristic diagram of stage-4(res4) of ResNet50 as a basic characteristic diagram of a calculated key value characteristic diagram.
For the input part, the memory encoder adds an extra input channel in the first convolution layer; its inputs are the image and the Mask, while the query encoder takes only the image as input.
As shown in fig. 2, two convolution layers are added at the ends of the memory encoder and the query encoder to generate two feature maps, a Key Map and a Value Map, which are used to compute the similarity of key features between the query frame and the memory frame. The key map and the value map are denoted by K ∈ R^(C_k×HW) and V ∈ R^(C_v×HW), respectively, where HW represents the original spatial size and C_k and C_v are set to 128 and 512, respectively.
As can be seen from FIG. 2, for each of the T memory frames the spatio-temporal memory frame reader computes its key and value feature maps by convolution and concatenates the outputs into a memory key map K^M and a memory value map V^M. The query key map K^Q is then matched against the memory key map K^M by dot product, according to the following formula:
F = (K^M)^T K^Q (2)
where F ∈ R^(THW×HW) represents the affinity between query and memory locations.
The spatio-temporal memory read operation first measures the similarity of all pixels between the query key map and the memory key map to compute the weights of V^M; V^M is multiplied by these weights, combined with V^Q, and the result is input to the mask decoder.
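The following PyTorch sketch illustrates the read operation around equation (2): dot-product matching of memory and query key maps, weighting of the memory values, and combination with the query values before decoding. The softmax normalization, the tensor layout and the use of concatenation (the description mentions both adding and concatenating with V^Q) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def memory_read(k_m, v_m, k_q, v_q):
    """
    k_m: (B, Ck, T*H*W) memory key map,  v_m: (B, Cv, T*H*W) memory value map,
    k_q: (B, Ck, H*W) query key map,     v_q: (B, Cv, H*W) query value map.
    """
    # Equation (2): affinity F = (K^M)^T K^Q, shape (B, T*H*W, H*W)
    affinity = torch.bmm(k_m.transpose(1, 2), k_q)
    weights = F.softmax(affinity, dim=1)           # weights over memory locations (assumed softmax)
    read = torch.bmm(v_m, weights)                 # (B, Cv, H*W): weighted memory values
    return torch.cat([read, v_q], dim=1)           # combined with query values, fed to the mask decoder

B, Ck, Cv, T, HW = 1, 128, 512, 2, 24 * 24
out = memory_read(torch.rand(B, Ck, T * HW), torch.rand(B, Cv, T * HW),
                  torch.rand(B, Ck, HW), torch.rand(B, Cv, HW))
print(out.shape)  # torch.Size([1, 1024, 576])
```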
After the mask decoder receives the output of the space-time memory read operation, it reconstructs the target mask of the query frame. The mask refinement network proposed by Facebook is used as the building block: the output of the space-time memory read operation is first compressed to 256 channels by a convolution layer and a residual block, and the compressed output is then gradually upscaled by three mask refinement modules, each doubling the resolution; the mask refinement module of each stage receives, via a skip connection to the query encoder, the output of the previous stage and the corresponding feature map. The output of the last mask refinement module is fed into convolution layers for reconstructing the target mask; each convolution layer of the decoder uses 3 × 3 convolution filters producing 256-channel outputs, and the last convolution layer outputs a predicted mask at 1/4 of the original image resolution.
The steps of the embodiment of the invention mainly comprise:
1) The mask extracted by interactive segmentation and the single-frame original image are taken as the memory frame, and the adjacent frame to be predicted is taken as the query frame.
2) A convolution operation is performed on the query frame to obtain its key feature map K^Q and value feature map V^Q.
3) The similarity between the key feature map K^Q of the query frame and the key feature map K^M of the memory frame is computed and then multiplied with the value feature map V^M of the memory frame to obtain an aligned value feature map.
4) The aligned value feature map is added to the value feature map V^Q of the query frame and decoded by the decoder to obtain the Mask of the query frame.
5) The query frame is put into the Cache module as a memory frame, and prediction continues with the next frame.
6) The above operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops.
The space-time memory frame reader implements the derivation of a query-frame mask from the memory frames; to derive the masks of all video frames from the mask of a single frame, a corresponding mask propagation strategy must be specified.
As shown in FIG. 3, the original input image F_i and the mask M_i obtained by the interactive target rough segmentation module are taken as the reference, and are propagated to the other frames in both the forward and backward directions along the time domain dimension. In each direction, the following strategy is followed:
Each time the current frame propagates to the next frame in that direction, the query frame whose mask has been predicted is marked as a memory frame and stored in the Cache module, and propagation continues until the next frame is already a memory frame or is the last frame of the video.
Through the mask time domain propagation module, a user can obtain masks corresponding to all video frames.
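A schematic sketch of this bidirectional propagation strategy; predict_mask stands in for the attention-based space-time memory frame reader and is a hypothetical callable, not an API defined by the patent.

```python
def propagate_masks(frames, start_idx, start_mask, predict_mask):
    """
    frames: list of video frames; start_idx: index of the interactively annotated frame;
    start_mask: its mask; predict_mask(memory, query_frame) -> mask is a user-supplied model call.
    """
    masks = {start_idx: start_mask}            # cache of memory frames (index -> mask)
    for step in (+1, -1):                      # forward and backward time-domain directions
        i = start_idx
        while 0 <= i + step < len(frames):
            nxt = i + step
            if nxt in masks:                   # next frame is already a memory frame: stop
                break
            memory = [(frames[j], masks[j]) for j in masks]
            masks[nxt] = predict_mask(memory, frames[nxt])   # query frame gets a predicted mask
            i = nxt                            # the query frame now joins the memory
    return [masks[i] for i in range(len(frames))]
```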
Fourthly, a fine segmentation module based on space-time feature fusion:
the fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask (alpha matte) according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
As can be seen from fig. 5, the subdivision module based on spatio-temporal feature fusion includes a subdivision encoder, a subdivision decoder, an ASPP atrous convolution pooling pyramid, a spatio-temporal feature fusion module, and a gradual refinement module.
Three consecutive frame images F_{r-1}, F_r, F_{r+1} in the cache module and their corresponding masks M_{r-1}, M_r, M_{r+1} are concatenated into three groups of 4-channel inputs and fed separately into the subdivision encoder to extract three groups of depth feature maps Fea_{r-1}, Fea_r, Fea_{r+1} at different scales. The features at the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the result is then output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding. The three groups of feature maps Fea_{r-1}, Fea_r, Fea_{r+1} are also passed through skip connections to the space-time feature fusion module for feature alignment and feature fusion; the aligned and fused features of different scales are then input to the corresponding levels of the subdivision decoder and added to the feature map decoded by the level above for decoding of the current level, providing guidance information for eliminating erroneous predictions. The feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward. Gradual refinement modules are introduced at the second, third and fourth layers of the subdivision decoder respectively, so that the transparency mask prediction is progressively refined during upward decoding.
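A minimal sketch of how the three 4-channel groups described above could be formed; the tensor shapes and the function name are illustrative, and the encoder / ASPP / fusion / decoder modules themselves are only referenced in comments.

```python
import torch

def build_groups(frames, masks, r):
    """frames, masks: lists of (3, H, W) and (1, H, W) tensors; r: index of the center frame."""
    groups = []
    for i in (r - 1, r, r + 1):
        groups.append(torch.cat([frames[i], masks[i]], dim=0))   # 4-channel input F_i ++ M_i
    return torch.stack(groups)   # (3, 4, H, W): fed separately into the subdivision encoder

frames = [torch.rand(3, 512, 512) for _ in range(3)]
masks = [torch.rand(1, 512, 512) for _ in range(3)]
x = build_groups(frames, masks, 1)
print(x.shape)  # torch.Size([3, 4, 512, 512])
# Each group is encoded to multi-scale features Fea_{r-1}, Fea_r, Fea_{r+1};
# the bottom features go through the ASPP pyramid, while the per-level features
# go through skip connections into the space-time feature fusion module.
```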
The ASPP atrous convolution pooling pyramid module in the embodiment of the invention mainly follows the practice adopted in DeepLabV3+: atrous convolutions with different sampling rates capture semantic information at different scales and fuse it. The specific principle is not repeated here.
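A compact ASPP sketch in the spirit of DeepLabV3+, shown for reference; the dilation rates and the channel width are common defaults and are not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),       # image-level pooling branch
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))  # multi-scale fusion

y = ASPP(512)(torch.rand(1, 512, 16, 16))
print(y.shape)  # torch.Size([1, 256, 16, 16])
```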
Next, the subdivision encoder, the subdivision decoder, the spatio-temporal feature fusion module and the gradual refinement module are introduced.
(1) A subdivision encoder and a subdivision decoder:
as shown in FIG. 5, the subdivision encoder and decoder network uses a custom U-Net structure. At the input of the subdivision encoder, the four-channel input S0 ∈ R^(4×512×512) is composed of the RGB image and the guide map; the number of channels is 4 and the size is set to 512 × 512 according to the input size, because data loading for the network usually crops the input image. The input is passed through two convolution layers to obtain a 2× downsampled feature map S1 ∈ R^(32×256×256); a spectral normalization operation and batch normalization are applied after each convolution layer, the purpose being to impose a Lipschitz constant constraint on the network so that training is more stable. The feature S2 ∈ R^(64×128×128) is then obtained through the convolution of the second layer and the first residual block Res1, the feature S3 ∈ R^(128×64×64) through the second residual block Res2 of the third layer, and the 16× downsampled feature map S4 ∈ R^(256×32×32) and the 32× downsampled map S5 ∈ R^(512×16×16) through the third residual block Res3 of the fourth layer and the fourth residual block Res4 of the fifth layer, respectively.
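A sketch of one encoder convolution stage with spectral normalization followed by batch normalization, as described above; the channel counts match the S0→S1 stage, but the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

def sn_conv_block(in_ch, out_ch, stride=1):
    """Conv with a Lipschitz-constraining spectral norm, then batch norm and ReLU."""
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

# S0 (4 x 512 x 512) -> two conv layers -> S1 (32 x 256 x 256), a 2x downsampling.
stage1 = nn.Sequential(sn_conv_block(4, 32, stride=2), sn_conv_block(32, 32))
s1 = stage1(torch.rand(1, 4, 512, 512))
print(s1.shape)  # torch.Size([1, 32, 256, 256])
```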
In the sub-segmentation decoder part, as shown in the right side of fig. 5, the feature map decoded by each layer is combined with the feature output by the space-time feature fusion module of the corresponding layer, and then is up-sampled and decoded. In addition, the second layer, the third layer and the fifth layer can predict transparency masks with different scales through convolution. These predicted transparency masks are used together with the prediction of the next level as input to the step-by-step refinement module to derive the transparency mask of the next level.
(2) Spatio-temporal feature fusion module
As shown in fig. 4, the spatio-temporal feature fusion module includes a spatio-temporal feature alignment module and a spatio-temporal feature aggregation module. The spatio-temporal feature alignment module consists of two 3 × 3 ordinary convolutions and one 3 × 3 deformable convolution, and the spatio-temporal feature aggregation module consists of a channel attention network, a spatial attention network and a global convolution network connected in series.
In this embodiment, the features Fea_{r-1} and Fea_r of two adjacent frames are input into the spatio-temporal feature alignment module to obtain the feature of Fea_{r-1} aligned to Fea_r; after the same operation is applied to Fea_r and Fea_{r+1}, the feature of Fea_{r+1} aligned to Fea_r is obtained. Feature alignment is thus performed with the features of 3 adjacent frames as one group, yielding two features aligned to Fea_r.
The two aligned features are concatenated and input into the spatio-temporal feature aggregation module: the input features are first weighted per channel by the channel attention network, their spatial information is then weighted by the spatial attention network, and the features are finally output after a global convolution.
The feature alignment described above relies on the principle of deformable convolution, which applies learned offsets to the input features so that the features can be aligned. For the image frame I_t at time t, if the offset of a pixel p is denoted Δp, the aligned feature F* can be expressed as:
F*(p) = Σ_k w_k F_t(p + p_k + Δp_k) (3)
where k indexes the positions p_k of the deformable convolution kernel, w_k is the weight at that position, and Δp_k is the learned offset of the feature between times t and t + Δt. Aligning features by learning such offsets lets the model automatically match the same or similar regions and pixels, and also encodes temporal information into the aligned features.
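A sketch of offset-based alignment using torchvision's DeformConv2d; predicting the offsets Δp_k from the concatenated neighboring and reference features is a common design and is assumed here, since the text only states that two ordinary convolutions precede the deformable convolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignBlock(nn.Module):
    """Aligns the feature of a neighboring frame to the reference frame via learned offsets."""
    def __init__(self, ch, k=3):
        super().__init__()
        # Offsets delta_p_k: 2 values (x, y) per kernel position, predicted from both features.
        self.offset_pred = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * k * k, 3, padding=1))
        self.deform = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, fea_neighbor, fea_ref):
        offset = self.offset_pred(torch.cat([fea_neighbor, fea_ref], dim=1))
        return self.deform(fea_neighbor, offset)   # equation (3): weighted sum at offset positions

aligned = AlignBlock(64)(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))
print(aligned.shape)  # torch.Size([1, 64, 64, 64])
```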
The spatio-temporal feature aggregation module processes the aligned features with an attention mechanism, guiding the model to exploit the importance of different channels and the regions of interest on a given channel, i.e. along both the channel dimension and the spatial dimension. The channel attention network applies a global average pooling layer to the input features followed by a fully connected layer to compute a channel attention weight map, which is multiplied with the aligned features. The spatial attention network obtains two 1 × H × W features from the input features through a global average pooling layer and a global max pooling layer; after concatenation, these are passed through a convolution and a sigmoid activation layer to reduce the channels and output a 1 × H × W spatial attention weight map, which is multiplied with the original input features to obtain the spatially weighted feature map. A 1 × 1 convolution layer then reduces the number of channels, and the global convolution network enlarges the receptive field.
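A sketch of the aggregation path (channel attention, then spatial attention, then a channel-reducing 1 × 1 convolution and a global convolution); representing the global convolution network by a single large-kernel convolution is an assumption about its form.

```python
import torch
import torch.nn as nn

class Aggregate(nn.Module):
    def __init__(self, ch, out_ch):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(inplace=True),
                                nn.Linear(ch // 4, ch), nn.Sigmoid())   # channel attention weights
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())  # 1xHxW weight map
        self.reduce = nn.Conv2d(ch, out_ch, 1)                  # 1x1 conv reduces channels
        self.global_conv = nn.Conv2d(out_ch, out_ch, 7, padding=3)  # stands in for the global conv network

    def forward(self, x):
        b, c, h, w = x.shape
        w_ch = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)     # global average pooling + FC
        x = x * w_ch
        stats = torch.cat([x.mean(dim=1, keepdim=True),         # 1xHxW average map
                           x.amax(dim=1, keepdim=True)], dim=1) # 1xHxW max map
        x = x * self.spatial(stats)                             # apply spatial attention
        return self.global_conv(self.reduce(x))

y = Aggregate(128, 64)(torch.rand(1, 128, 64, 64))
print(y.shape)  # torch.Size([1, 64, 64, 64])
```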
(3) Gradual refining module
The purpose of this module is to use the high-level features to identify the internal region of the object while using foreground information near the boundary depicted by the low-level features to improve the final matting result.
The principle is as follows:
Assume the current decoder layer is l. The gradual refinement module first upsamples the transparency mask α_{l-1} output by decoder layer l-1 to the size of the current layer's prediction α_l. A self-guidance map g_l(x, y) is then obtained using the following formula:
g_l(x, y) = 1 if 0 < α_{l-1}(x, y) < 1, and g_l(x, y) = 0 otherwise (4)
As shown in equation (4), the self-guidance map represents the unknown region of the previous layer's predicted mask; it is a single-channel map whose pixel values are 0 or 1, where 1 denotes the unknown region and 0 denotes the determined region. The mask α_l of the current level is blended with the mask α_{l-1} of the previous level through the self-guidance map, so that both high-level and low-level information is used. The formula is as follows:
α_l = α_l g_l + α_{l-1} (1 - g_l) (5)
after progressive refinement modules are applied to the 2 nd layer, the 3 rd layer and the 5 th layer of the decoder respectively, the unknown region represented by the bootstrap map is also reduced along with upward decoding of the features, and the predicted transparency mask is also progressively refined.
The above process predicts the transparency mask of a single video frame; applying the space-time feature fusion subdivision module to every original frame and its corresponding mask yields the fine matting result for the whole video.
Fig. 6 shows a matting result of the embodiment of the invention.
The subdivision module based on spatio-temporal feature fusion in the embodiment of the invention is trained on the public video dataset provided by Deep Video Matting, which contains 408 foreground pictures, 87 foreground videos and 6659 background videos in total. We select 48 video foregrounds and 231 picture foregrounds, randomly select 15 backgrounds from the 6659 background videos, and synthesize 4185 training samples as the training set. 248 video samples are likewise selected as the validation set.

Claims (9)

1. An interactive video matting system based on a mask propagation network is characterized by comprising a cache module, an interactive image rough segmentation module, a mask time domain propagation module and a fine segmentation module based on space-time feature fusion;
the cache module is used for caching the video as individual video frames so as to obtain the original input image of each frame, and for caching the memory frames marked by the mask time domain propagation module;
the interactive target rough segmentation module is used for interacting with an input image; the interaction comprises two interaction modes, clicking and doodling, and the user selects either mode according to the actual situation; foreground target information of the original input image, namely an indication map, is obtained through a single click or doodle, and the indication map is input to an image segmentation network together with the original input image to obtain a preliminary mask;
the user can optimize the mask by repeatedly clicking or doodling until the mask which is accurate enough is obtained, and then the mask is sent to the mask time domain propagation module;
the mask time domain propagation module comprises a space-time memory frame reader based on an attention mechanism; the attention-based spatiotemporal memory frame reader comprises a memory encoder, a query encoder and a mask decoder;
after obtaining the mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions; specifically, the mask of a query frame is predicted from the memory frames already present in the cache module, the query frame with its predicted mask is then marked as a memory frame and stored in the cache module, the next frame of the video is taken as the new query frame, and this operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops and the masks of all frames have been obtained;
the subdivision module based on space-time feature fusion comprises a subdivision encoder, a subdivision decoder, an ASPP atrous convolution pooling pyramid, a space-time feature fusion module and a gradual refinement module;
the fine segmentation module based on the spatio-temporal feature fusion predicts an accurate transparency mask according to all video frame masks output by the mask time domain propagation module and the original images of the video frames, and eliminates artifacts and flickering phenomena which may appear in video matting by utilizing spatio-temporal information between frames.
2. The interactive video matting system based on the mask propagation network as recited in claim 1, wherein the specific propagation mode is to use the current interactive frame as a memory frame and the adjacent frame as a query frame, match the memory frame with the key feature map of the query frame, multiply the value feature map of the memory frame by the weight generated by the key feature matching, finally connect the value feature map of the query frame and send it to the mask decoder for decoding, and finally predict the mask of the query frame.
3. The interactive video matting system based on the mask propagation network as claimed in claim 2, wherein the subdivision module based on spatio-temporal feature fusion performs the following operations on each original frame F_i of the video: F_i, its two adjacent original images F_{i-1}, F_{i+1}, and the corresponding masks M_i, M_{i-1}, M_{i+1} are combined into three groups of four-channel input data and fed into the subdivision encoder for multi-level feature extraction; the encoding features of the bottommost layer of the subdivision encoder are input into the ASPP atrous convolution pooling pyramid for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the subdivision decoder for layer-by-layer upward decoding; meanwhile, each layer of the subdivision encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module passes the aligned and fused feature map through a skip connection to the corresponding level of the subdivision decoder, where it is added to the feature map decoded by the previous level for decoding of the current level; the feature at the final decoding layer of the subdivision decoder is obtained from the ASPP output at the bottom of the decoder after decoding layer by layer upward; in addition, the outputs of the second, third and fifth layers of the subdivision decoder are each connected to a gradual refinement module, so that the matting result is progressively refined during upward decoding and the accurate transparency mask is finally obtained.
4. The interactive video matting system based on the mask propagation network according to claim 1, 2 or 3, wherein the image segmentation network of the interactive target rough segmentation module adopts a DeepLabV3+ network as backbone; the network accepts a six-channel input, in which three channels are the RGB image, one channel is a mask, and two channels are positive and negative doodle maps; the mask has two cases: it is empty at the initial interaction, and it is a single-channel image containing the erroneous region when the generated foreground target mask is adjusted.
5. The interactive video matting system based on mask propagation network as claimed in claim 4, wherein the memory encoder and the query encoder both adopt ResNet50 as backbone network, and use the stage-4 feature map of ResNet50 as a basic feature map of the calculated key-value feature map; for the input part, the memory encoder adds an additional input channel in the first convolution layer, the input of the memory encoder is an image and a mask, and the input of the query encoder is only an image;
two convolution layers are added at the ends of the memory encoder and the query encoder to generate a key map and a value map, respectively, for computing the similarity of key features between the query frame and the memory frame; the key map and the value map are denoted by K ∈ R^(C_k×HW) and V ∈ R^(C_v×HW), respectively, where HW represents the original spatial size and C_k and C_v are set to 128 and 512, respectively;
for each of the T memory frames, the spatio-temporal memory frame reader computes its key and value feature maps by convolution operations and concatenates the outputs into a memory key map K^M and a memory value map V^M; the query key map K^Q is then matched against the memory key map K^M by dot product, according to the following formula:
F = (K^M)^T K^Q (2)
where F ∈ R^(THW×HW) represents the affinity between query and memory locations;
the spatio-temporal memory read operation first measures the similarity of all pixels between the query key map and the memory key map to compute the weights of V^M; V^M is multiplied by these weights, added to V^Q, and the result is input to the mask decoder;
after the mask decoder obtains the output of the space-time memory read operation, it reconstructs the target mask of the query frame; the mask refinement network proposed by Facebook is used as the building block: the output of the space-time memory read operation is compressed to 256 channels by a convolution layer and a residual block, the compressed output is then gradually upscaled by three mask refinement modules, each doubling the resolution, and the mask refinement module of each stage receives, via a skip connection to the query encoder, the output of the previous stage and the corresponding feature map; the output of the last mask refinement module is fed into convolution layers for reconstructing the target mask, each convolution layer of the decoder uses 3 × 3 convolution filters producing 256-channel outputs, and the last convolution layer outputs a predicted mask at 1/4 of the original image resolution.
6. The interactive video matting system based on the mask propagation network as claimed in claim 5, wherein the subdivision encoder and decoder network uses a custom U-Net structure; at the input of the subdivision encoder, the four-channel input S0 ∈ R^(4×512×512) is composed of the RGB image and the guide map, the number of channels is 4, and the size is set to 512 × 512 according to the input size; the input is passed through two convolution layers to obtain a 2× downsampled feature map S1 ∈ R^(32×256×256); a spectral normalization operation and batch normalization are applied after each convolution layer, the purpose being to impose a Lipschitz constant constraint on the network so that training is more stable; the feature S2 ∈ R^(64×128×128) is then obtained through the convolution of the second layer and the first residual block Res1, the feature S3 ∈ R^(128×64×64) through the second residual block Res2 of the third layer, and the 16× downsampled feature map S4 ∈ R^(256×32×32) and the 32× downsampled map S5 ∈ R^(512×16×16) through the third residual block Res3 of the fourth layer and the fourth residual block Res4 of the fifth layer, respectively;
In the subdivision decoder part, the feature graph decoded by each layer is combined with the feature output by the space-time feature fusion module of the corresponding layer, and then the feature graph is up-sampled and decoded; in addition, the second layer, the third layer and the fifth layer can predict transparency masks with different scales through convolution; these predicted transparency masks are used together with the prediction of the next level as input to the step-by-step refinement module to derive the transparency mask of the next level.
7. The interactive video matting system based on the mask propagation network as claimed in claim 6, wherein the spatio-temporal feature fusion module includes a spatio-temporal feature alignment module and a spatio-temporal feature aggregation module, wherein the spatio-temporal feature alignment module is composed of two 3 × 3 ordinary convolutions and one 3 × 3 deformable convolution, and the spatio-temporal feature aggregation module is composed of a channel attention network, a spatial attention network and a global convolution network connected in series;
the features Fea_{r-1} and Fea_r of two adjacent frames are input into the spatio-temporal feature alignment module to obtain the feature of Fea_{r-1} aligned to Fea_r; after the same operation is applied to Fea_r and Fea_{r+1}, the feature of Fea_{r+1} aligned to Fea_r is obtained; feature alignment is performed with the features of 3 adjacent frames as one group, yielding two features aligned to Fea_r;
the two aligned features are concatenated and input into the spatio-temporal feature aggregation module; the input features are weighted per channel by the channel attention network, their spatial information is then weighted by the spatial attention network, and the features are finally output after a global convolution;
the feature alignment described above relies on the principle of deformable convolution, which applies learned offsets to the input features so that the features can be aligned; for the image frame I_t at time t, if the offset of a pixel p is denoted Δp, the aligned feature F* is expressed as:
F*(p) = Σ_k w_k F_t(p + p_k + Δp_k) (3)
where k indexes the positions p_k of the deformable convolution kernel, w_k is the weight at that position, and Δp_k is the learned offset of the feature between times t and t + Δt;
the space-time feature aggregation module processes the aligned features with an attention mechanism, guiding the model to exploit the importance of different channels and the regions of interest on a given channel, i.e. along both the channel dimension and the spatial dimension; the channel attention network applies a global average pooling layer to the input features followed by a fully connected layer to compute a channel attention weight map, which is multiplied with the aligned features; the spatial attention network obtains two 1 × H × W features from the input features through a global average pooling layer and a global max pooling layer; after concatenation these are passed through a convolution and a sigmoid activation layer to reduce the channels and output a 1 × H × W spatial attention weight map, which is multiplied with the original input features to obtain the spatially weighted feature map; a 1 × 1 convolution layer then reduces the number of channels, and the global convolution network enlarges the receptive field.
8. The interactive video matting system based on the mask propagation network as claimed in claim 7, wherein the gradual refinement module uses the high-level features to identify the internal region of the object while using foreground information near the boundary depicted by the low-level features to improve the final matting result;
the principle is as follows:
assume the current decoder layer is l; the gradual refinement module first upsamples the transparency mask α_{l-1} output by decoder layer l-1 to the size of the current layer's prediction α_l; a self-guidance map g_l(x, y) is then obtained using the following formula:
g_l(x, y) = 1 if 0 < α_{l-1}(x, y) < 1, and g_l(x, y) = 0 otherwise (4)
as shown in equation (4), the self-guidance map represents the unknown region of the previous layer's predicted mask; it is a single-channel map whose pixel values are 0 or 1, where 1 denotes the unknown region and 0 denotes the determined region; the mask α_l of the current level is blended with the mask α_{l-1} of the previous level through the self-guidance map, so that both high-level and low-level information is used; the formula is as follows:
α_l = α_l g_l + α_{l-1} (1 - g_l) (5)
after gradual refinement modules are applied to the 2nd, 3rd and 5th layers of the decoder respectively, the unknown region represented by the self-guidance map shrinks as the features are decoded upward, and the predicted transparency mask is progressively refined.
9. A use method of an interactive video matting system based on a mask propagation network is characterized by comprising the following steps:
step (1), caching the video to be processed as individual video frames through the cache module so as to obtain the original input image of each frame;
step (2), carrying out coarse segmentation on a foreground target in an original input image through an interactive target coarse segmentation module, thereby extracting a foreground target Mask (Mask);
a user selects the original input image of any frame from the cache module and obtains the corresponding indication map by clicking or doodling on the foreground object; the indication map is a single-channel map in which the pixels clicked or doodled by the user are set to 1; it is concatenated with the three channels of the original input image to form a four-channel input, which is fed to the image segmentation network; the image segmentation network coarsely segments the foreground target in the original input image according to the semantic information provided by the indication map, thereby extracting the foreground target Mask;
the user can optimize the mask by repeatedly clicking or doodling until the mask which is accurate enough is obtained, and then the mask is sent to the mask time domain propagation module;
step (3), obtaining masks of all frames in the video through a mask time domain propagation module;
after obtaining the Mask corresponding to the single-frame original image, the mask time domain propagation module performs mask propagation in both the forward and backward time domain directions; specifically, the mask of a query frame is predicted from the memory frames already present in the Cache module, the query frame with its predicted mask is then marked as a memory frame and stored in the Cache module, the next frame of the video is taken as the new query frame, and this operation is repeated until the next frame is already a memory frame or is the last frame of the video, at which point propagation stops and the masks of all frames have been obtained;
step (4), judging whether the user is satisfied with the obtained masks of all frames;
if the user is not satisfied with the obtained masks of all frames, the original image corresponding to an unsatisfactory mask is selected as the original input image, and the masks of all frames are obtained again through steps (2) and (3), until the user obtains satisfactory masks for all frames;
step (5), when the user has obtained satisfactory masks for all frames, predicting an accurate transparency mask through the fine segmentation module based on space-time feature fusion;
the fine segmentation module based on space-time feature fusion predicts an accurate transparency mask according to the masks of all video frames output by the mask time domain propagation module and the original input image of each frame of the video stored in the cache module;
through the fine segmentation module based on space-time feature fusion, the following operations are performed on each original frame F_i of the video: F_i and the two adjacent original frames F_{i-1}, F_{i+1}, together with the corresponding masks M_i, M_{i-1}, M_{i+1}, form three groups of four-channel input data, which are passed into the fine segmentation encoder for multi-level feature extraction; the bottom-layer encoding features of the fine segmentation encoder are input to the ASPP (atrous spatial pyramid pooling) module for multi-scale feature extraction and fusion, and the resulting features are output to the bottom layer of the fine segmentation decoder for layer-by-layer upward decoding; meanwhile, each layer of the fine segmentation encoder outputs its extracted feature map, which is passed through a skip connection to the space-time feature fusion module of the corresponding level for feature alignment and fusion; the space-time feature fusion module outputs the aligned and fused feature map through a skip connection to the corresponding level of the fine segmentation decoder, where it is added to the feature map decoded by the previous level before the current level is decoded; the features of the last decoding layer of the fine segmentation decoder are those obtained by outputting the ASPP features to the bottom layer of the decoder and then decoding upward layer by layer; the outputs of the 2nd, 3rd and 5th layers of the fine segmentation decoder are each connected to a progressive refinement module, and the progressively refined matting result finally yields the accurate transparency mask (alpha matte); a sketch of this forward pass is given after this claim.
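As a purely illustrative aid to step (2), the sketch below shows how a single-channel indication map built from user clicks or scribbles could be concatenated with an RGB frame to form the four-channel input described above; the function name, the NumPy representation, and the fixed click radius are assumptions, not part of the claimed system.

```python
from typing import Iterable, Tuple
import numpy as np

def build_interaction_input(frame_rgb: np.ndarray,
                            clicks: Iterable[Tuple[int, int]],
                            radius: int = 5) -> np.ndarray:
    """Hypothetical helper: frame_rgb is H x W x 3 (float in [0, 1]);
    clicks is a sequence of (row, col) positions marked by the user."""
    h, w, _ = frame_rgb.shape
    indication = np.zeros((h, w), dtype=np.float32)
    for r, c in clicks:
        # Set a small neighbourhood around each click/scribble point to 1.
        r0, r1 = max(0, r - radius), min(h, r + radius + 1)
        c0, c1 = max(0, c - radius), min(w, c + radius + 1)
        indication[r0:r1, c0:c1] = 1.0
    # Concatenate the three RGB channels with the single-channel indication map.
    return np.concatenate([frame_rgb, indication[..., None]], axis=-1)  # H x W x 4
```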
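The following is a minimal sketch of the bidirectional propagation loop in step (3), with propagate_mask standing in for the mask prediction performed by the mask time domain propagation module; the data structures and the stopping test are assumptions based only on the wording of the claim.

```python
def propagate_bidirectionally(frames, start_index, start_mask, propagate_mask):
    """frames: list of original frames; start_mask: user-approved mask for frames[start_index].
    propagate_mask(memory, query_frame) -> predicted mask (placeholder for the network)."""
    memory = {start_index: start_mask}          # memory frames with known masks
    for step in (+1, -1):                       # forward and backward temporal directions
        i = start_index + step
        while 0 <= i < len(frames) and i not in memory:
            # Predict the query frame's mask from the existing memory frames,
            # then store the query frame as a new memory frame.
            memory[i] = propagate_mask(memory, frames[i])
            i += step
    return [memory[i] for i in range(len(frames))]  # masks of all frames
```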
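Finally, a highly simplified sketch of how the three four-channel inputs of step (5) could be assembled and passed through an encoder-decoder with fused skip connections; the callables encoder, aspp, fusion_modules and decoder are placeholders for the claimed modules, and the choice of feeding the centre frame's bottom-layer features into ASPP is an assumption.

```python
import torch

def fine_segmentation_forward(frames, masks, i, encoder, aspp, fusion_modules, decoder):
    """frames/masks: lists of N x 3 x H x W and N x 1 x H x W tensors indexed by frame number.
    encoder returns a list of per-layer feature maps (shallow to deep);
    fusion_modules and decoder are lists of per-level placeholder blocks."""
    # Three groups of four-channel inputs: (F_{i-1}, M_{i-1}), (F_i, M_i), (F_{i+1}, M_{i+1}).
    groups = [torch.cat([frames[j], masks[j]], dim=1) for j in (i - 1, i, i + 1)]
    # Multi-level features for each group.
    feats = [encoder(g) for g in groups]
    # Bottom-layer features go through ASPP for multi-scale extraction and fusion (assumption:
    # the centre frame's deepest features are used).
    x = aspp(feats[1][-1])
    # Decode upward layer by layer, adding the aligned-and-fused skip features of each level
    # to the feature map decoded by the previous level (shapes assumed to match).
    for level in reversed(range(len(fusion_modules))):
        skip = fusion_modules[level](feats[0][level], feats[1][level], feats[2][level])
        x = decoder[level](x + skip)
    return x  # refined further by the progressive refinement modules of claim 8
```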
CN202210193688.4A 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network Pending CN114549574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210193688.4A CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210193688.4A CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Publications (1)

Publication Number Publication Date
CN114549574A true CN114549574A (en) 2022-05-27

Family

ID=81662002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193688.4A Pending CN114549574A (en) 2022-03-01 2022-03-01 Interactive video matting system based on mask propagation network

Country Status (1)

Country Link
CN (1) CN114549574A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359088A (en) * 2022-10-18 2022-11-18 腾讯科技(深圳)有限公司 Image processing method and device
CN115359088B (en) * 2022-10-18 2023-01-20 腾讯科技(深圳)有限公司 Image processing method and device
CN116363150A (en) * 2023-03-10 2023-06-30 北京长木谷医疗科技有限公司 Hip joint segmentation method, device, electronic equipment and computer readable storage medium
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion
CN117237397B (en) * 2023-07-13 2024-05-28 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN116630869B (en) * 2023-07-26 2023-11-07 北京航空航天大学 Video target segmentation method
CN117061711A (en) * 2023-10-11 2023-11-14 深圳市爱为物联科技有限公司 Video monitoring safety management method and system based on Internet of things
CN117252892A (en) * 2023-11-14 2023-12-19 江西师范大学 Automatic double-branch portrait matting model based on light visual self-attention network
CN117252892B (en) * 2023-11-14 2024-03-08 江西师范大学 Automatic double-branch portrait matting device based on light visual self-attention network

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN114549574A (en) Interactive video matting system based on mask propagation network
CN106960206B (en) Character recognition method and character recognition system
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN112396645B (en) Monocular image depth estimation method and system based on convolution residual learning
CN110909594A (en) Video significance detection method based on depth fusion
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN111582316A (en) RGB-D significance target detection method
CN109978021B (en) Double-flow video generation method based on different feature spaces of text
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN111724400A (en) Automatic video matting method and system
CN112258436A (en) Training method and device of image processing model, image processing method and model
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN115331024A (en) Intestinal polyp detection method based on deep supervision and gradual learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN112784831A (en) Character recognition method for enhancing attention mechanism by fusing multilayer features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination