WO2023015755A1 - 一种抠图网络训练方法及抠图方法 - Google Patents

一种抠图网络训练方法及抠图方法 Download PDF

Info

Publication number
WO2023015755A1
WO2023015755A1 PCT/CN2021/130122 CN2021130122W WO2023015755A1 WO 2023015755 A1 WO2023015755 A1 WO 2023015755A1 CN 2021130122 W CN2021130122 W CN 2021130122W WO 2023015755 A1 WO2023015755 A1 WO 2023015755A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
training
image
foreground
sample
Prior art date
Application number
PCT/CN2021/130122
Other languages
English (en)
French (fr)
Inventor
李淼
杨飞宇
钱贝贝
王飞
Original Assignee
奥比中光科技集团股份有限公司
深圳奥芯微视科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 奥比中光科技集团股份有限公司, 深圳奥芯微视科技有限公司 filed Critical 奥比中光科技集团股份有限公司
Publication of WO2023015755A1 publication Critical patent/WO2023015755A1/zh
Priority to US18/375,264 priority Critical patent/US20240029272A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04845Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/174Segmentation; Edge detection involving the use of two or more images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of computer vision, in particular to a method for training a matting network and a method for matting.
  • image matting is a commonly used processing method.
  • Static image matting algorithm is widely used in traditional matting methods to guide color feature extraction.
  • the foreground segmentation is determined by using the color features of the foreground and background to constrain the transition region.
  • traditional matting methods can be divided into two categories: sampling-based methods and similarity-based methods.
  • Sampling-based methods use a pair of foreground or background pixels to represent transition region pixels to obtain foreground segmentation.
  • Similarity-based methods determine foreground boundaries through the similarity of neighboring pixels between certain labels and transition regions. Both of these matting methods do not involve semantic information, are computationally intensive, and their predictive performance deteriorates when the foreground and background have similar color features.
  • trimap-based methods have been extensively studied. Although trimap-based methods have high accuracy, they require manual annotation of a given image to add additional constraints to the matting problem. This manual labeling method is very unfriendly to users, so it has poor practicability; in addition, it requires a large amount of calculation.
  • trimap-free matting methods have received more attention.
  • the mainstream scheme is to use a single RGB image to directly predict the forward segmentation scheme.
  • this type of scheme has a large amount of calculation, and the accuracy does not exceed the method based on trimap.
  • it is also sensitive to the scene, and the generalization still needs to be improved, especially when the input contains unknown objects or multiple foregrounds. performance will deteriorate.
  • the embodiments of the present application provide a matting network training method and a matting method, which can solve at least one technical problem in related technologies.
  • an embodiment of the present application provides a matting network training method, including:
  • the training sample set includes a plurality of training samples, and each of the training samples includes an input image sample, and the input image sample includes an image sample to be matted with a foreground, a background image sample, and all The soft segmentation sample of the foreground, the soft segmentation sample is generated by subtracting the depth image corresponding to the background image sample from the depth image corresponding to the image sample to be matted;
  • the initial network includes at least one stage network; the stage The network includes a series environment combination module, a backbone block and a prediction module.
  • the input image sample is input into the environment combination module, and the environment combination module is used to output low-order features and high-order features after feature exchange.
  • the backbone Blocks are used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features, and the prediction module is used to output prediction foreground segmentation according to the fusion features;
  • the soft segmentation including the foreground is used as a priori for model training. Since the soft segmentation prior is used, background matting becomes a task that relies less on semantics but more on structural information, so the network does not need to be too deep. It is beneficial to realize the lightweight of the network and can be deployed on chips with small computing power; the backbone block can achieve better high-level feature fusion, and the environment combination module is lighter than its corresponding residual network, effectively exchanging the input environment features , which is beneficial to the fusion process of contextual information, both modules improve the accuracy to a certain extent and achieve more reliable foreground segmentation prediction.
  • the training sample set includes a plurality of labeled training samples, and each labeled training sample includes the input image sample and its label;
  • train the initial network to obtain a matting network including:
  • supervised training is performed to facilitate the realization of a map-matching network with higher precision.
  • the training sample set includes a plurality of labeled training samples and a plurality of unlabeled training samples, and each labeled training sample includes the input image sample and its label; each The unmarked training samples include the input image samples;
  • train the initial network to obtain a matting network including:
  • the combination of supervised training and distillation learning can make up the difference between the synthetic data set and the real data, which is conducive to further improving the segmentation accuracy of the matting network and providing a network with good generalization.
  • the initial network includes a plurality of stage networks connected in series; the input image sample is used as the input of the first stage network, the image sample to be matted, the background image sample and the previous The predicted foreground segmentation output by the stage network is used as the input of the next stage network.
  • the initial network includes multiple serial stage networks, which can predict finer structural details, thereby further improving the accuracy of foreground segmentation prediction.
  • the stage network includes 3 times of downsampling.
  • the backbone block includes a feature fusion module based on an attention mechanism.
  • a mixed loss function is used for training, and the mixed loss function includes a mean square error loss, a structural similarity loss, and an intersection ratio loss.
  • the foreground and the boundary can be detected more accurately, thereby further improving the accuracy of the foreground segmentation prediction.
  • an embodiment of the present application provides a method for cutting out images, including:
  • the matting network includes at least one stage network; the stage network includes a series environment combination module, a backbone block and a prediction module, the image to be matted, the background image and the foreground
  • the soft segmentation of the input environment combination module, the environment combination module is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse the low-order features and the Fusion features are obtained from high-order features, and the prediction module is used to output foreground segmentation according to the fusion features.
  • an embodiment of the present application provides a matting network training device, including:
  • the obtaining module is used to obtain a training sample set and an initial network;
  • the training sample set includes a plurality of training samples, each of which includes an input image sample, and the input image sample includes a foreground image sample to be matted, Background image samples and soft segmentation samples of the foreground, the soft segmentation samples are generated by subtracting the depth images corresponding to the background image samples from the depth images corresponding to the image samples to be matted;
  • the initial network includes at least one stage Network;
  • the stage network includes a series environment combination module, a backbone block and a prediction module, the input image sample is input into the environment combination module, and the environment combination module is used to output low-order features and high-order features after feature exchange features, the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features, and the prediction module is used to output prediction foreground segmentation according to the fusion features;
  • the training module is used for using the training sample set to train the initial network to obtain a matting network.
  • an embodiment of the present application provides a map-cutting device, including:
  • An acquisition module configured to acquire an image to be matted including a foreground, a background image and soft segmentation of the foreground;
  • a matting module comprising a matting network, the matting module is used to input the image to be matted, the background image and the soft segmentation into the matting network, and output the foreground segmentation of the image to be matted
  • the matting network includes at least one stage network; the stage network includes a series environment combination module, a backbone block and a prediction module, the image to be matted, the background image and the soft segmentation input of the foreground
  • the environment combination module, the environment combination module is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features, the prediction module is used to output foreground segmentation according to the fusion features.
  • an embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and operable on the processor, and the processor executes the computer program
  • a kind of cutout network training method is realized, and described cutout network training method comprises:
  • the training sample set includes a plurality of training samples, and each of the training samples includes an input image sample, and the input image sample includes an image sample to be matted with a foreground, a background image sample, and a soft segmentation sample of the foreground , the soft segmentation sample is generated by subtracting the depth image corresponding to the background image sample from the depth image corresponding to the image sample to be matted;
  • the initial network includes at least one stage network;
  • the stage network includes a series environment combination module, a backbone block and a prediction module, the input image sample is input to the environment combination module, and the environment combination module is used to undergo feature exchange
  • the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features
  • the prediction module is used to output predictions based on the fusion features foreground segmentation;
  • an embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and operable on the processor, and the processor executes the computer program Realize a kind of map-cutting method when, described map-cutting method comprises:
  • the image to be matted, the background image and the soft segmentation are input into the matting network, and the foreground segmentation of the image to be matted is output;
  • the matting network includes at least one stage network; the stage network includes a series environment combination module, a backbone block and a prediction module, and the image to be matted, the background image and the soft segmentation input of the foreground
  • the environment combination module, the environment combination module is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain Fusion features
  • the prediction module is used to output foreground segmentation according to the fusion features.
  • an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, a method for training a map-matching network is implemented.
  • the matting network training methods include:
  • the training sample set includes a plurality of training samples, and each of the training samples includes an input image sample, and the input image sample includes an image sample to be matted with a foreground, a background image sample, and a soft segmentation sample of the foreground , the soft segmentation sample is generated by subtracting the depth image corresponding to the background image sample from the depth image corresponding to the image sample to be matted;
  • the initial network includes at least one stage network;
  • the stage network includes a series environment combination module, a backbone block and a prediction module, the input image sample is input to the environment combination module, and the environment combination module is used to undergo feature exchange
  • the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features
  • the prediction module is used to output predictions based on the fusion features foreground segmentation;
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, a map-cutting method is implemented, and the map-cutting method Methods include:
  • the image to be matted, the background image and the soft segmentation are input into the matting network, and the foreground segmentation of the image to be matted is output;
  • the matting network includes at least one stage network; the stage network includes a series environment combination module, a backbone block and a prediction module, and the image to be matted, the background image and the soft segmentation input of the foreground
  • the environment combination module, the environment combination module is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain Fusion features
  • the prediction module is used to output foreground segmentation according to the fusion features.
  • an embodiment of the present application provides a computer program product.
  • the computer program product runs on the electronic device, the electronic device executes the cutout network described in the first aspect or any implementation manner of the first aspect. training method, or execute the matting method as described in the second aspect.
  • Fig. 1 is a schematic diagram of the realization flow of a kind of cutout network training method provided by an embodiment of the present application
  • Fig. 2 is a schematic structural diagram of a cutout model provided by an embodiment of the present application.
  • Fig. 3 is a single-channel thermal map from the FH layer provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a real-world image set provided by an embodiment of the present application.
  • Figure 5 is a comparison of the speed and accuracy levels of different models on the Composition-1k dataset
  • Figure 6 is a schematic diagram of the qualitative comparison results between different methods on the Composite-1k test set.
  • FIG. 6 is a schematic diagram of an implementation flow of a cutout method provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of the comparison results between the method provided by an embodiment of the present application and the BM method on real-world images;
  • Fig. 8 is a schematic structural diagram of a matting network training device provided by an embodiment of the present application.
  • Fig. 9 is a schematic structural diagram of a cutout device provided by an embodiment of the present application.
  • Fig. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • One embodiment or “some embodiments” or the like described in the specification of the present application means that a specific feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” “in other embodiments,” etc. in various places in this specification are not necessarily All refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • FIG. 1 is a schematic flow diagram of a method for training a cutout network provided by an embodiment of the present application.
  • the method for training a cutout network in this embodiment can be executed by an electronic device.
  • Electronic devices include, but are not limited to, computers, tablets, servers, mobile phones, cameras, or wearable devices.
  • the server includes but is not limited to an independent server or a cloud server.
  • the matting network training method may include steps S110 to S120.
  • the initial network is pre-stored in the electronic device as a network model to be trained.
  • the initial network contains a set of network parameters to be learned.
  • the matting network can be obtained.
  • the matting network can have the same network structure as the initial network, or a simpler network structure than the initial network. The network parameters of the two are different.
  • the initial network may include a neural network model based on deep learning.
  • a neural network model based on deep learning.
  • backbone networks such as ResNet or VGG.
  • the current background matting network has a large redundancy, because the backbone network such as ResNet or VGG is usually used, these networks are originally designed for image classification tasks that are highly dependent on semantics, so these networks It is generally down-sampled 5 times to extract strong semantic features.
  • the background matting problem becomes a task with less semantic dependence and higher structural dependence, so these networks have greater redundancy to a certain extent .
  • the initial network adopts a lightweight progressive refinement network (Lightweight Refinement Network, LRN).
  • the initial network includes a stage network (stage) or multiple stage networks (stage) connected in series.
  • the initial network uses the RGB image I of the foreground, the RGB background image B, and the soft segmentation S of the foreground as a priori, and can use the depth image to generate soft segmentation, and achieve lightweight through specific network design.
  • the model output is refined multiple times, which can achieve more reliable foreground segmentation prediction.
  • the initial network can be defined as Contains three input images: RGB image I, RGB background image B and foreground soft segmentation S, Indicates the network parameters to be determined during training.
  • the network structure of the initial network is shown in Figure 2, including a stage network (stage) or multiple stage networks (stages) in series, and a stage includes an environment combination module (Context Combining Module, CCM), a backbone block stem and a prediction module predictor.
  • CCM Context Combining Module
  • the stem block preferably includes a feature fusion module (Feature Fusion Module, FFM), which is used to fuse low-order features based on the attention mechanism.
  • FFM feature Fusion Module
  • First-order features and higher-order features are used to obtain fusion features
  • the prediction module is used to predict foreground segmentation based on the output of fusion features.
  • the low-order feature of F L Cat(F 1I , F 1B , F 1S ), where Cat represents a series operation.
  • each low-order feature is further down-sampled to a single high-order feature
  • F 2I E 2I (F 1I )
  • F 2B E 2B (F 1B )
  • F 2S E 2S (F 1S ).
  • F IS C Is (Cat(F 2I , F 2S ))
  • F H C ISB (Cat( FIB , F IS )) to get the overall high-order features.
  • the above process is called a stage.
  • another stage can be used to further refine the output of the previous stage, and the predicted foreground segmentation of the previous stage is used as the prior soft segmentation of the next stage.
  • the RGB image I and RGB background image B input by the previous stage continue to serve as the priors of the next stage.
  • This process can be repeated many times to form a series network structure.
  • C, B and S are used to denote the number of channels in the convolutional layer, the number of convolutional blocks and the number of stages in a residual block, and further denote The network is represented as LRN-C-B-S.
  • LRN-16-4-3 represents an LRN built from 16-channel convolutional layers, 4 blocks, and 3 stages.
  • the advantage of LRN is that it can easily balance accuracy and speed by adjusting C, B, and S.
  • Figure 2 provides a lighter backbone, that is, the stage, which only needs to be down-sampled 3 times.
  • a simple foreground soft segmentation is input, which is a binary image obtained by subtracting the background depth image (such as the depth image corresponding to the RGB background image B) from the foreground depth image (such as the depth image corresponding to the RGB image I).
  • soft segmentation also provides the attention mechanism of the extracted object, so feature extraction can be achieved with less weight.
  • the stage network adopts the CCM module to exchange the context information of image+soft segmentation, image+background and image+soft segmentation+background to fully extract boundary features.
  • the CCM module Since the late fusion of image and trimap is more effective than the early fusion for the feature extraction of the matting network, this example is extended to 3 inputs. After the CCM module performs pairwise fusion of a single input image, Further post-fusion is performed on the fused features, so that the feature information of different inputs can be matched and learned more effectively. In addition, the CCM module utilizes fewer convolutional layers to extract features from a single input and then concatenates them. With this design, the CCM module is more lightweight than the corresponding ResNet block module because fewer convolution channels are introduced before concatenation.
  • the FFM module responsible for feature fusion is used to replace the traditional splicing operation.
  • the FFM module utilizes the attention mechanism to achieve better fusion of encoder features and decoder features.
  • the structural features are extracted layer by layer to form high-order semantic features, which help the model to use a wider receptive field range to judge the position of the segmentation edge.
  • the foreground and background colors are similar (such as a black human head and a black background)
  • the empirical information (such as the human head is usually round) is used to assist the determination of the segmentation position.
  • the FFM module transforms the high-order semantic features of the encoder into spatial attention masks, which are used to guide the restoration of structural information in the decoder. Since only high-order semantic features can provide accurate spatial attention, low-order features from the encoder are not suitable for FFM modules. Therefore, the FFM module is only applied to the inner skip connection (inner skip connection) during network design, but not to the outer skip connection (outer skip connection).
  • the network has been lightweight designed from three aspects.
  • traditional backbone networks such as ResNet and GoogleNet, which downsample the input 5 times to extract rich semantic cues
  • background matting becomes a task less dependent on semantics but more dependent on structural information due to the soft segmentation prior. Many tasks, so the network does not have to be too deep.
  • the resulting semantic information is sufficient and retains rich structural cues.
  • the soft segmentation of the input provides ready-made features, so fewer channels are required for information extraction.
  • the CCM module is more lightweight than its corresponding residual network. The comparison test proves that if each feature is concatenated and the residual module is used to extract the high-order feature F H , this method produces 1.8G FLOPs of additional operations and 0.3M of additional parameters compared with the use of the CCM module, and Model performance becomes worse than using CCM module.
  • the training sample set includes a plurality of training samples, each training sample includes an input image sample and its label (ground truth), and the input image sample includes 3, that is, an image sample I to be matted with a foreground, a background truth An image sample B and a soft segmented sample S of the foreground.
  • Annotation ⁇ * can be manually annotated ground truth foreground segmentation, for example, the annotation includes a standard transparency mask corresponding to the image sample to be matted.
  • labeled training samples are used, so step S120 includes: performing supervised training on the initial network on the training sample set to obtain a matting network.
  • the initial network was trained using the Adobe dataset containing 493 foreground objects, and a synthetic dataset was created.
  • the image sample to be matted can select non-transparent objects (such as removing glass products, etc.) from the Adobe dataset, or, further, one or more of methods such as cropping, rotating, flipping, and adding Gaussian noise can also be used for it random expansion of combinations.
  • Background image samples can be randomly selected from the MS COCO dataset and augmented by one or more combinations of gamma correction and adding Gaussian noise to avoid strong dependence on the fixed value of the background.
  • the soft segmentation sample of the foreground can be generated by subtracting the depth image corresponding to the background image sample from the depth image corresponding to the image sample to be matted, for example, the depth image corresponding to the image sample to be matted minus the depth image corresponding to the background image sample.
  • Binary image the soft segmented samples of the input foreground can be subjected to one or more combinations of erosion, dilation, and blurring by ground truth foreground segmentation to simulate flawed real-world segmentation.
  • the network is trained using a hybrid loss function comprising a variety of different loss functions, for example comprising mean squared error (MSE) loss, structural similarity (SSIM) loss and intersection over union (IoU) loss.
  • MSE loss function is a regular pixel regression loss for segmentation supervision.
  • the SSIM loss function imposes constraints on the mean and standard deviation to better predict structural consistency.
  • the IoU loss function commonly used in image segmentation tasks pays more attention to the optimization of the global structure.
  • the SSIM loss function is used to predict finer boundaries, while the IoU loss function is used to predict more complete foregrounds. Due to the use of a hybrid loss function, the foreground and border can be detected more accurately.
  • the weighting of three different loss functions is used as a hybrid loss function, or called a joint loss function, which is defined as:
  • L ⁇ 1 L MSE + ⁇ 2 L SSIM + ⁇ 3 L IoU .
  • ⁇ 1 , ⁇ 2 , and ⁇ 3 are the respective weight coefficients of three different loss functions.
  • L MSE is the MSE loss, and L MSE is defined as:
  • L SSIM is the SSIM loss, and L SSIM is defined as:
  • L IoU IoU loss
  • L IoU is defined as:
  • the parameter ⁇ can be set to 5
  • ⁇ i, j is the difficulty index of the pixel (i, j), which can be determined by the following formula:
  • a i,j represents the adjacent pixels of the pixel (i,j).
  • unlabeled real images can also be used for unsupervised knowledge distillation.
  • the training sample set includes multiple labeled training samples and multiple unlabeled training samples, each labeled training sample includes an input image sample and its label; each unlabeled training sample includes an input image sample.
  • the input image samples also include three, that is, the foreground image sample I to be matted, the background image sample B, and the foreground soft segmentation sample S.
  • labeled and unlabeled training samples are used, that is, a mixed data set, so step S120 includes: using multiple labeled training samples to perform supervised training on the initial network, and then using multiple Unlabeled training samples are subjected to unsupervised knowledge distillation to obtain a matting network.
  • a real-world human holding data set is created, including 1259 labeled images as a test set and 11255 unlabeled images as a knowledge distillation training set. All images are recorded with depth camera. As shown in Figure 4, the RGB images and depth images of the background and foreground are real-world image datasets. From the upper left to the lower right are the depth background, depth image, soft segmentation, color background, color image, and ground truth foreground segmentation. Among them, soft segmentation is a binary image obtained by subtracting the background depth from the image depth. With 1259 annotated images from 11 scenes, with an average of 2.5 people per scene, each person showing more than 30 items in 1 to 3 poses, this dataset enables qualitative evaluation of algorithms on real-world datasets.
  • the map-matching method can be applied to electronic devices, and electronic devices are deployed with a map-matching network in advance.
  • the matting network may use an untrained initial network.
  • the matting network in order to improve the precision of the matting, may be an initial network, and the matting network may also be obtained by training using the methods of the foregoing embodiments.
  • the matting network includes at least one stage network; the stage network includes a CCM module connected in series, a stem block stem and a prediction module predictor. When using the matting network to perform background matting on the image to be matted, first obtain three images to be input.
  • the three images include: the image to be matted including the foreground, the background image and the soft segmentation of the foreground; the three images are input
  • the matting network outputs the foreground segmentation of the image to be matted.
  • three images are input to the CCM module, which is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse low-order features and high-order features based on the attention mechanism to obtain fusion features.
  • the prediction module Used to output foreground segmentation based on fused features.
  • the embodiment of this application proposes a lightweight real-time background matting network.
  • the network is designed with a shallow structure, and two network modules are proposed at the same time.
  • the FFM module can achieve better high-level feature fusion.
  • the CCM module is lighter than the corresponding traditional residual module, which is conducive to the fusion of context information. process. Both modules improve the accuracy to some extent.
  • a hybrid loss function is introduced, which combines the advantages of MSE, SSIM and IoU loss.
  • a real-world dataset of 1259 labeled images and 11255 unlabeled images was created for quantitative evaluation and knowledge distillation. Experiments on synthetic and real datasets show that the method achieves real-time performance on both PC (111FPS) and Amlogic A311D chip (45FPS).
  • MSE t Four indicators MSE t , SAD t , MSE e and SAD e are used in the experiments to evaluate the model accuracy.
  • MSE and SAD stand for mean squared error and summed absolute error, respectively.
  • the subscripts "t” and "e” denote the evaluation error in the trimap region and the whole image.
  • Previous studies only used MSE t and SAD t metrics, which are sufficient for trimap-based methods since the foreground regions are known. However, for trimap-free methods that need to predict both foreground and unknown regions, the MSE e and SAD e indicators are introduced to get a more complete evaluation.
  • the method of the embodiment of the present application was compared with other 4 learning-based models, including trimap-based CAM and DIM, and trimap-free LFM and BM.
  • the model provided by the embodiment of the present application is also compared with the CAM, DIM and BM models. To clarify, comparisons with traditional methods were excluded, as they have been shown to be far less accurate than learning-based methods.
  • Figure 5 shows the comparison results of the speed and accuracy levels of different models on the Composition-1k dataset.
  • the error and speed comparison results of different models on the Composition-1k dataset are shown in Table 1 below.
  • the model ours in the embodiment of this application adopts the LRN-32-4-4 model.
  • the comparison results of the errors and speeds of different models on the real data set are shown in Table 2 below.
  • the model ours of the embodiment of the present application adopts the LRN-16-4-3 model. Since CAM and DIM are trimap-based methods, there are only SAD t and MSE t indicators.
  • the model (LRN-32-4-4) provided by the embodiment of the present application is superior to other methods in all four indicators, and it is significantly lighter.
  • the method of the embodiment of the present application has 13.0G FLOPs and 2.2M parameters.
  • the FLOPs are reduced by 89.9%, and the number of parameters Param. is reduced by 87.7%.
  • the model inference of 39FPS is realized on the GTX1060ti GPU, which meets the requirements of real-time inference. Real-time means that the inference speed is greater than 30FPS.
  • FIG. 6 shows a schematic diagram of the qualitative comparison results between different methods on the Composite-1k test set.
  • the method provided in the embodiment of the present application has strong robustness to background interference. For example, it shows better foreground and background discrimination, able to detect small background regions surrounded by foreground.
  • FIG. 7 is a schematic diagram of a comparison result between the method provided by an embodiment of the present application and the BM method on real-world images. From Figure 7, it can be seen that the BM method has difficulty detecting the foreground with the same color as the background, for example, a white box in front of a white wall.
  • An embodiment of the present application also provides a matting network training device.
  • a matting network training device For details not described in the cutout network training device, please refer to the description in the foregoing embodiment of the cutout network training method.
  • FIG. 8 is a schematic block diagram of a matting network training device provided by an embodiment of the present application.
  • the matting network training device includes: an acquisition module 81 and a training module 82 .
  • the obtaining module 81 is used to obtain a training sample set and an initial network;
  • the training sample set includes a plurality of training samples, and each of the training samples includes an input image sample, and the input image sample includes a foreground image to be matted Image samples, background image samples, and soft segmentation samples of the foreground, the soft segmentation samples are generated by subtracting the depth images corresponding to the background image samples from the depth images corresponding to the image samples to be matted;
  • the initial network includes At least one stage network;
  • the stage network includes a series environment combination module, a backbone block and a prediction module, the input image sample is input to the environment combination module, and the environment combination module is used to output low-level features after feature exchange and high-order features, the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features, and the prediction module is used to output prediction foreground segmentation according to the fusion features;
  • the training module 82 is configured to use the training sample set to train the initial network to obtain a matting network.
  • the training sample set includes a plurality of labeled training samples, and each labeled training sample includes the input image sample and its label.
  • the training module 82 is specifically used for:
  • the training sample set includes a plurality of labeled training samples and a plurality of unlabeled training samples, each labeled training sample includes the input image sample and its label; each unlabeled The labeled training samples include the input image samples.
  • the training module 82 is specifically used for:
  • the initial network includes a plurality of stage networks connected in series; the input image sample is used as the input of the first stage network, the image sample to be matted, the background image sample and the previous stage network
  • the output predicted foreground segmentation serves as the input of the network in the next stage.
  • the stage network includes 3 downsampling.
  • the backbone block includes a feature fusion module based on an attention mechanism.
  • the training module 82 employs a hybrid loss function including a mean square error loss, a structural similarity loss, and an intersection ratio loss.
  • An embodiment of the present application also provides an image cutout device.
  • the parts not described in detail in the map-cutting device please refer to the description in the above-mentioned embodiment of the map-cutting method.
  • FIG. 9 is a schematic block diagram of a cutout device provided by an embodiment of the present application.
  • the map-cutting device includes: an acquisition module 91 and a map-cutting module 92 .
  • the obtaining module 91 is used to obtain the image to be matted including the foreground, the background image and the soft segmentation of the foreground;
  • the map matting module 92 includes a matting network, and the matting module 92 is used to input the image to be matted, the background image and the soft segmentation into the matting network, and output the foreground segmentation of the image to be matted;
  • the matting network includes at least one stage network; the stage network includes a series environment combination module, a backbone block, and a prediction module, and the soft segmentation of the image to be matted, the background image, and the foreground is input into the environment Combination module, the environment combination module is used to output low-order features and high-order features after feature exchange, and the backbone block is used to fuse the low-order features and the high-order features based on the attention mechanism to obtain fusion features,
  • the prediction module is used to output foreground segmentation according to the fusion feature.
  • the electronic device may include one or more processors 100 (only one is shown in FIG. 10), a memory 101 and a A computer program 102 running on one or more processors 100, for example, a program for matting network training and/or a program for image matting.
  • processors 100 execute the computer program 102
  • various steps in the map matting network training method and/or the map matting method embodiments may be implemented.
  • processors 100 execute the computer program 102, they can realize the functions of each module/unit in the image matting network training device and/or the embodiment of the matting device, which is not limited here.
  • FIG. 10 is only an example of an electronic device, and does not constitute a limitation on the electronic device.
  • the electronic device may include more or less components than those shown in the figure, or some components may be combined, or different components.
  • the electronic device may also include an input and output device, a network access device, a bus, and the like.
  • the so-called processor 100 may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the storage 101 may be an internal storage unit of the electronic device, such as a hard disk or memory of the electronic device.
  • the memory 101 can also be an external storage device of the electronic device, such as a plug-in hard disk equipped on the electronic device, a smart memory card (smart media card, SMC), a secure digital (secure digital, SD) card, a flash memory card (flash card) wait.
  • the memory 101 may also include both an internal storage unit of the electronic device and an external storage device.
  • the memory 101 is used to store computer programs and other programs and data required by the electronic device.
  • the memory 101 can also be used to temporarily store data that has been output or will be output.
  • An embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, it can implement a matting network training method and/or matting Steps in the method examples.
  • An embodiment of the present application provides a computer program product.
  • the computer program product When the computer program product is run on an electronic device, the electronic device can realize the steps in the embodiment of the image matting network training method and/or the image matting method.
  • the disclosed device/electronic equipment and method can be implemented in other ways.
  • the device/electronic device embodiments described above are only illustrative, for example, the division of modules or units is only a logical function division, and there may be other division methods in actual implementation, such as multiple units or components May be combined or may be integrated into another system, or some features may be omitted, or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • an integrated module/unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the present application realizes all or part of the processes in the methods of the above embodiments, and can also be completed by instructing related hardware through computer programs, and the computer programs can be stored in a computer-readable storage medium.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file or some intermediate form.
  • the computer readable medium may include: any entity or device capable of carrying computer program code, recording medium, USB flash drive, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory (read-only memory, ROM), random access Memory (random access memory, RAM), electrical carrier signal, telecommunication signal and software distribution medium, etc. It should be noted that the content contained on computer readable media may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer readable media does not include Electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

本申请涉及计算机视觉技术领域,尤其涉及一种抠图网络训练方法及抠图方法。该方法包括:获取训练样本集合和初始网络;训练样本集合包括多个训练样本,每个训练样本包括输入图像样本,输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及前景的软分割样本,软分割样本利用待抠图图像样本对应的深度图像减去背景图像样本对应的深度图像生成;初始网络包括至少一个阶段网络;阶段网络包括串联的环境组合模块、主干区块和预测模块;利用训练样本集合,训练初始网络得到抠图网络。本申请实施例采用轻量化的模型,实现了高精度的前景分割。

Description

一种抠图网络训练方法及抠图方法
本申请要求于2021年8月9日提交中国专利局,申请号为202110910316.4,发明名称为“一种抠图网络训练方法及抠图方法”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机视觉技术领域,尤其涉及一种抠图网络训练方法及抠图方法。
背景技术
在计算机视觉技术领域,抠图是一种常用的处理手段。
传统抠图方法中广泛使用静态图像抠图算法(trimap)来引导颜色特征提取。利用前景和背景的颜色特征来约束过渡区域,从而确定前景的分割。根据颜色特征的使用方式,传统抠图方法可以分为两类:基于采样的方法和基于相似性的方法。基于采样的方法用一对前景或背景像素表示过渡区域像素,来得到前景分割。基于相似性的方法通过某些标签和过渡区域之间的相邻像素的相似性来确定前景边界。这两种抠图方法都不涉及语义信息,且计算量较大,而且当前景和背景具有相似的颜色特征时,这两种抠图方法的预测效果都会恶化。
随着深度学习的发展,极大地促进了抠图算法的发展。在基于深度学习的抠图算法中,基于trimap的方法受到了广泛的研究。虽然基于trimap的方法有着较高的精度,但其需要对给定图像进行人工标注以增加抠图问题的额外约束。这种人工标注方式对用户十分不友好,因而实用性较差;此外,计算量较大。
近年来,无trimap的抠图方法受到了更多的关注。近两年主流方案为采用单个RGB图像直接预测前进分割的方案。然而,这类方案计算量较大,在精度 上并没有超过基于trimap的方法,此外,也对场景比较敏感,泛化性仍需提升,尤其是当输入包含未知物体或多个前景时,网络的性能会恶化。
为了平衡基于trimap的方法和无trimap的方法的优缺点,目前出现了采用一个背景图像和一个人像的前景软分割替换trimap作为背景分割算法的先验。这种方法计算量大,速度较慢,并且处理持有物品的人或非人的场景时效果会恶化。
发明内容
有鉴于此,本申请实施例提供了一种抠图网络训练方法及抠图方法,可以解决相关技术中的至少一个技术问题。
第一方面,本申请一实施例提供了一种抠图网络训练方法,包括:
获取训练样本集合和初始网络;所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
利用所述训练样本集合,训练所述初始网络得到抠图网络。
本实施例利用包括前景的软分割作为先验进行模型训练,由于采用软分割先验,背景抠图变成一个对语义依赖较少但对结构信息依赖较多的任务,因此网络不必太深,利于实现网络的轻量化,能部署在小算力的芯片上;主干区块可以实现更好的高层特征融合,环境组合模块比其对应的残差网络更轻量化,有效的交换输入的环境特征,有利于上下文信息的融合过程,这两个模块都在一定程度 上提高了精度,实现更可靠的前景分割预测。
作为第一方面一实现方式,所述训练样本集合包括多个带标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注;
利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
在所述训练样本集合上对所述初始网络进行有监督的训练得到抠图网络。
本实现方式中,进行有监督的训练,利于实现精度更高的抠图网络。
作为第一方面一实现方式,所述训练样本集合包括多个带标注的训练样本和多个无标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注;每个无标注的所述训练样本包括所述输入图像样本;
利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
利用多个带标注的训练样本,对所述初始网络进行有监督的训练后,再利用多个无标注的训练样本进行无监督的知识蒸馏,得到抠图网络。
本实现方式中,结合有监督的训练和蒸馏学习,可以弥补合成数据集和真实数据之间的差异,利于进一步提高抠图网络的分割精度,提供泛化性好的网络。
作为第一方面一实现方式,所述初始网络包括多个串联的阶段网络;所述输入图像样本作为第一个阶段网络的输入,所述待抠图图像样本、所述背景图像样本和上一个阶段网络输出的预测前景分割作为下一个阶段网络的输入。
本实现方式中,初始网络包括多个串联的阶段网络,可以预测更加精细的结构细节,从而进一步提高前景分割预测的精度。
作为第一方面一实现方式,所述阶段网络包括3次下采样。
本实现方式中,只需对输入进行3次下采样,便可保留了丰富的结构线索,并得益于背景信息的融入,可更好地平衡了速度和精度。
作为第一方面一实现方式,主干区块包括基于注意力机制的特征融合模块。
作为第一方面一实现方式,训练采用混合损失函数,所述混合损失函数包括均方误差损失、结构相似性损失和交并比损失。
本实现方式中,由于使用了混合损失函数,能够更精确的检测出前景和边界, 从而进一步提高前景分割预测的精度。
第二方面,本申请一实施例提供一种抠图方法,包括:
获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
将所述待抠图图像、所述背景图像以及所述软分割输入采用如第一方面或第一方面任一实现方式所述的抠图网络训练方法得到的抠图网络,输出所述待抠图图像的前景分割;所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
第三方面,本申请一实施例提供一种抠图网络训练装置,包括:
获取模块,用于获取训练样本集合和初始网络;所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
训练模块,用于利用所述训练样本集合,训练所述初始网络得到抠图网络。
第四方面,本申请一实施例提供一种抠图装置,包括:
获取模块,用于获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
抠图模块,包括抠图网络,所述抠图模块用于将所述待抠图图像、所述背景 图像以及所述软分割输入所述抠图网络,输出所述待抠图图像的前景分割;所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
第五方面,本申请一实施例提供一种电子设备,包括:存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现一种抠图网络训练方法,所述抠图网络训练方法包括:
获取训练样本集合和初始网络;
其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;
所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
利用所述训练样本集合,训练所述初始网络得到抠图网络。
第六方面,本申请一实施例提供一种电子设备,包括:存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现一种抠图方法,所述抠图方法包括:
获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述 待抠图图像的前景分割;
其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
第七方面,本申请一实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现一种抠图网络训练方法,所述抠图网络训练方法包括:
获取训练样本集合和初始网络;
其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;
所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
利用所述训练样本集合,训练所述初始网络得到抠图网络。
第八方面,本申请一实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现一种抠图方法,所述抠图方法包括:
获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述 待抠图图像的前景分割;
其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
第九方面,本申请一实施例提供了一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行如第一方面或第一方面任一实现方式所述的抠图网络训练方法,或执行如第二方面所述的抠图方法。
应理解,第二方面至第七方面的有益效果可以参见第一方面及第一方面的实现方式的相关描述,此处不再赘述。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一实施例提供的一种抠图网络训练方法的实现流程示意图;
图2是本申请一实施例提供的一种抠图模型的结构示意图;
图3是本申请一实施例提供的来自F H层的单通道热图;
图4是本申请一实施例提供的一种真实世界图像集的示意图;
图5是不同模型在Composition-1k数据集上的速度和精度水平的比较结果;
图6是在Composite-1k测试集上不同方法之间的定性比较结果示意图。
图6是本申请一实施例提供的一种抠图方法的实现流程示意图;
图7是本申请一实施例提供的方法与BM方法在真实世界图像上的比较结 果示意图;
图8是本申请一实施例提供的一种抠图网络训练装置的结构示意图;
图9是本申请一实施例提供的一种抠图装置的结构示意图;
图10是本申请一实施例提供的一种电子设备的结构示意图。
具体实施方式
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。
在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
在本申请说明书中描述的“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
此外,在本申请的描述中,“多个”的含义是两个或两个以上。术语“第一”和“第二”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
为了说明本申请所述的技术方案,下面通过具体实施例来进行说明。
图1是本申请一实施例提供的一种抠图网络训练方法的实现流程示意图,本实施例中的抠图网络训练方法可由电子设备执行。电子设备包括但不限于计算机、平板电脑、服务器、手机、相机或可穿戴设备等。其中,服务器包括但不 限于独立服务器或云服务器等。如图1所示,抠图网络训练方法可以包括步骤S110至步骤S120。
S110,获取训练样本集合和初始网络。
S120,利用训练样本集合,训练初始网络得到抠图网络。
初始网络预先存储在电子设备中,作为待训练的网络模型。初始网络包含一组待学习的网络参数。初始网络经训练后得到抠图网络,抠图网络可以具备与初始网络相同的网络结构,也可以具备比初始网络简单的网络结构,两者网络参数不一样。
初始网络(或抠图网络)可以包括基于深度学习的神经网络模型。例如ResNet或VGG等主干网络。
需要说明的是,当前的背景抠图网络都具有较大的冗余性,因为通常采用像ResNet或VGG等主干网络,这些网络最初是为高度依赖语义的图像分类任务而设计的,因此这些网络普遍会下采样5次来提取强语义特征。然而,本申请实施例由于有软分割充当抠图的先验特征,背景抠图问题成为一个语义依赖性较小而结构依赖性较高的任务,因此这些网络在一定程度上有较大冗余。
在一个实施例中,初始网络(或抠图网络)采用轻量的逐级精细化网络(Lightweight Refinement Network,LRN)。初始网络(或抠图网络)包括一个阶段网络(stage)或多个串联的阶段网络(stage)。该初始网络使用了包括前景的RGB图像I、RGB背景图像B和前景的软分割S作为先验,可以用深度图像生成软分割,并通过具体的网络设计实现轻量化。模型输出经过多次逐级精细化,可以实现更可靠的前景分割预测。
作为一实现方式，初始网络可以定义为
$$\alpha=\mathcal{F}(I,B,S;\ \theta)$$
其中包含三个输入图像：RGB图像$I$、RGB背景图像$B$以及前景的软分割$S$；$\theta$表示在训练中要确定的网络参数（此处的函数记号$\mathcal{F}$仅为行文方便而设）。初始网络的网络结构如图2所示，包括一个阶段网络（stage）或多个串联的阶段网络（stage），一个stage包括环境组合模块（Context Combining Module，CCM）、主干区块stem和预测模块predictor。3个输入图像输入CCM模块，CCM模块用于经过特征交换后输出低阶特征和高阶特征；主干区块stem优选包括特征融合模块（Feature Fusion Module，FFM），用于基于注意力机制融合低阶特征和高阶特征得到融合特征，预测模块用于根据融合特征输出预测前景分割。
具体地，利用CCM模块对上述三个输入图像进行特征交换产生两个输出特征，一个低阶特征$F_L$和一个高阶特征$F_H$。先对每个输入图像对应的低阶特征单独编码为特征$F_{1I}=E_{1I}(I)$、$F_{1B}=E_{1B}(B)$和$F_{1S}=E_{1S}(S)$，然后将它们拼接以产生一个整体的低阶特征$F_L=Cat(F_{1I},F_{1B},F_{1S})$，其中$Cat$表示串联操作。在另一分支中，每个低阶特征被进一步下采样到单个高阶特征$F_{2I}=E_{2I}(F_{1I})$、$F_{2B}=E_{2B}(F_{1B})$和$F_{2S}=E_{2S}(F_{1S})$。通过将图像特征与其余2个特征融合，得到$F_{IS}=C_{IS}(Cat(F_{2I},F_{2S}))$和$F_{IB}=C_{IB}(Cat(F_{2I},F_{2B}))$，并用$F_H=C_{ISB}(Cat(F_{IB},F_{IS}))$得到整体的高阶特征。在网络的主干区块stem，先由编码器$E_3$对高阶特征$F_H$进行下采样，再由解码器$D_3$进行上采样，然后用特征融合模块对其进行融合，得到特征$F_2=FFM(F_H,D_3(E_3(F_H)))$。进一步将低阶特征$F_L$合并到主干区块中，得到融合特征$F_1=Cat(F_L,D_2(F_2))$。最后，用$\alpha=D_1(F_1)$得到预测的前景分割，以便后续利用前景分割与预设的背景图像进行图像合成。
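为便于理解上述各特征之间的组合关系，下面给出单个阶段网络前向计算的一个示意性代码草图（基于PyTorch）。其中的通道数、卷积核尺寸、上下采样方式等均为本文为举例而假设的取值，并非对本申请具体实现方式的限定；特征融合模块FFM在此暂以"拼接+卷积"近似，基于注意力机制的FFM示意见后文。

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # 假设的基础卷积块：卷积 + BN + ReLU
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class StageSketch(nn.Module):
    """单个阶段网络（CCM + 主干区块stem + 预测模块）的示意实现，c为假设的基础通道数。"""
    def __init__(self, c=16):
        super().__init__()
        # CCM：对 I / B / S 分别编码低阶特征 F_1I, F_1B, F_1S（此处假设各下采样1次）
        self.e1_i = conv_block(3, c, stride=2)
        self.e1_b = conv_block(3, c, stride=2)
        self.e1_s = conv_block(1, c, stride=2)
        # CCM：进一步下采样得到 F_2I, F_2B, F_2S
        self.e2_i = conv_block(c, c, stride=2)
        self.e2_b = conv_block(c, c, stride=2)
        self.e2_s = conv_block(c, c, stride=2)
        # CCM：两两后融合 C_IS、C_IB，再融合得到整体高阶特征 F_H
        self.c_is = conv_block(2 * c, c)
        self.c_ib = conv_block(2 * c, c)
        self.c_isb = conv_block(2 * c, 2 * c)
        # 主干区块stem：E3 下采样（第3次）、D3 上采样
        self.e3 = conv_block(2 * c, 4 * c, stride=2)
        self.d3 = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                conv_block(4 * c, 2 * c))
        # 此处以"拼接+卷积"近似特征融合模块FFM，基于注意力的FFM见后文示意
        self.ffm = conv_block(4 * c, 2 * c)
        self.d2 = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                conv_block(2 * c, c))
        # 预测模块predictor：上采样回原分辨率并输出单通道前景分割
        self.d1 = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                nn.Conv2d(4 * c, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, img, bg, soft_seg):
        f1_i, f1_b, f1_s = self.e1_i(img), self.e1_b(bg), self.e1_s(soft_seg)
        f_l = torch.cat([f1_i, f1_b, f1_s], dim=1)                      # 低阶特征 F_L
        f2_i, f2_b, f2_s = self.e2_i(f1_i), self.e2_b(f1_b), self.e2_s(f1_s)
        f_is = self.c_is(torch.cat([f2_i, f2_s], dim=1))                # F_IS
        f_ib = self.c_ib(torch.cat([f2_i, f2_b], dim=1))                # F_IB
        f_h = self.c_isb(torch.cat([f_ib, f_is], dim=1))                # 高阶特征 F_H
        f2 = self.ffm(torch.cat([f_h, self.d3(self.e3(f_h))], dim=1))   # F_2
        f1 = torch.cat([f_l, self.d2(f2)], dim=1)                       # F_1
        return self.d1(f1)                                               # 预测前景分割 alpha
```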
上述过程称为一个stage。为了预测更精细的结构细节,在本申请一些实施例中可以采用另一个stage对前一个stage的输出做进一步的精细化,用上一个stage的预测前景分割作为下一个stage的先验软分割,而上一个stage输入的RGB图像I、RGB背景图像B继续作为下一个stage的先验。这个过程可以重复多次,形成串联的网络结构。为了清晰地表示网络体系结构,用C、B和S来表示卷积层中的通道(channel)数、一个残差块中的卷积区块(block)数和阶段(stage)数,进一步将网络表示为LRN-C-B-S。例如,LRN-16-4-3代表由16通道卷积层、4个区块和3个阶段构建的LRN,LRN的优点是可以很容易地通过调整C、B和S来平衡精度和速度。
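在上述示意的基础上，多个stage的串联（即LRN-C-B-S中的S个阶段）可大致表示如下，其中StageSketch指前文示意代码中的单阶段网络，相关写法仅为帮助理解的假设性示例：

```python
import torch.nn as nn

class LRNSketch(nn.Module):
    """多阶段串联的示意：上一stage输出的前景分割作为下一stage的软分割先验。"""
    def __init__(self, c=16, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList([StageSketch(c=c) for _ in range(num_stages)])

    def forward(self, img, bg, soft_seg):
        alpha = soft_seg
        for stage in self.stages:
            alpha = stage(img, bg, alpha)   # I、B 保持不变，软分割先验逐级精细化
        return alpha
```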
需要说明的是，图2提供了一个更轻量的主干，即stage，仅需进行3次下采样。输入了一个简单的前景软分割，它是一个通过前景深度图像（例如RGB图像I对应的深度图像）减去背景深度图像（例如RGB背景图像B对应的深度图像）得到的二值图像。软分割作为一个简单的现成特征，同时提供了被抠取对象的注意力机制，因此使用较少的权重就能实现特征的提取。此外，阶段网络采用CCM模块来交换图像+软分割、图像+背景和图像+软分割+背景的上下文信息，充分提取边界特征。由于图像和trimap的后融合(late fusion)比早融合(early fusion)对抠图网络的特征提取更加有效，本示例拓展到3个输入，CCM模块对单个输入图像进行两两后融合后，再对融合后的特征做进一步的后融合，使得不同输入的特征信息实现更加有效的匹配和学习。另外，CCM模块利用较少的卷积层来提取单个输入的特征，然后将它们拼接起来。通过这样的设计，CCM模块比对应的ResNet块模块更轻量化，因为在串联之前引入了更少的卷积通道。
此外，使用负责特征融合的FFM模块，以取代传统的拼接操作。FFM模块利用注意力机制来实现更好的编码器特征和解码器特征的融合。在编码器中，结构特征经过层层提取形成高阶的语义特征，语义特征有利于模型利用更广阔的感受野(receptive field)范围来判断分割边缘的位置。比如当前景和背景颜色相近时（如黑色人体头部和黑色背景），直接根据局部的结构信息很难判断边界（头部和黑色背景的边界），但是利用高阶的语义信息可以通过图像识别的经验信息（如人的头部通常是圆形）来辅助分割位置的确定。FFM模块将编码器的高阶语义特征转化为空间注意力蒙版，用于指导解码器中结构信息的还原。由于只有高阶的语义特征才能提供准确的空间注意力，因此来自编码器的低阶特征并不适合采用FFM模块。所以在网络设计时仅将FFM模块应用于内部的跳跃连接(inner skip connection)，而并不应用于外部跳跃连接(outer skip connection)。
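作为对FFM模块的一个参考性说明，下面给出一种基于空间注意力的特征融合示意代码（基于PyTorch）。其中注意力的具体形式、通道配置等均为本文假设的一种可能做法，并非本申请所限定的实现。

```python
import torch
import torch.nn as nn

class FFMSketch(nn.Module):
    """特征融合模块FFM的一种示意实现：
    由编码器高阶语义特征生成空间注意力蒙版，指导解码器特征的融合。"""
    def __init__(self, c_enc, c_dec, c_out):
        super().__init__()
        # 由编码器高阶特征生成单通道空间注意力蒙版
        self.attn = nn.Sequential(
            nn.Conv2d(c_enc, max(c_enc // 2, 1), 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c_enc // 2, 1), 1, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(c_enc + c_dec, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, f_enc, f_dec):
        mask = self.attn(f_enc)                 # 空间注意力蒙版，取值(0,1)
        f_dec = f_dec * mask                    # 用语义注意力指导解码器结构信息的还原
        return self.fuse(torch.cat([f_enc, f_dec], dim=1))
```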
由此可见，该网络从三个方面进行了轻量化设计。一是网络深度较浅。与ResNet和GoogLeNet等传统的主干网络不同，它们对输入进行5次下采样以提取丰富的语义线索；由于采用软分割先验，背景抠图变成一个对语义依赖较少但对结构信息依赖较多的任务，因此网络不必太深。只对输入进行3次下采样，得到的语义信息就已经足够，并保留了丰富的结构线索。二是通道较少。因为背景抠图并不是一个分类任务，因此一个通道可以为多个对象服务。例如，图3给出了来自$F_H$层的单通道热图。可以注意到，该通道不加区别地捕获不同类别的前景。此外，输入的软分割提供了现成的特征，因此信息提取所需的通道更少。第三，CCM模块比其对应的残差网络更轻量化。经过比对试验证明，若将每个特征串联并利用残差模块提取高阶特征$F_H$，这种方法与使用CCM模块相比，产生了1.8G FLOPs的额外运算和0.3M的额外参数，并且模型性能变得比使用CCM模块更差。
在一些实施例中，训练样本集合包括多个训练样本，每个训练样本包括输入图像样本及其标注(ground truth)，输入图像样本包括3个，即具备前景的待抠图图像样本I、背景图像样本B以及前景的软分割样本S。标注$\alpha^{*}$可以为人工标注的ground truth前景分割，例如，标注包括待抠图图像样本对应的标准透明度蒙版。在这些实施例中，使用的是带标注的训练样本，因而步骤S120包括：在训练样本集合上对初始网络进行有监督的训练得到抠图网络。
作为这些实施例的一实现方式,利用包含493个前景对象的Adobe数据集训练初始网络,并且创建一个合成数据集。待抠图图像样本可以从Adobe数据集中选择非透明对象(比如剔除玻璃制品等),或者,进一步的,还可以对其采用裁剪、旋转、翻转和添加高斯噪声等方法中的一种或多种的组合进行随机扩充。背景图像样本可以从MS COCO数据集中随机抽取,并通过伽马校正和添加高斯噪声等方法中的一种或多种的组合进行扩充,以避免对背景的固定值产生较强的依赖性。前景的软分割样本可以利用待抠图图像样本对应的深度图像减去背景图像样本对应的深度图像生成,例如,待抠图图像样本对应的深度图像减去背景图像样本对应的深度图像所获得的二值图像。或者,输入的前景的软分割样本可以由ground truth前景分割进行腐蚀、膨胀和模糊等操作中的一种或多种的组合来模拟有缺陷的现实世界分割。
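作为对上述样本构造方式的示意，下面给出软分割样本生成与样本扩充的参考代码（基于Python与OpenCV）。其中深度差阈值、核尺寸范围、伽马取值范围和噪声强度等数值均为本文假设的示例取值。

```python
import cv2
import numpy as np

def make_soft_segmentation(depth_fg, depth_bg, thresh=0.05):
    # 软分割样本：待抠图图像对应的深度图减去背景深度图，阈值化得到二值图
    # thresh 为假设的深度差阈值（单位与深度图一致）
    diff = np.abs(depth_fg.astype(np.float32) - depth_bg.astype(np.float32))
    return (diff > thresh).astype(np.float32)

def degrade_soft_segmentation(alpha_gt, max_kernel=15):
    # 用ground truth前景分割模拟有缺陷的现实世界软分割：随机腐蚀/膨胀/模糊
    k = np.random.randint(3, max_kernel) | 1          # 随机奇数核尺寸
    kernel = np.ones((k, k), np.uint8)
    seg = (alpha_gt > 0.5).astype(np.uint8)
    if np.random.rand() < 0.5:
        seg = cv2.erode(seg, kernel)
    else:
        seg = cv2.dilate(seg, kernel)
    return cv2.GaussianBlur(seg.astype(np.float32), (k, k), 0)

def augment_background(bg_rgb):
    # 背景图像样本扩充：伽马校正 + 高斯噪声，避免网络对背景固定取值的依赖
    gamma = np.random.uniform(0.7, 1.3)
    out = np.power(bg_rgb.astype(np.float32) / 255.0, gamma)
    out = out + np.random.normal(0, 0.02, out.shape)
    return np.clip(out, 0.0, 1.0)
```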
在合成数据集(Synthetic Dataset)上的监督训练任务可以定义为：更新网络参数$\theta$，以降低损失函数$L$，即
$$\min_{\theta}\ L\big(\mathcal{F}(I,B,S;\ \theta),\ \alpha^{*}\big)$$
其中$\alpha^{*}$为训练样本的标注，即ground truth前景分割。
在一个实施例中，利用包含多种不同损失函数的混合损失函数来训练网络，例如包含均方误差（MSE）损失、结构相似性（SSIM）损失和交并比（IoU）损失的混合损失函数。MSE损失函数是用于分割监督的常规像素回归损失。SSIM损失函数对均值和标准差施加约束，来更好地预测结构一致性。在图像分割任务中通常使用的IoU损失函数更注重全局结构的优化。SSIM损失函数用于预测更精细的边界，而IoU损失函数用于预测更完整的前景。由于使用了混合损失函数，能够更精确地检测出前景和边界。在一个实施例中，采用三个不同的损失函数的加权作为混合损失函数，或称为联合损失函数，其定义为：
$$L=\lambda_{1}L_{MSE}+\lambda_{2}L_{SSIM}+\lambda_{3}L_{IoU}$$
其中，$\lambda_{1}$、$\lambda_{2}$、$\lambda_{3}$为三个不同损失函数各自的权重系数。在一个实施例中，三个损失函数的权重系数可以分配为$\lambda_{1}=2$，$\lambda_{2}=2$，$\lambda_{3}=5$。$L_{MSE}$为MSE损失，$L_{MSE}$定义为：
$$L_{MSE}=\frac{1}{HW}\sum_{i,j}\left(\alpha_{i,j}-\alpha^{*}_{i,j}\right)^{2}$$
其中，$H$、$W$分别表示图像高度和宽度；$\alpha_{i,j}$和$\alpha^{*}_{i,j}$表示预测的和先验的前景分割。$L_{SSIM}$为SSIM损失，$L_{SSIM}$定义为：
$$L_{SSIM}=1-\frac{\left(2\mu\mu^{*}+c_{1}\right)\left(2\sigma\sigma^{*}+c_{2}\right)}{\left(\mu^{2}+\mu^{*2}+c_{1}\right)\left(\sigma^{2}+\sigma^{*2}+c_{2}\right)}$$
其中，$\mu$、$\sigma$和$\mu^{*}$、$\sigma^{*}$是$\alpha_{i,j}$和$\alpha^{*}_{i,j}$的均值和偏差。常数$c_{1}=0.01^{2}$和$c_{2}=0.03^{2}$用于避免被零除。$L_{IoU}$为IoU损失，$L_{IoU}$定义为：
$$L_{IoU}=1-\frac{\sum_{i,j}\left(1+\gamma\theta_{i,j}\right)\alpha_{i,j}\alpha^{*}_{i,j}}{\sum_{i,j}\left(1+\gamma\theta_{i,j}\right)\left(\alpha_{i,j}+\alpha^{*}_{i,j}-\alpha_{i,j}\alpha^{*}_{i,j}\right)}$$
其中，参数$\gamma$可以设为5，$\theta_{i,j}$为像素$(i,j)$的难度指数，可通过下式确定：
$$\theta_{i,j}=\left|\alpha^{*}_{i,j}-\frac{1}{\left|A_{i,j}\right|}\sum_{(k,l)\in A_{i,j}}\alpha^{*}_{k,l}\right|$$
其中，$A_{i,j}$表示像素$(i,j)$的相邻像素。
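为便于理解上述混合损失函数的组合方式，下面给出一个示意性实现（基于PyTorch）。其中SSIM的局部窗口大小、难度指数所用的邻域范围以及数值稳定项等均为本文假设的取值，仅作为一种可能的参考写法。

```python
import torch
import torch.nn.functional as F

def mixed_loss(alpha_pred, alpha_gt, w_mse=2.0, w_ssim=2.0, w_iou=5.0, gamma=5.0):
    """混合损失示意：L = λ1·MSE + λ2·SSIM + λ3·IoU，输入为(N,1,H,W)的前景分割张量。"""
    # MSE 损失
    l_mse = F.mse_loss(alpha_pred, alpha_gt)

    # SSIM 损失：用11x11均值池化近似局部均值与标准差（窗口大小为假设取值）
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    pool = lambda x: F.avg_pool2d(x, 11, stride=1, padding=5)
    mu_p, mu_g = pool(alpha_pred), pool(alpha_gt)
    var_p = (pool(alpha_pred ** 2) - mu_p ** 2).clamp(min=0)
    var_g = (pool(alpha_gt ** 2) - mu_g ** 2).clamp(min=0)
    sigma_p, sigma_g = (var_p + 1e-6).sqrt(), (var_g + 1e-6).sqrt()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * sigma_p * sigma_g + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (sigma_p ** 2 + sigma_g ** 2 + c2))
    l_ssim = (1 - ssim).mean()

    # IoU 损失：难度指数取像素与其邻域均值之差的绝对值（边界附近权重更高）
    theta = (alpha_gt - F.avg_pool2d(alpha_gt, 7, stride=1, padding=3)).abs()
    w = 1 + gamma * theta
    inter = (w * alpha_pred * alpha_gt).sum(dim=(2, 3))
    union = (w * (alpha_pred + alpha_gt - alpha_pred * alpha_gt)).sum(dim=(2, 3))
    l_iou = (1 - inter / (union + 1e-6)).mean()

    return w_mse * l_mse + w_ssim * l_ssim + w_iou * l_iou
```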
为了弥补合成数据与真实数据之间的差异,在本申请其他一些实施例中,除了利用带标注的合成图像进行有监督的训练,还可以利用未标注的真实图像进行无监督的知识蒸馏。
此时,训练样本集合包括多个带标注的训练样本和多个无标注的训练样本,每个带标注的训练样本包括输入图像样本及其标注;每个无标注的训练样本包括输入图像样本。需要说明的是,在这些实施例中输入图像样本也包括3个,即具备前景的待抠图图像样本I、背景图像样本B以及前景的软分割样本S。在这些实施例中,使用的是带标注和无标注的训练样本,即混合数据集,因而步骤S120包括:利用多个带标注的训练样本,对初始网络进行有监督的训练后,再利用多个无标注的训练样本进行无监督的知识蒸馏,得到抠图网络。
作为这些实施例的一实现方式,创建一个真实世界的人持物数据集,包括1259幅标记图像作为测试集,11255幅未标记图像作为知识蒸馏训练集。所有的图像都是用深度相机录制。如图4所示为背景和前景的RGB图像和深度图像,为真实世界图像数据集,从左上到右下分别是深度背景,深度图像,软分割,彩色背景,彩色图像,ground truth前景分割。其中,软分割是通过从图像深度中减去背景深度得到的二值图像。1259张被标注的图片来自11个场景,平均每个场景包含2.5个人,每个人用1至3个姿势展示30多件商品,该数据集使得能够在真实世界的数据集上定性地评估算法。
利用包含10000张带标注的合成图像和11255张未标记的真实世界图像的混合数据集，在混合数据集上同时进行有监督的训练和无监督的知识蒸馏。在合成数据集上训练的网络被用作教师模型，该网络可为ResNet或VGG等复杂的网络模型。对于带标签的数据，用
$$L\big(\mathcal{F}_{S}(I,B,S;\ \theta),\ \alpha^{*}\big)$$
来进行训练；而对于未标记的数据，用
$$L\big(\mathcal{F}_{S}(I,B,S;\ \theta),\ \mathcal{F}_{T}(I,B,S)\big)$$
来进行蒸馏学习。其中，$\mathcal{F}_{T}$表示在合成数据集上训练的教师网络，$\mathcal{F}_{S}$表示学生网络，学生网络可为本申请的轻量级抠图网络，$L$为混合损失函数或联合损失函数。
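下面给出混合数据集上"有监督训练+无监督知识蒸馏"流程的一个示意性代码草图（基于PyTorch）。其中的数据加载方式、教师模型接口以及mixed_loss（指前文示意的混合损失实现）均为本文为举例而作的假设。

```python
import torch

def train_mixed(student, teacher, labeled_loader, unlabeled_loader,
                epochs=100, lr=1e-3, device="cuda"):
    """混合训练示意：带标注的合成数据做有监督训练，未标注的真实数据用教师输出做蒸馏。"""
    teacher.eval()                                  # 教师网络参数固定
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for (img, bg, seg, alpha_gt), (img_u, bg_u, seg_u) in zip(labeled_loader, unlabeled_loader):
            img, bg, seg, alpha_gt = [t.to(device) for t in (img, bg, seg, alpha_gt)]
            img_u, bg_u, seg_u = [t.to(device) for t in (img_u, bg_u, seg_u)]
            # 有监督项：L(F_S(I,B,S;θ), α*)
            loss_sup = mixed_loss(student(img, bg, seg), alpha_gt)
            # 蒸馏项：L(F_S(I,B,S;θ), F_T(I,B,S))，教师输出作为伪标签
            with torch.no_grad():
                pseudo = teacher(img_u, bg_u, seg_u)
            loss_distill = mixed_loss(student(img_u, bg_u, seg_u), pseudo)
            loss = loss_sup + loss_distill
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```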
本申请另一实施例提供了一种抠图方法。抠图方法可以应用于电子设备,电子设备提前部署有抠图网络。在一些实施例中,抠图网络可以采用未经训练的初始网络。在其他一些实施例中,为了提高抠图的精度,抠图网络可以是初始网络,抠图网络亦可以采用前述实施例的方法进行训练得到。抠图网络包括至少一个阶段网络;所述阶段网络包括串联的CCM模块、主干区块stem和预测模块predictor。在使用抠图网络对待抠图图像进行背景抠图时,先获取待输入的三个图像,三个图像包括:包括前景的待抠图图像、背景图像和前景的软分割;将三个图像输入抠图网络输出待抠图图像的前景分割。具体地,三个图像输入CCM模块,CCM模块用于经过特征交换后输出低阶特征和高阶特征,主干区块用于基于注意力机制融合低阶特征和高阶特征得到融合特征,预测模块用于根据融合特征输出前景分割。
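作为抠图方法使用方式的一个示意，下面给出推理阶段的参考代码（基于PyTorch），其中matting_net指已训练好的抠图网络，函数与参数命名均为本文假设：

```python
import torch

@torch.no_grad()
def matting_inference(matting_net, image, background, soft_seg, device="cuda"):
    # 抠图推理示意：输入待抠图图像I、背景图像B与前景软分割S，输出前景分割alpha
    matting_net.eval().to(device)
    alpha = matting_net(image.to(device), background.to(device), soft_seg.to(device))
    # 后续可用alpha与预设的新背景进行图像合成：composite = alpha*I + (1-alpha)*new_bg
    return alpha.clamp(0.0, 1.0)
```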
应理解,采用抠图网络进行背景抠图的过程可以参照前述抠图网络训练过程的相关描述,此处不再赘述。
本申请实施例提出一个轻量化实时背景抠图网络。对网络进行了较浅的结构设计,同时提出了两个网络模块,FFM模块可以实现更好的高层特征融合,CCM模块相比对应的传统的残差模块更加轻量化,有利于上下文信息的融合过程。这两个模块都在一定程度上提高了精度。为了实现更好的边界预测和前景预测,引入了一种混合损失函数,该函数综合了MSE、SSIM和IoU损失的优点。创建了包含1259幅标记图像和11255幅未标记图像的真实世界数据集,用于定量评估和知识蒸馏。在合成数据集和真实数据集上的实验表明,该方法在PC(111FPS)和Amlogic A311D芯片(45FPS)上均取得了实时的性能表现。
基于本申请实施例提供的方法进行实验，使用学习率为$10^{-3}$的Adam优化器，用26900幅合成图像训练LRN-32-4-4模型。之所以选择LRN-32-4-4，是因为它可以很好地平衡精度和速度。在4个RTX2080ti GPU上，采用batchsize=16和512×512的输入分辨率对模型进行了100轮的训练。采用一个由1000个合成图像组成的测试数据集（Adobe的测试集，也称为Composite-1k）评估模型性能。在合成数据集上对LRN-32-4-4模型进行监督训练之后，再将训练好的LRN-32-4-4模型进行蒸馏学习，获得一个更轻量级的LRN-16-4-3模型，蒸馏学习的参数设置与监督学习相同。
在实验中使用了4个指标$MSE_t$、$SAD_t$、$MSE_e$和$SAD_e$来评估模型精度。MSE和SAD分别代表均方误差和绝对加和误差。下标"t"和"e"分别表示在trimap区域和整个图像中统计评估误差。之前的研究仅使用$MSE_t$和$SAD_t$指标，这对基于trimap的方法是足够的，因为前景区域是已知的。然而，对于无trimap的方法需要同时预测前景和未知区域，引入$MSE_e$和$SAD_e$指标来得到一个更完善的评估。在Composition-1k数据集上，将本申请实施例的方法与其他4种基于学习的模型进行了比较，包括基于trimap的CAM和DIM，以及无trimap的LFM和BM。在真实数据集上，还将本申请实施例提供的模型与CAM、DIM和BM模型进行了对比。需要说明的是，此处未与传统方法进行比较，因为它们已经被证明远不如基于学习的方法精确。
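为便于理解上述评估指标的统计方式，下面给出MSE与SAD指标计算的一个示意性实现（基于NumPy）。其中trimap未知区域的取值约定以及SAD的数值尺度均为本文假设的常见做法。

```python
import numpy as np

def matting_metrics(alpha_pred, alpha_gt, trimap=None):
    """评估指标示意：下标t表示仅在trimap未知区域内统计，下标e表示在整幅图像上统计。
    此处假设trimap中取值128表示未知区域，SAD按惯例以千为单位报告。"""
    err = alpha_pred.astype(np.float64) - alpha_gt.astype(np.float64)
    metrics = {
        "MSE_e": float(np.mean(err ** 2)),
        "SAD_e": float(np.sum(np.abs(err)) / 1000.0),
    }
    if trimap is not None:
        unknown = (trimap == 128)
        metrics["MSE_t"] = float(np.mean(err[unknown] ** 2))
        metrics["SAD_t"] = float(np.sum(np.abs(err[unknown])) / 1000.0)
    return metrics
```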
具体地，实验时在Composite-1k测试集上评估了前景分割误差，FLOPs(基于288×288分辨率)和参数数量Param.。图5所示为不同模型在Composition-1k数据集上的速度和精度水平的比较结果。在Composition-1k数据集上不同模型的误差和速度比较结果，如下表1所示，这里本申请实施例的模型ours采用LRN-32-4-4模型。在真实数据集上不同模型的误差和速度的比较结果，如下表2所示，这里本申请实施例的模型ours采用LRN-16-4-3模型。由于CAM和DIM是基于trimap的方法，因此只有$SAD_t$和$MSE_t$指标。从表1和图5可以看出，本申请实施例提供的模型(LRN-32-4-4)在所有4个指标上都优于其他方法，而且它明显更轻量化。例如，在288×288的输入分辨率下，本申请实施例的方法有13.0G的FLOPs和2.2M的参数，与BM方法相比，FLOPs降低了89.9%，参数数量Param.降低了87.7%。在GTX1060ti GPU上实现了39FPS的模型推理，满足实时推理要求，实时是指推理速度大于30FPS。
表1：在Composition-1k数据集上不同模型的误差和速度比较结果（表格数据以图像形式公开，此处从略）。
表2：在真实数据集上不同模型的误差和速度比较结果（表格数据以图像形式公开，此处从略）。
图6所示为在Composite-1k测试集上不同方法之间的定性比较结果示意图。本申请实施例提供的方法对背景干扰具有较强的鲁棒性。例如,它显示出更好的前景和背景区分能力,能够检测被前景包围的小背景区域。图7所示为本申请一实施例提供的方法与BM方法在真实世界图像上的比较结果示意图。从图7可以看出,BM方法难以检测与背景颜色相同的前景,例如,白墙前面的一个白盒子。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本申请一实施例还提供一种抠图网络训练装置。该抠图网络训练装置中未详细描述之处请详见前述抠图网络训练方法实施例中的描述。
参见图8,图8是本申请一实施例提供的一种抠图网络训练装置的示意框图。所述抠图网络训练装置包括:获取模块81和训练模块82。
其中,获取模块81,用于获取训练样本集合和初始网络;所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
训练模块82,用于利用所述训练样本集合,训练所述初始网络得到抠图网络。
在一些实施例中,所述训练样本集合包括多个带标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注。
训练模块82,具体用于:
在所述训练样本集合上对所述初始网络进行有监督的训练得到抠图网络。
在一些实施例中,所述训练样本集合包括多个带标注的训练样本和多个无标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注;每个无标注的所述训练样本包括所述输入图像样本。
训练模块82,具体用于:
利用多个带标注的训练样本,对所述初始网络进行有监督的训练后,再利用多个无标注的训练样本进行无监督的知识蒸馏,得到抠图网络。
在一些实施例中,所述初始网络包括多个串联的阶段网络;所述输入图像样本作为第一个阶段网络的输入,所述待抠图图像样本、所述背景图像样本和上一个阶段网络输出的预测前景分割作为下一个阶段网络的输入。
在一些实施例中,所述阶段网络包括3次下采样。
在一些实施例中,所述主干区块包括基于注意力机制的特征融合模块。
在一些实施例中,训练模块82采用混合损失函数,混合损失函数包括均方误差损失、结构相似性损失和交并比损失。
本申请一实施例还提供一种抠图装置。该抠图装置中未详细描述之处请详见前述抠图方法实施例中的描述。
参见图9,图9是本申请一实施例提供的一种抠图装置的示意框图。所述抠图装置包括:获取模块91和抠图模块92。
其中,获取模块91,用于获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
抠图模块92,包括抠图网络,抠图模块92用于将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述待抠图图像的前景分割;所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
本申请一实施例还提供了一种电子设备,如图10所示,电子设备可以包括一个或多个处理器100(图10中仅示出一个),存储器101以及存储在存储器101中并可在一个或多个处理器100上运行的计算机程序102,例如,抠图网络训练的程序和/或图像抠图的程序。一个或多个处理器100执行计算机程序102时可以实现抠图网络训练方法和/或抠图方法实施例中的各个步骤。或者,一个或多个处理器100执行计算机程序102时可以实现抠图网络训练装置和/或抠图装置实施例中各模块/单元的功能,此处不作限制。
本领域技术人员可以理解，图10仅仅是电子设备的示例，并不构成对电子设备的限定。电子设备可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件，例如电子设备还可以包括输入输出设备、网络接入设备、总线等。
在一个实施例中,所称处理器100可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
在一个实施例中,存储器101可以是电子设备的内部存储单元,例如电子设备的硬盘或内存。存储器101也可以是电子设备的外部存储设备,例如电子设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,存储器101还可以既包括电子设备的内部存储单元也包括外部存储设备。存储器101用于存储计算机程序以及电子设备所需的其他程序和数据。存储器101还可以用于暂时地存储已经输出或者将要输出的数据。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本申请一实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序被处理器执行时可实现抠图网络训练方法和/或抠图方法实施例中的步骤。
本申请一实施例提供了一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备可实现抠图网络训练方法和/或抠图方法实施例中的步骤。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置/电子设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/电子设备实施例仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请实现上述实施例方法中的全部或部分流程，也可以通过计算机程序来指令相关的硬件来完成，所述计算机程序可存储于一计算机可读存储介质中，该计算机程序在被处理器执行时，可实现上述各个方法实施例的步骤。其中，计算机程序包括计算机程序代码，计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。计算机可读介质可以包括：能够携带计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(read-only memory，ROM)、随机存取存储器(random access memory，RAM)、电载波信号、电信信号以及软件分发介质等。需要说明的是，计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减，例如在某些司法管辖区，根据立法和专利实践，计算机可读介质不包括电载波信号和电信信号。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (17)

  1. 一种抠图网络训练方法,其特征在于,包括:
    获取训练样本集合和初始网络;
    其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;
    所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
    利用所述训练样本集合,训练所述初始网络得到抠图网络。
  2. 如权利要求1所述的抠图网络训练方法,其特征在于,所述训练样本集合包括多个带标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注;
    利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
    在所述训练样本集合上对所述初始网络进行有监督的训练得到抠图网络。
  3. 如权利要求1所述的抠图网络训练方法,其特征在于,所述训练样本集合包括多个带标注的训练样本和多个无标注的训练样本,每个带标注的所述训练样本包括所述输入图像样本及其标注;每个无标注的所述训练样本包括所述输入图像样本;
    利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
    利用多个带标注的训练样本,对所述初始网络进行有监督的训练后,再利用多个无标注的训练样本进行无监督的知识蒸馏,得到抠图网络。
  4. 如权利要求1至3任一项所述的抠图网络训练方法,其特征在于,所述初始网络包括多个串联的阶段网络;所述输入图像样本作为第一个阶段网络的输入,所述待抠图图像样本、所述背景图像样本和上一个阶段网络输出的预测前景分割作为下一个阶段网络的输入。
  5. 如权利要求1至3任一项所述的抠图网络训练方法,其特征在于,所述阶段网络包括3次下采样。
  6. 如权利要求1至3任一项所述的抠图网络训练方法,其特征在于,训练采用混合损失函数,所述混合损失函数包括均方误差损失、结构相似性损失和交并比损失。
  7. 一种抠图方法,其特征在于,包括:
    获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
    将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述待抠图图像的前景分割;
    其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
  8. 一种抠图网络训练装置,其特征在于,包括:
    获取模块,用于获取训练样本集合和初始网络;所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组 合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
    训练模块,用于利用所述训练样本集合,训练所述初始网络得到抠图网络。
  9. 一种抠图装置,其特征在于,包括:
    获取模块,用于获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
    抠图模块,包括抠图网络,所述抠图模块用于将所述待抠图图像、所述背景图像以及所述软分割输入所述抠图网络,输出所述待抠图图像的前景分割;
    其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
  10. 一种电子设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现一种抠图网络训练方法,所述抠图网络训练方法包括:
    获取训练样本集合和初始网络;
    其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;
    所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用 于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
    利用所述训练样本集合,训练所述初始网络得到抠图网络。
  11. 如权利要求10所述的电子设备，其特征在于，所述训练样本集合包括多个带标注的训练样本，每个带标注的所述训练样本包括所述输入图像样本及其标注；
    利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
    在所述训练样本集合上对所述初始网络进行有监督的训练得到抠图网络。
  12. 如权利要求10所述的电子设备，其特征在于，所述训练样本集合包括多个带标注的训练样本和多个无标注的训练样本，每个带标注的所述训练样本包括所述输入图像样本及其标注；每个无标注的所述训练样本包括所述输入图像样本；
    利用所述训练样本集合,训练所述初始网络得到抠图网络,包括:
    利用多个带标注的训练样本,对所述初始网络进行有监督的训练后,再利用多个无标注的训练样本进行无监督的知识蒸馏,得到抠图网络。
  13. 如权利要求10至12任一项所述的电子设备，其特征在于，所述初始网络包括多个串联的阶段网络；所述输入图像样本作为第一个阶段网络的输入，所述待抠图图像样本、所述背景图像样本和上一个阶段网络输出的预测前景分割作为下一个阶段网络的输入。
  14. 如权利要求10至12任一项所述的电子设备，其特征在于，训练采用混合损失函数，所述混合损失函数包括均方误差损失、结构相似性损失和交并比损失。
  15. 一种电子设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现一种抠图方法,所述抠图方法包括:
    获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
    将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述待抠图图像的前景分割;
    其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
  16. 一种计算机存储介质,所述计算机存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现一种抠图网络训练方法,所述抠图网络训练方法包括:
    获取训练样本集合和初始网络;
    其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括输入图像样本,所述输入图像样本包括具备前景的待抠图图像样本、背景图像样本以及所述前景的软分割样本,所述软分割样本利用所述待抠图图像样本对应的深度图像减去所述背景图像样本对应的深度图像生成;
    所述初始网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述输入图像样本输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出预测前景分割;
    利用所述训练样本集合,训练所述初始网络得到抠图网络。
  17. 一种计算机存储介质,所述计算机存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现一种抠图方法,所述抠图方法包括:
    获取包括前景的待抠图图像、背景图像以及所述前景的软分割;
    将所述待抠图图像、所述背景图像以及所述软分割输入抠图网络,输出所述待抠图图像的前景分割;
    其中,所述抠图网络包括至少一个阶段网络;所述阶段网络包括串联的环境组合模块、主干区块和预测模块,所述待抠图图像、所述背景图像以及所述前景的软分割输入所述环境组合模块,所述环境组合模块用于经过特征交换后输出低阶特征和高阶特征,所述主干区块用于基于注意力机制融合所述低阶特征和所述高阶特征得到融合特征,所述预测模块用于根据所述融合特征输出前景分割。
PCT/CN2021/130122 2021-08-09 2021-11-11 一种抠图网络训练方法及抠图方法 WO2023015755A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/375,264 US20240029272A1 (en) 2021-08-09 2023-09-29 Matting network training method and matting method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110910316.4A CN114038006A (zh) 2021-08-09 2021-08-09 一种抠图网络训练方法及抠图方法
CN202110910316.4 2021-08-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/375,264 Continuation US20240029272A1 (en) 2021-08-09 2023-09-29 Matting network training method and matting method

Publications (1)

Publication Number Publication Date
WO2023015755A1 true WO2023015755A1 (zh) 2023-02-16

Family

ID=80139780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130122 WO2023015755A1 (zh) 2021-08-09 2021-11-11 一种抠图网络训练方法及抠图方法

Country Status (3)

Country Link
US (1) US20240029272A1 (zh)
CN (1) CN114038006A (zh)
WO (1) WO2023015755A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167922A (zh) * 2023-04-24 2023-05-26 广州趣丸网络科技有限公司 一种抠图方法、装置、存储介质及计算机设备
CN116740510A (zh) * 2023-03-20 2023-09-12 北京百度网讯科技有限公司 图像处理方法、模型训练方法及装置
CN116862931A (zh) * 2023-09-04 2023-10-10 北京壹点灵动科技有限公司 医学图像分割方法、装置、存储介质及电子设备
CN117351118A (zh) * 2023-12-04 2024-01-05 江西师范大学 一种结合深度信息的轻量化固定背景抠像方法及系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114529574B (zh) * 2022-02-23 2024-07-12 平安科技(深圳)有限公司 基于图像分割的图像抠图方法、装置、计算机设备及介质
CN114926491B (zh) * 2022-05-11 2024-07-09 北京字节跳动网络技术有限公司 一种抠图方法、装置、电子设备及存储介质
CN118411431A (zh) * 2023-01-30 2024-07-30 腾讯科技(深圳)有限公司 一种图像生成方法及相关装置
CN117252892B (zh) * 2023-11-14 2024-03-08 江西师范大学 基于轻量化视觉自注意力网络的双分支人像自动抠图装置
CN118297991B (zh) * 2024-03-19 2024-10-18 北京市遥感信息研究所 一种遥感图像配准模型训练方法及训练系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712145A (zh) * 2018-11-28 2019-05-03 山东师范大学 一种图像抠图方法及系统
CN111223106A (zh) * 2019-10-28 2020-06-02 稿定(厦门)科技有限公司 全自动人像蒙版抠图方法及系统
US20200175700A1 (en) * 2018-11-29 2020-06-04 Adobe Inc. Joint Training Technique for Depth Map Generation
CN112446380A (zh) * 2019-09-02 2021-03-05 华为技术有限公司 图像处理方法和装置
CN113052868A (zh) * 2021-03-11 2021-06-29 奥比中光科技集团股份有限公司 一种抠图模型训练、图像抠图的方法及装置

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740510A (zh) * 2023-03-20 2023-09-12 北京百度网讯科技有限公司 图像处理方法、模型训练方法及装置
CN116167922A (zh) * 2023-04-24 2023-05-26 广州趣丸网络科技有限公司 一种抠图方法、装置、存储介质及计算机设备
CN116167922B (zh) * 2023-04-24 2023-07-18 广州趣丸网络科技有限公司 一种抠图方法、装置、存储介质及计算机设备
CN116862931A (zh) * 2023-09-04 2023-10-10 北京壹点灵动科技有限公司 医学图像分割方法、装置、存储介质及电子设备
CN116862931B (zh) * 2023-09-04 2024-01-23 北京壹点灵动科技有限公司 医学图像分割方法、装置、存储介质及电子设备
CN117351118A (zh) * 2023-12-04 2024-01-05 江西师范大学 一种结合深度信息的轻量化固定背景抠像方法及系统
CN117351118B (zh) * 2023-12-04 2024-02-23 江西师范大学 一种结合深度信息的轻量化固定背景抠像方法及系统

Also Published As

Publication number Publication date
US20240029272A1 (en) 2024-01-25
CN114038006A (zh) 2022-02-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21953360

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21953360

Country of ref document: EP

Kind code of ref document: A1