WO2021139062A1 - A fully automatic natural image matting method - Google Patents

A fully automatic natural image matting method

Info

Publication number
WO2021139062A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
convolution
mask
information
level
Prior art date
Application number
PCT/CN2020/089942
Other languages
English (en)
French (fr)
Inventor
杨鑫
魏小鹏
张强
刘宇豪
乔羽
Original Assignee
大连理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大连理工大学 filed Critical 大连理工大学
Priority to US16/963,140 (US11195044B2)
Publication of WO2021139062A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/04Context-preserving transformations, e.g. by using an importance map
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The invention belongs to the technical field of computer vision and relates to a deep-learning-based method for fully automatic natural image matting.
  • Image matting is an important task in computer vision. It builds on image segmentation but extends it in depth: segmentation aims to separate different regions, or regions of interest, in an image, and is essentially a zero-or-one binary classification problem that places few demands on the fineness of the segmented edges. Image matting not only separates the foreground region but also requires a high degree of fineness in the extracted object, for example human hair, animal fur, dense meshes, and translucent objects, all of which must be finely and visibly separated. Such high-precision segmentation results are extraordinarily useful for image composition, from everyday portrait background replacement to virtual background production in the film industry and the fine production of industrial parts.
  • Image matting and image composition are essentially mutually inverse processes, and their mathematical model can be expressed by the following formula:
  • I_z = αF_z + (1 − α)B_z,  α ∈ [0, 1]  (1)
  • where z = (x, y) denotes a pixel position in the input image I, F and B denote the foreground and background values after separation, and α denotes the opacity of the pixel, with a value between 0 and 1; estimating it is essentially a regression problem.
  • This formula gives an intuitive interpretation of image composition: an image consists of many pixels, and each pixel is a weighted sum of foreground and background, with α as the weighting factor.
  • Formula 1 shows that image matting is an under-constrained problem: for an RGB color image there are 7 unknowns but only 3 knowns per pixel. Existing methods therefore resolve this ill-posed problem by adding extra auxiliary information (e.g., a trimap or scribbles), in which the α values of some regions are usually manually specified.
  • Sampling-based methods sample the known foreground and background regions to find candidate foreground and background colors for a given pixel, and then use different evaluation metrics to determine the optimal weighted combination of foreground and background pixels.
  • Different sampling schemes affect the weighted combination differently; they include sampling pixel pairs along the boundary of the unknown region, ray-casting-based sampling, and color-clustering-based sampling.
  • The evaluation metrics are used to decide among the sampled candidates and mainly include the reconstruction error of formula 1, the distance to pixels in the unknown region, and similarity measures on the foreground/background samples.
  • In propagation-based methods, the α of formula 1 is propagated from pixels with known α to pixels with unknown α through different propagation algorithms.
  • The most mainstream propagation algorithm makes a local smoothness assumption on the foreground/background and then finds the globally optimal alpha matte by solving a sparse linear system.
  • Other methods include random walks and non-local propagation.
  • The laboratory of Professor Jia Jiaya at the Chinese University of Hong Kong proposed deep automatic portrait matting, an end-to-end deep learning method that requires no user interaction and greatly reduces manual labor while maintaining accuracy.
  • The laboratory of Professor Xu Weiwei at Zhejiang University proposed a Late-fusion method. From a classification perspective, it splits image matting into coarse foreground/background classification and edge optimization. In its implementation, two classification tasks are first performed on an image, and several convolutional layers then perform a fusion operation. It differs from deep portrait matting in that deep portrait matting uses a traditional propagation method to bridge the training process, whereas Late-fusion trains in stages in a fully convolutional manner.
  • The present invention proposes a fully automatic image matting framework based on attention-guided hierarchical structure fusion.
  • This framework can obtain a fine matte from a single input RGB image without any additional auxiliary information.
  • The user feeds a single RGB image into the framework; it first passes through a feature extraction network with a dilated pyramid pooling module to extract image features, then through a channel attention module that filters the high-level features, after which the filtered result and the low-level features are fed into a spatial attention module to extract image details. Finally, the resulting matte, the label information, and the original image are fed into a discriminator network for late optimization, yielding a fine image matte.
  • A fully automatic natural image matting method obtains a fine matte of the foreground object from a single RGB image.
  • The method consists of four parts.
  • The overall framework is shown in Figure 1; the specific steps are as follows:
  • The hierarchical feature extraction stage extracts feature representations of different levels from the input image; ResNext is selected as the base network and divided into five blocks.
  • The five blocks progress from shallow to deep: the shallow layers extract low-level spatial features and texture features, and the deep layers progressively extract high-level semantic features.
  • As the network deepens, what it learns is mostly deep semantic features, so the second block is used to extract the low-level spatial and texture features, as shown in Figure 2.
  • After the high-level semantic feature representation is extracted, traditional methods usually pass the entire representation to the next step without filtering. Since an image contains more than one kind of object, more than one semantic response is activated at the high level, and objects in both the foreground and the background may be activated (i.e., different channels respond to different objects), which causes great trouble for image matting.
  • The present invention proposes a pyramid feature filtering module (i.e., the channel attention in the hierarchical attention). The specific procedure is shown in Figure 4.
  • The obtained high-level semantic features first pass through a max-pooling operation, which compresses the many feature values of each channel into a single feature value; the compressed feature values then pass through a shared multilayer perceptron composed of three convolutional layers to update the feature values across channels; finally, a nonlinear activation function is applied.
  • Each channel element of the resulting channel attention map is multiplied with all elements of the corresponding channel of the previous stage's high-level semantic features, thereby selecting among the different activated regions; the mathematical expression is Output = σ(MLP(MaxPool(Input))) × Input (formula 2).
  • Input denotes the high-level semantic features obtained in the first stage, and σ denotes the nonlinear activation function; the channel attention map obtained after σ has size 1×1×n, where n is the number of channels, while the high-level semantic features have size x×y×n, where x and y are the height and width of each channel.
  • When the two are multiplied, a broadcast operation is performed: one element of the channel attention map is multiplied with all elements of the corresponding channel of the high-level semantic features.
  • Existing deep-learning-based image matting methods directly upsample the selected high-level semantic features to obtain the final matte, which loses much of the detail and texture information at the edges of the foreground object.
  • The present invention proposes a spatial information extraction module (i.e., the spatial attention in the hierarchical attention), as shown in Figure 5.
  • The updated high-level semantic features, together with the spatial and texture features extracted by the second block in the hierarchical feature extraction stage, are taken as input, with the updated high-level semantic features serving as guidance information.
  • The updated high-level semantic features first undergo a 3×3 convolution, and the result is then convolved along two directions: one path first applies a 7×1 convolution horizontally and then a 1×7 convolution vertically to the result; the other first applies a 1×7 convolution vertically and then a 7×1 convolution horizontally to the result. The outputs of the two parallel but differently ordered convolution paths are concatenated, which further sifts and filters the updated high-level semantic features.
  • In the mean squared error L_MSE = (1/|Ω|) Σ_{i∈Ω} (α_p^i − α_g^i)^2, Ω denotes the set of pixels, |Ω| denotes the number of pixels in an image, and α_p^i and α_g^i denote the matte value and the supervision value at pixel i, respectively.
  • The structural similarity error ensures the consistency of the spatial and texture information extracted from the low-level features, thereby further improving the structure of the foreground object; it is computed as L_SSIM = 1 − (2μ_p μ_g + C_1)(2σ_pg + C_2) / ((μ_p^2 + μ_g^2 + C_1)(σ_p^2 + σ_g^2 + C_2)), where μ_p, μ_g and σ_p, σ_g are the means and standard deviations of the matte and the supervision information (written here in the standard SSIM form, in which σ_pg is their covariance and C_1, C_2 are small stabilizing constants).
  • The present invention adopts a discriminator network in the late optimization stage.
  • The obtained matte, the input image, and the supervision information are fed to the discriminator network.
  • The discriminator network takes the concatenation of the supervision information and the input image as the standard by which to judge the concatenation of the generated matte and the input image: whenever there is the slightest difference between the matte and the supervision information, the discriminator returns false; only when the two are fully consistent does it return true.
  • The discriminator further improves the visual quality of the matte, so that a more realistic rendering is obtained when images are composited.
  • Beneficial effects of the present invention: compared with existing image matting methods, the biggest advantage of the present invention is that it requires no auxiliary information and no additional user interaction; a fine matte is obtained from a single input RGB image. On the one hand, this saves researchers a great deal of time, as auxiliary information such as trimaps or scribbles no longer needs to be made by hand; on the other hand, users no longer need to manually annotate foreground/background constraint information.
  • The attention-guided hierarchical structure fusion method of the present invention is instructive for the image matting task: it removes the dependence on auxiliary information while guaranteeing the accuracy of the matte. The idea of letting the high level learn to guide the low level is of considerable reference value for other computer vision tasks.
  • Figure 1 is a flow chart of the overall framework.
  • Figure 2 shows the original input image and its corresponding low-level features.
  • Figure 3 is a flow chart of the dilated spatial pyramid pooling.
  • Figure 4 is a flow chart of the pyramid feature filtering module.
  • Figure 5 is a flow chart of the spatial information extraction module.
  • Figure 6 compares the effects of the different components: (a) the original input image; (b) the matte obtained with only the feature extraction network and dilated spatial pyramid pooling; (c) the matte obtained after adding the pyramid feature filtering module to (b); (d) the matte obtained after adding the spatial information extraction module to (c); (e) the result of the full framework; (f) the supervision information.
  • The core of the present invention lies in attention-guided hierarchical structure fusion; the invention is described in detail below in combination with specific implementations.
  • The invention is divided into four parts.
  • The first part uses the feature extraction network and the dilated spatial pyramid pooling module to extract features of different levels, as shown in the overall framework flow chart of Figure 1 and the dilated spatial pyramid pooling flow chart of Figure 3. By adjusting the receptive field of each block of the feature extraction network, the final feature map of the network has a relatively large field of view, avoiding being restricted to a local point during learning.
  • The dilated spatial pyramid pooling module can extract and fuse features at different scales, handling objects of different scales and sizes in the input image more robustly.
  • The features after the spatial pyramid pooling module are treated as high-level semantic features, and the features from the second block as low-level structural features; the second part uses the pyramid feature filtering module to filter and select the high-level semantic features, as shown in Figure 4.
  • An inter-channel attention operation is performed on the semantically strong feature maps, adaptively assigning large weights to useful channels while weakening channels carrying little or even useless information.
  • The third part sends the result of the previous stage as guidance information to the spatial information extraction module for low-level structural feature extraction, and then fuses the updated high-level semantic features with the low-level structural features, as shown in Figure 5.
  • The spatial information extraction module optimizes the edges of the foreground object well.
  • Features irrelevant to the foreground can be filtered out of the low-level information, focusing on foreground edge features; the filtered high-level semantic features and the extracted low-level structural features are finally fused to obtain the final result. The fourth part uses the discriminator network to further optimize the obtained matte so that its visual effect is more consistent with the supervision information, as shown in Figure 1.
  • With the help of the discriminator, the prediction and the original image form one input pair, and the supervision information and the original image form another; both pairs are fed to the discriminator at the same time, so that the discriminator can judge the quality of the network's predictions and thereby optimize the visual effect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention belongs to the technical field of computer vision and provides a fully automatic natural image matting method. For single-image matting, the method consists of four parts: extraction of high-level semantic features and low-level structural features, pyramid feature filtering, spatial structure information extraction, and late optimization with a discriminator network. The invention can generate an accurate matte without any auxiliary information, saving researchers the time of annotating auxiliary information and sparing users any interaction.

Description

A fully automatic natural image matting method

Technical field
The present invention belongs to the technical field of computer vision and relates to a deep-learning-based method for fully automatic natural image matting.
Background
The key technology for seamlessly compositing a foreground object with another image into a new image is image matting. With the development of society and the continuous progress of technology, the number of images around us is growing exponentially, accompanied by a wave of image processing techniques. From early image classification to object detection and then image segmentation, behind all of them lies the demand to free people's hands and reduce labor, and meeting this demand through different image processing techniques is what makes our lives more convenient.
Image matting is an important task in computer vision. It builds on image segmentation but extends it in depth. Image segmentation aims to separate different regions, or regions of interest, in an image; it is essentially a zero-or-one binary classification problem and places few demands on the fineness of the segmented edges. Image matting not only separates the foreground region but also requires a high degree of fineness in the extracted object: human hair, animal fur, dense meshes, and translucent objects must all be finely and visibly separated. Such high-precision segmentation results are extraordinarily useful for image composition, from everyday applications such as replacing the background of a portrait to virtual background production in the film industry and the fine production of industrial parts.
Image matting and image composition are essentially mutually inverse processes, and their mathematical model can be expressed by the following formula:
I_z = αF_z + (1 − α)B_z,  α ∈ [0, 1]  (1)
In the above formula, z = (x, y) denotes a pixel position in the input image I, F and B denote the foreground and background values after separation, and α denotes the opacity of that pixel, with a value between 0 and 1; estimating it is essentially a regression problem. The formula gives an intuitive interpretation of image composition: an image consists of many pixels, and each pixel is a weighted sum of foreground and background, with α as the weighting factor. When α = 1 the pixel is fully opaque, i.e., composed only of foreground; when α = 0 it is fully transparent, i.e., composed only of background; when α lies strictly between 0 and 1, the pixel is a weighted combination of foreground and background, and the region containing such pixels is called the unknown or transition region.
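A minimal NumPy sketch of formula (1), with illustrative array names (this is an editorial illustration, not part of the patent text):

```python
import numpy as np

def composite(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend per formula (1): I_z = alpha * F_z + (1 - alpha) * B_z.

    fg, bg: (H, W, 3) float arrays with values in [0, 1].
    alpha:  (H, W) float array with values in [0, 1].
    """
    a = alpha[..., None]          # broadcast the matte over the RGB channels
    return a * fg + (1.0 - a) * bg
```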
Looking back at formula 1, image matting is an under-constrained problem: for an RGB color image there are 7 unknowns but only 3 knowns per pixel. Existing methods therefore resolve this ill-posed problem by adding extra auxiliary information (e.g., a trimap or scribbles), in which the α values of some regions are manually specified. As science and technology advance, research on image matting and related fields keeps making new breakthroughs; the many algorithms in this area fall roughly into the following three categories.
(1) Sampling-based methods
Sampling-based methods sample the known foreground and background regions to find candidate foreground and background colors for a given pixel, and then use different evaluation metrics to determine the optimal weighted combination of foreground and background pixels. Different sampling schemes affect the weighted combination differently; they include sampling pixel pairs along the boundary of the unknown region, ray-casting-based sampling, and color-clustering-based sampling. The evaluation metrics are used to decide among the sampled candidates and mainly include the reconstruction error of formula 1, the distance to pixels in the unknown region, and similarity measures on the foreground/background samples.
(2) Propagation-based methods
In propagation-based methods, the α of formula 1 is propagated from pixels with known α to pixels with unknown α through different propagation algorithms. The most mainstream propagation algorithm makes a local smoothness assumption on the foreground/background and then finds the globally optimal alpha matte by solving a sparse linear system. Other methods include random walks and non-local propagation.
(3) Deep-learning-based methods
With the rapid development of deep learning, more and more deep-learning-based methods have surpassed traditional image processing techniques in vision tasks such as image classification and semantic segmentation, and the application of deep learning to image matting has greatly improved the quality of the final composited images. The laboratory of Professor Jia Jiaya at the Chinese University of Hong Kong proposed deep automatic portrait matting, which considers not only semantic prediction but also pixel-level matte optimization. In its implementation, the input image is first divided into foreground, background, and unknown regions by semantic segmentation, and a novel matting layer is then proposed so that the whole network supports feed-forward and back-propagation. This end-to-end deep learning approach requires no user interaction and greatly reduces manual labor while maintaining accuracy. Recently, the laboratory of Professor Xu Weiwei at Zhejiang University proposed a Late-fusion method, which, from a classification perspective, splits image matting into coarse foreground/background classification and edge optimization. In its implementation, two classification tasks are first performed on an image, and several convolutional layers then perform a fusion operation. It differs from deep portrait matting in that deep portrait matting uses a traditional propagation method to bridge the training process, whereas Late-fusion trains in stages in a fully convolutional manner.
Summary of the invention
Aiming at the shortcomings of existing methods, the present invention proposes a fully automatic image matting framework based on attention-guided hierarchical structure fusion. The framework can obtain a fine matte from a single input RGB image without any additional auxiliary information. The user feeds a single RGB image into the framework; it first passes through a feature extraction network with a dilated pyramid pooling module to extract image features, then through a channel attention module that filters the high-level features, after which the filtered result and the low-level features are fed into a spatial attention module to extract image details. Finally, the resulting matte, the label information, and the original image are fed into a discriminator network for late optimization, yielding a fine image matte.
Technical solution of the present invention:
A fully automatic natural image matting method, which obtains a fine matte of the foreground object from a single RGB image without any additional auxiliary information. The method consists of four parts; the overall framework is shown in Figure 1, and the specific steps are as follows:
(1) Hierarchical feature extraction stage
The hierarchical feature extraction stage extracts feature representations of different levels from the input image. ResNext is selected as the base network and divided into five blocks, progressing from shallow to deep: the shallow layers extract low-level spatial features and texture features, and the deep layers progressively extract high-level semantic features. As the network deepens, what it learns is mostly deep semantic features, so the second block is used to extract the low-level spatial and texture features; Figure 2 shows the structure-related information of the image. To give the deep network a larger receptive field, the ordinary convolutions of the fifth block are first replaced with dilated convolutions with a dilation rate of 2; to cope with foreground objects of different sizes, the high-level semantic features extracted by the fifth block are fed into the dilated spatial pyramid pooling module shown in Figure 3, where the dilation rates of the dilated convolutions are set to 6, 12, and 18. The results of these five parallel operations are then concatenated and passed through a 3×3 convolution to obtain the high-level semantic feature representation.
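A minimal PyTorch sketch of such a dilated spatial pyramid pooling module follows. The dilation rates (6, 12, 18) and the five parallel branches come from the text above; the exact composition of the remaining branches (a 1×1 convolution and an image-level pooling branch, as in common ASPP designs) and the channel widths are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedASPP(nn.Module):
    """Dilated spatial pyramid pooling: five parallel branches, concatenated,
    then fused by a 3x3 convolution (dilation rates 6/12/18 as in the text)."""
    def __init__(self, in_ch: int = 2048, out_ch: int = 256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)                        # 1x1 conv
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))            # image-level branch
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 3, padding=1)           # 3x3 fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        g = F.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                          align_corners=False)
        cat = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x),
                         self.branch4(x), g], dim=1)
        return self.fuse(cat)
```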
(2) Pyramid feature filtering stage
After the high-level semantic feature representation has been extracted, traditional methods usually pass the entire representation to the next step without any filtering. Since an image contains more than one kind of object, more than one semantic response is activated at the high level, and objects in both the foreground and the background may be activated (i.e., different channels respond to different objects), which causes great trouble for image matting. The present invention proposes a pyramid feature filtering module (the channel attention in the hierarchical attention); the specific procedure is shown in Figure 4. The obtained high-level semantic features first pass through a max-pooling operation, which compresses the many feature values of each channel into a single feature value; the compressed feature values then pass through a shared multilayer perceptron composed of three convolutional layers to update the feature values across channels; finally, each channel element of the channel attention map obtained through the nonlinear activation function is multiplied with all elements of the corresponding channel of the previous stage's high-level semantic features, thereby selecting among the different activated regions. The mathematical expression is as follows:
Output = σ(MLP(MaxPool(Input))) × Input  (2)
where Input denotes the high-level semantic features obtained in the first stage, and σ denotes the nonlinear activation function. The channel attention map obtained after σ has size 1×1×n, where n is the number of channels, while the high-level semantic features have size x×y×n, where x and y are the height and width of each channel; when the two are multiplied, a broadcast operation is performed, in which one element of the channel attention map is multiplied with all elements of the corresponding channel of the high-level semantic features.
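A PyTorch sketch of the pyramid feature filtering module as described by formula (2); the channel-reduction ratio of the shared MLP and the use of a sigmoid as σ are assumptions:

```python
import torch
import torch.nn as nn

class PyramidFeatureFilter(nn.Module):
    """Channel attention per formula (2): Output = sigma(MLP(MaxPool(x))) * x.
    Three 1x1 convolutions act as the shared three-layer MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = channels // reduction
        self.pool = nn.AdaptiveMaxPool2d(1)            # each channel -> one value
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1))               # shared three-layer MLP
        self.sigma = nn.Sigmoid()                      # nonlinear activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.sigma(self.mlp(self.pool(x)))      # shape (B, n, 1, 1)
        return attn * x                                # broadcast over x and y
```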
(3) Spatial information extraction stage
Existing deep-learning-based image matting methods directly upsample the selected high-level semantic features to obtain the final matte, which loses much of the detail and texture information at the edges of the foreground object. To improve the fineness of the matte at object edges (e.g., hair, translucent glass, meshes), the present invention proposes a spatial information extraction module (the spatial attention in the hierarchical attention), as shown in Figure 5. The updated high-level semantic features, together with the spatial and texture features extracted by the second block in the hierarchical feature extraction stage, are taken as input, and the updated high-level semantic features serve as guidance information to selectively extract, from the spatial information, the spatial and texture features related to the foreground object. Specifically, the updated high-level semantic features first undergo a 3×3 convolution, and the result is then convolved along two directions: one path first applies a 7×1 convolution horizontally and then a 1×7 convolution vertically to the result; the other first applies a 1×7 convolution vertically and then a 7×1 convolution horizontally to the result. The outputs of the two parallel but differently ordered convolution paths are concatenated, which further sifts and filters the updated high-level semantic features. The result then undergoes a 1×1 convolution for deep fusion, followed by a nonlinear activation function, yielding the spatial attention map; this spatial attention map is multiplied element-wise with the low-level features from the second block to obtain updated low-level features. After a 3×3 convolution, the updated low-level features are concatenated with the updated high-level semantic features, and the fused features pass through a 3×3 convolution to produce the output of this stage (a code sketch of this module follows the loss definitions below). At this stage, to ensure consistency between the final generated matte and the label information, this patent designs a mixed loss function composed of the structural similarity error (SSIM) and the mean squared error (MSE). The mean squared error supervises the pixel-wise consistency between the matte and the supervision information and is computed as follows:
L_MSE = (1/|Ω|) Σ_{i∈Ω} (α_p^i − α_g^i)^2
where Ω denotes the set of pixels, |Ω| denotes the number of pixels in an image, and α_p^i and α_g^i denote the matte value and the supervision value at pixel i, respectively. The structural similarity error ensures the consistency of the spatial information and texture information extracted from the low-level features, thereby further improving the structure of the foreground object; it is computed as follows:
L_SSIM = 1 − (2μ_p μ_g + C_1)(2σ_pg + C_2) / ((μ_p^2 + μ_g^2 + C_1)(σ_p^2 + σ_g^2 + C_2))
where α_p^i and α_g^i denote the matte value and the supervision value at pixel i, and μ_p, μ_g and σ_p, σ_g denote their means and standard deviations (written here in the standard SSIM form, in which σ_pg is the covariance of the two and C_1, C_2 are small constants for numerical stability).
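Returning to the module itself, a PyTorch sketch of the spatial information extraction stage described above; the mapping of "7×1 horizontal / 1×7 vertical" onto kernel shapes, the upsampling of the high-level features to the low-level resolution, and all channel widths are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialInfoExtractor(nn.Module):
    """Spatial attention: updated high-level features guide the selection of
    low-level spatial/texture features (3x3 conv, two parallel 7x1/1x7 paths,
    1x1 deep fusion, sigmoid, element-wise selection, concatenation)."""
    def __init__(self, high_ch: int, low_ch: int, out_ch: int = 256):
        super().__init__()
        self.pre = nn.Conv2d(high_ch, out_ch, 3, padding=1)           # 3x3 conv
        # a "7x1 horizontal" conv is read here as kernel (kH, kW) = (1, 7)
        self.path_a = nn.Sequential(                                  # horizontal then vertical
            nn.Conv2d(out_ch, out_ch, (1, 7), padding=(0, 3)),
            nn.Conv2d(out_ch, out_ch, (7, 1), padding=(3, 0)))
        self.path_b = nn.Sequential(                                  # vertical then horizontal
            nn.Conv2d(out_ch, out_ch, (7, 1), padding=(3, 0)),
            nn.Conv2d(out_ch, out_ch, (1, 7), padding=(0, 3)))
        self.fuse = nn.Conv2d(2 * out_ch, low_ch, 1)                  # 1x1 deep fusion
        self.low_conv = nn.Conv2d(low_ch, out_ch, 3, padding=1)
        self.out_conv = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # match the low-level resolution so the attention map can be applied
        h = self.pre(F.interpolate(high, size=low.shape[-2:],
                                   mode="bilinear", align_corners=False))
        attn = torch.sigmoid(self.fuse(
            torch.cat([self.path_a(h), self.path_b(h)], dim=1)))     # spatial attention map
        low = self.low_conv(attn * low)                               # element-wise selection
        return self.out_conv(torch.cat([low, h], dim=1))              # concatenate and fuse
```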
(4) Late optimization stage
To make the generated matte visually match the supervision information more closely, the present invention employs a discriminator network in the late optimization stage. As shown in Figure 1, the obtained matte, the input image, and the supervision information are all fed to the discriminator network. The discriminator takes the concatenation of the supervision information and the input image as the standard by which to judge the concatenation of the generated matte and the input image: whenever there is the slightest difference between the matte and the supervision information, the discriminator returns false, and only when the two are fully consistent does it return true. The discriminator thus further optimizes the visual quality of the matte, so that more realistic renderings are obtained when images are composited.
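A sketch of the mixed loss and one adversarial training step for this stage; the discriminator architecture, computing SSIM globally rather than in local windows, and the loss weights are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixed_loss(pred, gt, w_ssim=0.5, c1=0.01 ** 2, c2=0.03 ** 2):
    """MSE for pixel-wise consistency plus an SSIM term, per the mixed loss above."""
    mse = F.mse_loss(pred, gt)
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return mse + w_ssim * (1.0 - ssim)

def train_step(gen, disc, opt_g, opt_d, image, gt_matte,
               bce=nn.BCEWithLogitsLoss()):
    # discriminator: real pair = (image, supervision), fake pair = (image, prediction)
    pred = gen(image)
    real = disc(torch.cat([image, gt_matte], dim=1))
    fake = disc(torch.cat([image, pred.detach()], dim=1))
    loss_d = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # generator: mixed matting loss plus a small adversarial term
    fake = disc(torch.cat([image, pred], dim=1))
    loss_g = mixed_loss(pred, gt_matte) + 0.01 * bce(fake, torch.ones_like(fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```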
Beneficial effects of the present invention: compared with existing image matting methods, the biggest advantage of the present invention is that it requires no auxiliary information and no additional user interaction; a fine matte is obtained from a single input RGB image. On the one hand, this saves researchers a great deal of time, as auxiliary information such as trimaps or scribbles no longer needs to be made by hand; on the other hand, users no longer need to manually annotate foreground/background constraint information. At the same time, the attention-guided hierarchical structure fusion method of the present invention is instructive for the image matting task: it removes the dependence on auxiliary information while guaranteeing the accuracy of the matte. The idea of letting the high level guide the learning of the low level is of considerable reference value for other computer vision tasks.
Description of the drawings
Figure 1 is a flow chart of the overall framework.
Figure 2 shows the original input image and its corresponding low-level feature representation.
Figure 3 is a flow chart of the dilated spatial pyramid pooling.
Figure 4 is a flow chart of the pyramid feature filtering module.
Figure 5 is a flow chart of the spatial information extraction module.
Figure 6 compares the effects of the different components: (a) the original input image; (b) the matte obtained with only the feature extraction network and dilated spatial pyramid pooling; (c) the matte obtained after adding the pyramid feature filtering module to (b); (d) the matte obtained after adding the spatial information extraction module to (c); (e) the result of the full framework; (f) the supervision information.
Detailed description of the embodiments
The specific embodiments of the present invention are further described below in combination with the drawings and the technical solution.
To better compare the contribution of each component to the whole framework, a visual explanation is given with Figure 6: (a) is the original input image; (b) is the matte obtained with only the feature extraction network and dilated spatial pyramid pooling; (c) adds the pyramid feature filtering module on top of (b); (d) adds the spatial information extraction module on top of (c); (e) is the result of the full framework; (f) is the supervision information. For convenience, we call the model corresponding to (b) the baseline network. After the original image (a) is fed into the baseline network, the result shows large gray areas in the foreground and abrupt color changes between the edge region and the middle region of the object. After the pyramid feature filtering module is added to the baseline network (b), the middle region improves markedly, especially the clothes of the two people in the image, but the mesh details at the edges remain blurry. Likewise, when only the spatial information extraction module is added to the baseline network, figure (d) shows that the mesh transparency information at the edges improves considerably, but too much background information around the person in the middle remains. Finally, adding both the pyramid feature filtering module and the spatial information extraction module to the baseline network (b) yields our final result, as shown in (e). From this sequence of mattes, one can see that the person in the background and the letters on the clothes gradually disappear as components are added, while the foreground mesh and its edges become increasingly refined, further confirming the importance and indispensability of each module for improving performance.
The core of the present invention lies in attention-guided hierarchical structure fusion; the invention is described in detail below in combination with the specific implementation. The invention is divided into four parts. The first part uses the feature extraction network and the dilated spatial pyramid pooling module to extract features of different levels, as shown in the overall framework flow chart of Figure 1 and the dilated spatial pyramid pooling flow chart of Figure 3. By adjusting the receptive field of each block of the feature extraction network, the final feature map of the network has a relatively large field of view, avoiding being restricted to a local point during learning. The dilated spatial pyramid pooling module can extract and fuse features at different scales and handles objects of different scales and sizes in the input image more robustly; we treat the features after the spatial pyramid pooling module as high-level semantic features, and the features from the second block of the feature extraction module as low-level structural features. The second part uses the pyramid feature filtering module to filter and select the high-level semantic features, as shown in Figure 4: through an attention mechanism, an inter-channel attention operation is performed on the semantically strong feature maps, adaptively assigning large weights to useful channels while weakening channels carrying little or even useless information. The third part sends the result of the previous stage as guidance information to the spatial information extraction module for low-level structural feature extraction, and then fuses the updated high-level semantic features with the low-level structural features, as shown in Figure 5. The spatial information extraction module optimizes the edges of the foreground object well: with the feature map of the previous stage as guidance, features irrelevant to the foreground can be filtered out of the low-level information at this stage, focusing on foreground edge features; the filtered high-level semantic features and the extracted low-level structural features are finally fused to obtain the final result. The fourth part uses the discriminator network to further optimize the obtained matte so that its visual effect is more consistent with the supervision information, as shown in Figure 1. With the help of the discriminator, the prediction and the original image form one input pair, and the supervision information and the original image form another; both pairs are fed to the discriminator at the same time, so that the discriminator can judge the quality of the network's predictions and thereby optimize the visual effect.
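Putting the four parts together, a sketch of the inference path, reusing the module sketches above with a torchvision ResNeXt-50 backbone; the block boundaries, channel sizes, and the omission of the dilation-rate-2 modification in the fifth block are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnext50_32x4d

class MattingNet(nn.Module):
    """Inference path: backbone -> ASPP -> channel attention -> spatial attention -> matte."""
    def __init__(self):
        super().__init__()
        r = resnext50_32x4d(weights=None)
        self.block1 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.block2 = r.layer1                      # low-level spatial/texture features
        self.rest = nn.Sequential(r.layer2, r.layer3, r.layer4)
        self.aspp = DilatedASPP(2048, 256)          # sketched earlier
        self.chan_attn = PyramidFeatureFilter(256)  # sketched earlier
        self.spat_attn = SpatialInfoExtractor(high_ch=256, low_ch=256)
        self.head = nn.Conv2d(256, 1, 3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        low = self.block2(self.block1(image))
        high = self.chan_attn(self.aspp(self.rest(low)))
        fused = self.spat_attn(high, low)
        matte = torch.sigmoid(self.head(fused))
        return F.interpolate(matte, size=image.shape[-2:], mode="bilinear",
                             align_corners=False)
```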

Claims (1)

  1. A fully automatic natural image matting method, which obtains a fine matte of the foreground object from a single RGB image without any additional auxiliary information, the method consisting of four parts, characterized in that the steps are as follows:
    (1) Hierarchical feature extraction stage
    The hierarchical feature extraction stage extracts feature representations of different levels from the input image; ResNext is selected as the base network and divided into five blocks, progressing from shallow to deep: the shallow layers extract low-level spatial features and texture features, and the deep layers progressively extract high-level semantic features; as the network deepens, what it learns is mostly deep semantic features, so the second block is used to extract the low-level spatial and texture features; to give the deep network a larger receptive field, the ordinary convolutions of the fifth block are first replaced with dilated convolutions with a dilation rate of 2; to cope with foreground objects of different sizes, the high-level semantic features extracted by the fifth block are fed into the dilated spatial pyramid pooling module, where the dilation rates of the dilated convolutions are set to 6, 12, and 18; the results of these five parallel operations are then concatenated and passed through a 3×3 convolution to obtain the high-level semantic feature representation;
    (2) Pyramid feature filtering stage
    A pyramid feature filtering module is proposed: the obtained high-level semantic features first pass through a max-pooling operation, which compresses the many feature values of each channel into a single feature value; the compressed feature values then pass through a shared multilayer perceptron composed of three convolutional layers to update the feature values across channels; finally, each channel element of the channel attention map obtained through the nonlinear activation function is multiplied with all elements of the corresponding channel of the previous stage's high-level semantic features, thereby selecting among the different activated regions; the mathematical expression is as follows:
    Output = σ(MLP(MaxPool(Input))) × Input  (2)
    where Input denotes the high-level semantic features obtained in the first stage, and σ denotes the nonlinear activation function; the channel attention map obtained after σ has size 1×1×n, where n is the number of channels, while the high-level semantic features have size x×y×n, where x and y are the height and width of each channel; when the two are multiplied, a broadcast operation is performed, in which one element of the channel attention map is multiplied with all elements of the corresponding channel of the high-level semantic features;
    (3) Spatial information extraction stage
    A spatial information extraction module is proposed: the updated high-level semantic features, together with the spatial and texture features extracted by the second block in the hierarchical feature extraction stage, are taken as input, and the updated high-level semantic features serve as guidance information to selectively extract, from the spatial information, the spatial and texture features related to the foreground object; specifically, the updated high-level semantic features first undergo a 3×3 convolution, and the result is then convolved along two directions: one path first applies a 7×1 convolution horizontally and then a 1×7 convolution vertically to the result; the other first applies a 1×7 convolution vertically and then a 7×1 convolution horizontally to the result; the outputs of the two parallel but differently ordered convolution paths are concatenated, which further sifts and filters the updated high-level semantic features; the result then undergoes a 1×1 convolution for deep fusion, followed by a nonlinear activation function, yielding the spatial attention map; this spatial attention map is multiplied element-wise with the low-level features from the second block to obtain updated low-level features; after a 3×3 convolution, the updated low-level features are concatenated with the updated high-level semantic features, and the fused features pass through a 3×3 convolution to produce the output of this stage; at this stage, to ensure consistency between the final generated matte and the label information, a mixed loss function composed of the structural similarity error and the mean squared error is designed; the mean squared error supervises the pixel-wise consistency between the matte and the supervision information and is computed as follows:
    L_MSE = (1/|Ω|) Σ_{i∈Ω} (α_p^i − α_g^i)^2
    where Ω denotes the set of pixels, |Ω| denotes the number of pixels in an image, and α_p^i and α_g^i denote the matte value and the supervision value at pixel i, respectively; the structural similarity error ensures the consistency of the spatial information and texture information extracted from the low-level features, thereby further improving the structure of the foreground object, and is computed as follows:
    L_SSIM = 1 − (2μ_p μ_g + C_1)(2σ_pg + C_2) / ((μ_p^2 + μ_g^2 + C_1)(σ_p^2 + σ_g^2 + C_2))
    where α_p^i and α_g^i denote the matte value and the supervision value at pixel i, and μ_p, μ_g and σ_p, σ_g denote their means and standard deviations (written here in the standard SSIM form, in which σ_pg is the covariance of the two and C_1, C_2 are small constants for numerical stability);
    (4) Late optimization stage
    To make the generated matte visually match the supervision information more closely, a discriminator network is employed in the late optimization stage; the obtained matte, the input image, and the supervision information are all fed to the discriminator network, which takes the concatenation of the supervision information and the input image as the standard by which to judge the concatenation of the generated matte and the input image; whenever there is the slightest difference between the matte and the supervision information, the discriminator returns false, and only when the two are fully consistent does it return true; the discriminator thus further optimizes the visual quality of the matte, so that more realistic renderings are obtained at image composition time.
PCT/CN2020/089942 2020-01-12 2020-05-13 A fully automatic natural image matting method WO2021139062A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/963,140 US11195044B2 (en) 2020-01-12 2020-05-13 Fully automatic natural image matting method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010029018.X 2020-01-12
CN202010029018.XA CN111223041B (zh) 2020-01-12 2020-01-12 A fully automatic natural image matting method

Publications (1)

Publication Number Publication Date
WO2021139062A1 true WO2021139062A1 (zh) 2021-07-15

Family

ID=70832454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089942 WO2021139062A1 (zh) 2020-01-12 2020-05-13 A fully automatic natural image matting method

Country Status (2)

Country Link
CN (1) CN111223041B (zh)
WO (1) WO2021139062A1 (zh)


Also Published As

Publication number Publication date
CN111223041B (zh) 2022-10-14
CN111223041A (zh) 2020-06-02


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 20912612; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
122 Ep: PCT application non-entry in European phase (ref document number: 20912612; country of ref document: EP; kind code of ref document: A1)