WO2020093210A1 - Scene segmentation method and system based on contextual information guidance - Google Patents

Scene segmentation method and system based on contextual information guidance

Info

Publication number
WO2020093210A1
WO2020093210A1 (PCT/CN2018/114006)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
guidance module
layer
output
level
Prior art date
Application number
PCT/CN2018/114006
Other languages
English (en)
French (fr)
Inventor
唐胜
伍天意
李锦涛
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所
Priority to PCT/CN2018/114006
Publication of WO2020093210A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Definitions

  • The method belongs to the fields of machine learning and computer vision, and particularly relates to a scene segmentation method and system based on contextual information guidance.
  • Scene segmentation is a very important and challenging task in computer vision, with a wide range of applications in production and daily life, such as autonomous driving, robot navigation, and video editing.
  • The goal of scene segmentation is to assign each pixel in a scene image to its category.
  • Scene segmentation methods based on fully convolutional networks have recently made significant progress.
  • Current mainstream methods are all obtained by transferring classification networks such as VGG, ResNet, and ResNeXt: removing the max-pooling and fully connected layers, and adding deconvolution layers and Decoder modules to generate segmentation results.
  • Such methods usually involve a large number of parameters and computations and are very slow, which also limits their use on mobile devices.
  • The present invention proposes a scene segmentation method based on contextual information guidance, comprising: constructing a context-information-based guidance module, the guidance module having a residual structure; using multiple 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from the original image; using multiple of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map; using multiple of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map; and using a scene segmentation sub-network to obtain the scene segmentation result of the original image from the high-level feature map.
  • The guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is the local feature learner and w_loc its parameters; the local feature learner is built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is the surrounding context feature learner and w_sur its parameters; the surrounding context feature learner is built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is the joint feature learner and w_joi its parameters; f_glo(·) is the global feature learner and w_glo its parameters; x is the input of the guidance module.
  • The second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer is used as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; where M is a positive integer.
  • The third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer is used as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; where N is a positive integer.
  • The invention also discloses a scene segmentation system based on contextual information guidance, comprising: a guidance module construction module, for constructing a context-information-based guidance module, the guidance module having a residual structure; a first feature extractor module, for using multiple 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from the original image; a second feature extractor module, for using multiple of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map; a third feature extractor module, for using multiple of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map; and a scene segmentation result acquisition module, for using a scene segmentation sub-network to obtain the scene segmentation result of the original image from the high-level feature map.
  • The guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is the local feature learner and w_loc its parameters; the local feature learner is built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is the surrounding context feature learner and w_sur its parameters; the surrounding context feature learner is built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is the joint feature learner and w_joi its parameters; f_glo(·) is the global feature learner and w_glo its parameters; x is the input of the guidance module.
  • The first feature extractor module specifically comprises: downsampling the original image with the first 3×3 convolutional layer to obtain the output of the first 3×3 convolutional layer; using the output of each 3×3 convolutional layer as the input of the next 3×3 convolutional layer, to obtain the output of the last 3×3 convolutional layer; and combining the output of the first 3×3 convolutional layer and the output of the last 3×3 convolutional layer to obtain the primary feature map.
  • The second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer is used as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; where M is a positive integer.
  • The third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer is used as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; where N is a positive integer.
  • The scene segmentation system based on contextual information guidance of the present invention has a very small number of parameters, no more than 0.5M, a small memory footprint, and high segmentation performance.
  • FIG. 1A, 1B, and 1C are schematic diagrams of scene segmentation guided by contextual information.
  • FIG. 2 is a schematic structural diagram of the scene segmentation system based on contextual information guidance of the present invention.
  • FIG. 3A is a framework diagram of the scene segmentation method based on contextual information guidance of the present invention.
  • FIG. 3B is a schematic structural diagram of the context-information-based guidance module of the present invention.
  • FIG. 3C is a schematic diagram of the downsampling structure of the context-information-based guidance module of the present invention.
  • FIG. 4 compares the parameter counts of the scene segmentation method of the present invention and the prior art.
  • FIG. 5 compares the memory footprints of the scene segmentation method of the present invention and the prior art.
  • Contextual information is generally understood as perceiving and being able to apply information about objects that can influence the objects in a scene or image. The notion of contextual information derives from modeling the human visual system.
  • The human brain has excellent recognition performance: even with complex targets and backgrounds, the human visual system can quickly recognize and classify a large number of targets.
  • It adapts very well to factors of target imaging such as illumination, pose, texture, deformation, and occlusion.
  • FIG. 1A, 1B, and 1C are schematic diagrams of scene segmentation guided by contextual information.
  • The present invention first rethinks the essential characteristics of the semantic segmentation task. Semantic segmentation involves pixel-level classification and object localization, so spatial dependencies must be considered. This differs from a classification network, which learns abstract features of the whole image or of its salient objects. Notably, the human visual system captures contextual information to understand a scene. Based on these observations, the present invention proposes a contextual information guidance module to learn local features and capture spatial dependencies.
  • FIG. 2 is a schematic structural diagram of the scene segmentation system based on contextual information guidance of the present invention. As shown in FIG. 2, the present invention builds a new scene segmentation network based on the contextual information guidance module.
  • The scene segmentation network (CGNet) proposed by the present invention performs only three downsampling operations, which helps preserve spatial position information.
  • FIG. 3A is a framework diagram of a scene segmentation method based on context information guidance of the present invention. As shown in FIG. 3A, the present invention discloses a scene segmentation method based on context information, which specifically includes:
  • FIG. 3B is a schematic structural diagram of the context-information-based guidance module of the present invention. As shown in FIG. 3B, the guidance module can be formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); f_loc(·) is the local feature learner, built for example from a standard 3×3 convolutional layer (3×3 Conv); w_loc are the parameters of the local feature learner, obtainable by training it with the back-propagation algorithm; f_sur(·) is the surrounding context feature learner, built for example from a 3×3 dilated convolutional layer (3×3 DConv); w_sur are the parameters of the surrounding context feature learner, obtainable by training it with the back-propagation algorithm; f_joi(·) is the joint feature learner, for example a channel concatenation layer (Concat), with parameters w_joi; f_glo(·) is the global feature learner, for example a global average pooling layer (GAP) followed by a multi-layer perceptron, with parameters w_glo; x is the input of the guidance module.
  • Step S2: in the first stage, the original RGB image to be segmented is used as the input of the first feature extractor, which outputs a low-level feature map (the primary feature map); the first feature extractor consists of multiple standard 3×3 convolutional layers, for example three, and the first 3×3 convolutional layer of the first feature extractor performs the first downsampling of the original RGB image.
  • Step S3: in the second stage, the primary feature map output by the first feature extractor is used as the input of the second feature extractor, which outputs a mid-level feature map (the intermediate feature map); the second feature extractor consists of M layers of guidance modules, and the first-layer guidance module of the second feature extractor performs the second downsampling on the input primary feature map to obtain the second-stage downsampled feature map.
  • FIG. 3C shows the downsampling structure of the context-information-based guidance module of the present invention.
  • Step S4: in the third stage, the intermediate feature map output by the second feature extractor is used as the input of the third feature extractor, which outputs a high-level feature map; the third feature extractor consists of N layers of guidance modules, and the first-layer guidance module of the third feature extractor performs the third downsampling on the input intermediate feature map to obtain the third-stage downsampled feature map.
  • The downsampling structure of the third-stage guidance module is the same as in the second stage; the output of each guidance module layer is used as the input of the next layer, and the output of the N-th-layer guidance module is combined with the third-stage downsampled feature map to obtain the third-stage high-level feature map; N is a positive integer.
  • Step S5: the high-level feature map output by the third feature extractor is used as the input of the scene segmentation sub-network, through which the scene segmentation result of the original RGB image is obtained and then resampled by an upsampling function (Upsample).
  • The segmentation sub-network consists of a 1×1 convolutional layer (1×1 Conv).
  • The scene segmentation network based on the contextual information guidance module of the present invention has a small number of parameters (less than 0.5M), a small memory footprint, and high segmentation performance.
  • The scene segmentation network is divided into three stages: the first stage uses three standard 3×3 Conv layers, and the second and third stages use M and N contextual information guidance modules, respectively.
  • The output of the first guidance module and the output of the last guidance module of the preceding stage are used as the input of the first guidance module of the current stage, which helps information flow inside the network and eases optimization during training.
  • The cross-entropy loss is used as the loss function of the contextual-information-guided scene segmentation network, and there are only three downsampling operations.
  • The final output scene segmentation result is one-eighth the resolution of the original RGB image.
  • The experiments of the present invention use the Cityscapes dataset.
  • The Cityscapes dataset contains street scenes from 50 different cities. It is divided into three subsets: a training set of 2975 images, a validation set of 500 images, and a test set of 1525 images.
  • The dataset provides high-quality pixel-level annotations for 19 classes.
  • Performance is measured as the mean intersection-over-union (IoU) over all classes.
  • The scene segmentation method of the present invention is compared below with other existing scene segmentation methods in terms of performance, model parameter count, and speed.
  • Compared with ENet, a model with the same parameter count, the scene segmentation method of the present invention achieves 63.8% mean IoU, which is 5.3 percentage points higher, and it is 3.5 percentage points higher than ESPNet; compared with PSPNet, PSPNet's parameter count is 130 times that of our method.
  • The memory footprint of the scene segmentation method of the present invention is also compared with that of other methods.
  • The memory footprint of the scene segmentation method of the present invention is only 334M, while PSPNet_Ms requires 2180M.
  • The scene segmentation network constructed by the present invention based on the contextual information guidance module has a small number of parameters, a small memory footprint, and high segmentation performance.
  • The scene segmentation network is divided into three stages: the first stage uses three standard 3×3 Conv layers, and the second and third stages use M and N contextual information guidance modules, respectively.
  • The output of the first guidance module and the output of the last guidance module of the preceding stage are used as the input of the first guidance module of the current stage, which helps information flow inside the network and eases optimization during training.
  • The cross-entropy loss is used as the loss function of the contextual-information-guided scene segmentation network, and there are only three downsampling operations.
  • The final output scene segmentation result is one-eighth the resolution of the original RGB image.

Abstract

A scene segmentation method guided by contextual information, comprising: constructing a context-information-based guidance module as a residual-structured network; taking an original image as input and outputting a primary feature map through multiple 3×3 convolutional layers; taking the primary feature map as input and outputting an intermediate feature map through multiple of the guidance modules; taking the intermediate feature map as input and outputting a high-level feature map through multiple of the guidance modules; and taking the high-level feature map as input to obtain the scene segmentation result of the original image through a scene segmentation sub-network. The segmentation network designed by this method has a small number of parameters, and during feature extraction a global feature extractor further refines the joint feature formed by combining local features with their corresponding surrounding context features, which makes the model better suited to learning features for segmentation and greatly improves the performance of existing mobile scene segmentation networks.

Description

Scene segmentation method and system based on contextual information guidance
Technical Field
The present method belongs to the fields of machine learning and computer vision, and in particular relates to a scene segmentation method and system based on contextual information guidance.
Background Art
Scene segmentation is a very important and highly challenging task in computer vision, with broad application value in production and daily life, such as autonomous driving, robot navigation, and video editing. The goal of scene segmentation is to assign each pixel in a scene image to its category. Recently, scene segmentation methods based on fully convolutional networks have made remarkable progress. However, current mainstream methods are all obtained by transferring classification networks such as VGG, ResNet, and ResNeXt: removing the max-pooling and fully connected layers and adding deconvolution layers and Decoder modules to generate segmentation results. Such methods usually involve a large number of parameters and computations and are very slow, a limitation that also prevents their use on mobile devices. A few recent works do target mobile scene segmentation, but they all design segmentation networks following classification principles, which is an important factor limiting the accuracy of current mobile segmentation networks. There is still a large difference between classification and segmentation: for example, a classic classification network downsamples the original input by a factor of 32, which helps extract features better suited for classification but disregards positional information; segmentation, in contrast, requires very precise, pixel-level positional information.
Disclosure of the Invention
To address the above problems, the present invention proposes a scene segmentation method based on contextual information guidance, comprising: constructing a context-information-based guidance module, the guidance module having a residual structure; using multiple 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from the original image; using multiple of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map; using multiple of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map; and using a scene segmentation sub-network to obtain the scene segmentation result of the original image from the high-level feature map.
Further, the guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is the local feature learner and w_loc its parameters; the local feature learner is built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is the surrounding context feature learner and w_sur its parameters; the surrounding context feature learner is built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is the joint feature learner and w_joi its parameters; f_glo(·) is the global feature learner and w_glo its parameters; x is the input of the guidance module.
Further, the second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; where M is a positive integer.
Further, the third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; where N is a positive integer.
The present invention also discloses a scene segmentation system based on contextual information guidance, comprising: a guidance module construction module, for constructing a context-information-based guidance module, the guidance module having a residual structure; a first feature extractor module, for using multiple 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from the original image; a second feature extractor module, for using multiple of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map; a third feature extractor module, for using multiple of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map; and a scene segmentation result acquisition module, for using a scene segmentation sub-network to obtain the scene segmentation result of the original image from the high-level feature map.
Further, the guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is the local feature learner and w_loc its parameters; the local feature learner is built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is the surrounding context feature learner and w_sur its parameters; the surrounding context feature learner is built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is the joint feature learner and w_joi its parameters; f_glo(·) is the global feature learner and w_glo its parameters; x is the input of the guidance module.
Further, the first feature extractor module specifically comprises: downsampling the original image with the first 3×3 convolutional layer to obtain the output of the first 3×3 convolutional layer; using the output of each 3×3 convolutional layer as the input of the next 3×3 convolutional layer, to obtain the output of the last 3×3 convolutional layer; and combining the output of the first 3×3 convolutional layer and the output of the last 3×3 convolutional layer to obtain the primary feature map.
Further, the second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; where M is a positive integer.
Further, the third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; where N is a positive integer.
The scene segmentation system based on contextual information guidance of the present invention has a very small number of parameters, no more than 0.5M, a small memory footprint, and high segmentation performance.
Brief Description of the Drawings
FIG. 1A, 1B, and 1C are schematic diagrams of scene segmentation guided by contextual information.
FIG. 2 is a schematic structural diagram of the scene segmentation system based on contextual information guidance of the present invention.
FIG. 3A is a framework diagram of the scene segmentation method based on contextual information guidance of the present invention.
FIG. 3B is a schematic structural diagram of the context-information-based guidance module of the present invention.
FIG. 3C is a schematic diagram of the downsampling structure of the context-information-based guidance module of the present invention.
FIG. 4 compares the parameter counts of the scene segmentation method of the present invention and the prior art.
FIG. 5 compares the memory footprints of the scene segmentation method of the present invention and the prior art.
Best Mode for Carrying Out the Invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the scene segmentation method and system based on contextual information proposed by the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In the real world, an object never exists in isolation; it always bears some relationship to the other objects around it, which is what is usually called contextual information. Contextual information is generally understood as perceiving and being able to apply information about objects that can influence the objects in a scene or image. The notion of contextual information derives from modeling the human visual system: the human brain has outstanding recognition performance, and even with complex targets and backgrounds the human visual system can quickly recognize and classify large numbers of targets, adapting very well to factors of target imaging such as illumination, pose, texture, deformation, and occlusion. FIG. 1A, 1B, and 1C are schematic diagrams of scene segmentation guided by contextual information. As shown in FIG. 1A, when only the smallest black box region is visible, it is usually hard to classify it; as shown in FIG. 1B, when the surrounding context of the smallest black box region is also visible (the somewhat larger black box around the smallest one in FIG. 1B), classifying the smallest black box region becomes fairly easy; and as shown in FIG. 1C, with the help of global contextual information (the largest black box in FIG. 1C), the smallest black box region can be classified with rather high confidence.
To solve the above problems, the present invention first rethinks the essential characteristics of the semantic segmentation task. Semantic segmentation involves pixel-level classification and object localization, and therefore must take spatial dependencies into account. This differs from a classification network, which learns abstract features of the whole image or of its salient objects. Notably, the human visual system captures contextual information in order to understand a scene. Based on these observations, the present invention proposes a contextual information guidance module to learn local features and capture spatial dependencies. FIG. 2 is a schematic structural diagram of the scene segmentation system based on contextual information guidance of the present invention. As shown in FIG. 2, the present invention builds a new scene segmentation network based on the contextual information guidance module. The proposed scene segmentation network (CGNet) performs only three downsampling operations, which helps preserve spatial position information.
FIG. 3A is a framework diagram of the scene segmentation method based on contextual information guidance of the present invention. As shown in FIG. 3A, the present invention discloses a scene segmentation method based on contextual information, which specifically comprises:
Step S1: construct a contextual information guidance module with a residual structure. FIG. 3B is a schematic structural diagram of the context-information-based guidance module of the present invention. As shown in FIG. 3B, the guidance module can be formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))), where f_loc(·) is the local feature learner, which can be built, for example, from a standard 3×3 convolutional layer (3×3 Conv), and w_loc are the parameters of the local feature learner, obtainable by training it with the back-propagation algorithm; f_sur(·) is the surrounding context feature learner, which can be built, for example, from a 3×3 dilated convolutional layer (3×3 DConv), and w_sur are the parameters of the surrounding context feature learner, obtainable by training it with the back-propagation algorithm; f_joi(·) is the joint feature learner, which can be, for example, a channel concatenation layer (Concat), with parameters w_joi; f_glo(·) is the global feature learner, which can be, for example, a global average pooling layer (GAP) followed by a multi-layer perceptron, with parameters w_glo; x is the input of the guidance module;
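To make the structure above concrete, the following is a minimal PyTorch sketch of one plausible reading of the guidance module; the class name ContextGuidedBlock, the half-channel split, and the reduction ratio of the multi-layer perceptron are illustrative assumptions rather than details fixed by the patent (the BN+PReLU placement follows the experiments section below):

    import torch
    import torch.nn as nn

    class ContextGuidedBlock(nn.Module):
        """Sketch of f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x)))."""

        def __init__(self, in_ch, out_ch, dilation=2, reduction=16, down=False):
            super().__init__()
            half = out_ch // 2
            stride = 2 if down else 1  # downsampling variant of FIG. 3C
            # f_loc: standard 3x3 conv learning local features
            self.f_loc = nn.Conv2d(in_ch, half, 3, stride, padding=1, bias=False)
            # f_sur: 3x3 dilated conv learning the surrounding context
            self.f_sur = nn.Conv2d(in_ch, half, 3, stride, padding=dilation,
                                   dilation=dilation, bias=False)
            # f_joi: channel concatenation, here followed by BN + PReLU
            self.bn_act = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))
            # f_glo: global average pooling + multi-layer perceptron producing a
            # per-channel weight vector that refines the joint feature
            self.f_glo = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(out_ch, out_ch // reduction), nn.ReLU(inplace=True),
                nn.Linear(out_ch // reduction, out_ch), nn.Sigmoid())
            self.down = down

        def forward(self, x):
            joi = self.bn_act(torch.cat([self.f_loc(x), self.f_sur(x)], dim=1))
            w = self.f_glo(joi).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
            out = joi * w
            # residual structure: identity shortcut, used only when the block
            # keeps resolution and channel count (in_ch == out_ch, no downsampling)
            return out if self.down else x + out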
Step S2: in the first stage, the original RGB image to be segmented is used as the input of the first feature extractor, which outputs a low-level feature map (the primary feature map); the first feature extractor consists of multiple standard 3×3 convolutional layers, for example three, and the first 3×3 convolutional layer of the first feature extractor performs the first downsampling of the original RGB image;
Step S3: in the second stage, the primary feature map output by the first feature extractor is used as the input of the second feature extractor, which outputs a mid-level feature map (the intermediate feature map); the second feature extractor consists of M layers of guidance modules, and the first-layer guidance module of the second feature extractor performs the second downsampling on the input primary feature map to obtain the second-stage downsampled feature map (FIG. 3C is a schematic diagram of the downsampling structure of the context-information-based guidance module of the present invention); the output of each guidance module layer is used as the input of the next layer, and the output of the M-th-layer guidance module is combined with the second-stage downsampled feature map to obtain the second-stage intermediate feature map; M is a positive integer;
Step S4: in the third stage, the intermediate feature map output by the second feature extractor is used as the input of the third feature extractor, which outputs a high-level feature map; the third feature extractor consists of N layers of guidance modules, and the first-layer guidance module of the third feature extractor performs the third downsampling on the input intermediate feature map to obtain the third-stage downsampled feature map, whose downsampling structure is the same as in the second stage; the output of each guidance module layer is used as the input of the next layer, and the output of the N-th-layer guidance module is combined with the third-stage downsampled feature map to obtain the third-stage high-level feature map; N is a positive integer;
Step S5: the high-level feature map output by the third feature extractor is used as the input of the scene segmentation sub-network, through which the scene segmentation result of the original RGB image is obtained and then resampled by an upsampling function (Upsample); the scene segmentation sub-network consists of a 1×1 convolutional layer (1×1 Conv).
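Putting steps S2 to S5 together, the three-stage pipeline might be assembled as in the following sketch; it reuses the hypothetical ContextGuidedBlock above, and the channel widths (32/64/128) and the exact stage wiring are assumptions for illustration, not values given in the patent:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CGNetSketch(nn.Module):
        def __init__(self, num_classes=19, M=3, N=21):
            super().__init__()
            # Stage 1 (S2): three standard 3x3 convs, the first one downsampling
            self.stage1 = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.PReLU(32),
                nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(32),
                nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(32))
            # Stage 2 (S3): M guidance modules, the first one downsampling
            self.stage2_down = ContextGuidedBlock(32, 64, down=True)
            self.stage2 = nn.Sequential(
                *[ContextGuidedBlock(64, 64) for _ in range(M - 1)])
            # Stage 3 (S4): N guidance modules, the first one downsampling
            self.stage3_down = ContextGuidedBlock(128, 128, down=True)
            self.stage3 = nn.Sequential(
                *[ContextGuidedBlock(128, 128) for _ in range(N - 1)])
            # Segmentation sub-network (S5): a single 1x1 conv
            self.classifier = nn.Conv2d(256, num_classes, 1)

        def forward(self, x):
            s1 = self.stage1(x)                            # 1/2 resolution
            d2 = self.stage2_down(s1)                      # 1/4 resolution
            mid = torch.cat([d2, self.stage2(d2)], dim=1)  # first + last output
            d3 = self.stage3_down(mid)                     # 1/8 resolution
            high = torch.cat([d3, self.stage3(d3)], dim=1)
            logits = self.classifier(high)                 # 1/8-resolution map
            # Upsample restores the prediction to the input resolution
            return F.interpolate(logits, scale_factor=8, mode='bilinear',
                                 align_corners=False)

For example, sum(p.numel() for p in CGNetSketch().parameters()) gives a quick check of whether a given (M, N) configuration stays within the sub-0.5M parameter budget described below.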
To allow the scene segmentation network to run on mobile devices, the scene segmentation network based on the contextual information guidance module of the present invention has few parameters (under 0.5M), a small memory footprint, and high segmentation performance. The network is divided into three stages: the first stage uses three standard 3×3 Conv layers, and the second and third stages use M and N contextual information guidance modules, respectively. For the second and third stages, the output of the first guidance module and the output of the last guidance module of the preceding stage are used as the input of the first guidance module of the current stage, which helps information flow inside the network and eases optimization during training. For the whole scene segmentation network, the cross-entropy loss is used as the loss function of the contextual-information-guided scene segmentation network, there are only three downsampling operations, and the final output segmentation result is one-eighth the resolution of the original RGB image.
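As a sketch of the training objective under these assumptions (the ignore_index value of 255 follows the common Cityscapes labeling convention and is not specified in the patent):

    import torch.nn as nn

    # 255 as the ignored label id follows the usual Cityscapes convention
    criterion = nn.CrossEntropyLoss(ignore_index=255)

    def training_step(model, images, labels):
        # the model upsamples its 1/8-resolution logits back to input size,
        # so the cross-entropy loss is taken against the full-size label map
        logits = model(images)             # (B, num_classes, H, W)
        return criterion(logits, labels)   # labels: (B, H, W) long tensor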
To describe the above features and effects of the present invention more clearly, relevant experiments are presented below to further illustrate the scene segmentation method of the present invention.
I. Dataset
The experiments of the present invention use the Cityscapes dataset. The Cityscapes dataset contains street scenes from 50 different cities. It is divided into three subsets: a training set of 2975 images, a validation set of 500 images, and a test set of 1525 images. The dataset provides high-quality pixel-level annotations for 19 classes. Performance is measured as the mean intersection-over-union (IoU) over all classes.
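The mean IoU metric can be computed from a per-class confusion matrix; the following illustrative NumPy helper is not code from the patent:

    import numpy as np

    def mean_iou(conf):
        """conf: (K, K) confusion matrix, conf[i, j] = number of pixels of
        ground-truth class i predicted as class j."""
        tp = np.diag(conf).astype(float)
        union = conf.sum(axis=1) + conf.sum(axis=0) - tp
        # classes absent from both prediction and ground truth would need
        # masking in practice; the max() here only avoids division by zero
        return float(np.mean(tp / np.maximum(union, 1)))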
II. Effectiveness Experiments
1. To analyze the effectiveness of the proposed surrounding context feature learner f_sur(·), the CGNet_M3N15 model is used for validation; the results in Table 1 show that the surrounding context feature learner f_sur(·) improves mean IoU by 5.1 percentage points, with M=3 and N=15.
Method        f_sur(·)   Mean IoU (%)
CGNet_M3N15   w/o        54.6
CGNet_M3N15   w/         59.7
Table 1
2. On top of the joint feature learned by the local feature learner f_loc(·) and the surrounding context feature learner f_sur(·), the global feature learner f_glo(·) learns a weight vector to refine the joint feature. Table 2 shows that the global feature learner raises segmentation performance from 58.9% to 59.7%, with M=3 and N=15.
Method        f_glo(·)   Mean IoU (%)
CGNet_M3N15   w/o        58.9
CGNet_M3N15   w/         59.7
Table 2
3. The input injection mechanism improves performance by 0.3 percentage points; see Table 3, with M=3 and N=15.
Method        Input Injection   Mean IoU (%)
CGNet_M3N15   w/o               59.4
CGNet_M3N15   w/                59.7
Table 3
4. The PReLU activation function improves performance by 1.6 percentage points; see Table 4, with M=3 and N=15.
Activation   Mean IoU (%)
ReLU         59.4
PReLU        59.7
Table 4
5. The proposed CGNet is trained with different settings of M and N. Table 5 shows the trade-off between model performance and parameter count. In general, deeper networks perform better than shallower ones. However, Table 5 shows that, with N fixed, segmentation performance does not improve as M increases. For example, with N=12 fixed, increasing M from 3 to 6 decreases segmentation performance by 0.2 percentage points. Therefore, M=3 is chosen for the scene segmentation network proposed by the present invention.
M   N    Parameters (M)   Mean IoU (%)
3   9    0.34             56.5
3   12   0.38             58.1
6   12   0.39             57.9
3   15   0.41             59.7
6   15   0.41             58.4
3   18   0.45             61.1
3   21   0.49             63.5
Table 5
6. A further trade-off between performance and model size can be made by varying N. Table 6 shows that with M=3 and N=21, a mean IoU of 63.5% is achieved, and global residual learning outperforms local residual learning by 6.3 percentage points. Local residual learning (LRL) is the connection pattern labeled LRL in FIG. 3B and FIG. 3C, and global residual learning (GRL) is the connection pattern labeled GRL in FIG. 3B and FIG. 3C; a minimal sketch contrasting the two patterns follows Table 6.
Residual connection   Mean IoU (%)
LRL                   57.2
GRL                   63.5
Table 6
7. Many previous works place a 1×1 convolution after a channel-wise convolution to enhance inter-channel information exchange. Comparing the variant without a 1×1 convolution after the BN+PReLU layer against the variant with one, Table 7 shows that adding the 1×1 convolution decreases performance by 10.2 percentage points. The reason is that the local features and their corresponding surrounding context features in the contextual information guidance module proposed by the present invention need to remain independent across channels; a sketch of the two variants follows Table 7.
Method        1×1 Conv   Mean IoU (%)
CGNet_M3N21   w/         53.3
CGNet_M3N21   w/o        63.5
Table 7
III. Comparison with Other Methods
Next, the scene segmentation method of the present invention is compared with existing scene segmentation methods in three respects: performance, model parameter count, and speed.
1. Compared with the existing scene segmentation methods PSPNet_Ms, SegNet, ENet, and ESPNet, as shown in Table 8, it can be seen that on the Cityscapes dataset the high-accuracy model PSPNet_Ms takes more than 1 s per image, while the scene segmentation method of the present invention runs at 43 fps; moreover, although its speed is slightly lower than that of ESPNet, its accuracy is 3.5 percentage points higher.
Method        Mean IoU (%)   ms      fps
PSPNet_Ms     78.4           >1000   <1
SegNet        56.1           88.0    11
ENet          58.3           61.0    16
ESPNet        60.3           18.6    49
CGNet_M3N21   63.8           23.4    43
Table 8
2. As shown in FIG. 4, without any pre-processing, post-processing, or complex Decoder modules (such as ASPP or PPModule), the scene segmentation method of the present invention achieves 63.8% mean IoU, 5.3 percentage points higher than ENet, a model with the same parameter count, and 3.5 percentage points higher than ESPNet; compared with PSPNet, PSPNet's parameter count is 130 times that of our method.
3. As shown in FIG. 5, comparing the memory footprint of the scene segmentation method of the present invention with that of other methods: for an input image of 3×640×360, the memory footprint of the present method is only 334M, while PSPNet_Ms requires 2180M.
Industrial Applicability
The scene segmentation network built by the present invention on the contextual information guidance module has few parameters, a small memory footprint, and high segmentation performance. The scene segmentation network is divided into three stages: the first stage uses three standard 3×3 Conv layers, and the second and third stages use M and N contextual information guidance modules, respectively. For the second and third stages, the output of the first guidance module and the output of the last guidance module of the preceding stage are used as the input of the first guidance module of the current stage, which helps information flow inside the network and eases optimization during training. For the whole scene segmentation network, the cross-entropy loss is used as the loss function of the contextual-information-guided scene segmentation network, there are only three downsampling operations, and the final output scene segmentation result is one-eighth the resolution of the original RGB image.

Claims (10)

  1. A scene segmentation method based on contextual information guidance, characterized in that it comprises:
    constructing a context-information-based guidance module, the guidance module having a residual structure;
    using a plurality of 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from an original image;
    using a plurality of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map;
    using a plurality of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map;
    using a scene segmentation sub-network to obtain a scene segmentation result of the original image from the high-level feature map.
  2. The scene segmentation method according to claim 1, characterized in that the guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is a local feature learner and w_loc its parameters, the local feature learner being built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is a surrounding context feature learner and w_sur its parameters, the surrounding context feature learner being built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is a joint feature learner and w_joi its parameters; f_glo(·) is a global feature learner and w_glo its parameters; and x is the input of the guidance module.
  3. The scene segmentation method according to claim 1, characterized in that the original image is downsampled by a first 3×3 convolutional layer to obtain the output of the first 3×3 convolutional layer; the output of each 3×3 convolutional layer serves as the input of the next 3×3 convolutional layer, to obtain the output of the last 3×3 convolutional layer; and the output of the first 3×3 convolutional layer and the output of the last 3×3 convolutional layer are combined to obtain the primary feature map.
  4. The scene segmentation method according to claim 3, characterized in that the second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; and the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; wherein M is a positive integer.
  5. The scene segmentation method according to claim 4, characterized in that the third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; and the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; wherein N is a positive integer.
  6. A scene segmentation system based on contextual information guidance, characterized in that it comprises:
    a guidance module construction module, for constructing a context-information-based guidance module, the guidance module having a residual structure;
    a first feature extractor module, for using a plurality of 3×3 convolutional layers as a first feature extractor to obtain a primary feature map from an original image;
    a second feature extractor module, for using a plurality of the guidance modules as a second feature extractor to obtain an intermediate feature map from the primary feature map;
    a third feature extractor module, for using a plurality of the guidance modules as a third feature extractor to obtain a high-level feature map from the intermediate feature map;
    a scene segmentation result acquisition module, for using a scene segmentation sub-network to obtain a scene segmentation result of the original image from the high-level feature map.
  7. The scene segmentation system according to claim 6, characterized in that the guidance module is formalized as f_glo(w_glo, f_joi(w_joi, f_loc(w_loc, x), f_sur(w_sur, x))); where f_loc(·) is a local feature learner and w_loc its parameters, the local feature learner being built from a 3×3 convolutional layer and trained with the back-propagation algorithm to obtain w_loc; f_sur(·) is a surrounding context feature learner and w_sur its parameters, the surrounding context feature learner being built from a 3×3 dilated convolutional layer and trained with the back-propagation algorithm to obtain w_sur; f_joi(·) is a joint feature learner and w_joi its parameters; f_glo(·) is a global feature learner and w_glo its parameters; and x is the input of the guidance module.
  8. The scene segmentation system according to claim 7, characterized in that the first feature extractor module specifically comprises: downsampling the original image with a first 3×3 convolutional layer to obtain the output of the first 3×3 convolutional layer; using the output of each 3×3 convolutional layer as the input of the next 3×3 convolutional layer, to obtain the output of the last 3×3 convolutional layer; and combining the output of the first 3×3 convolutional layer and the output of the last 3×3 convolutional layer to obtain the primary feature map.
  9. The scene segmentation system according to claim 8, characterized in that the second feature extractor has M layers of guidance modules; the first-layer guidance module of the second feature extractor downsamples the primary feature map to obtain the output of the first-layer guidance module of the second feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the M-th-layer guidance module of the second feature extractor; and the output of the first-layer guidance module of the second feature extractor and the output of the M-th-layer guidance module of the second feature extractor are combined to obtain the intermediate feature map; wherein M is a positive integer.
  10. The scene segmentation system according to claim 9, characterized in that the third feature extractor has N layers of guidance modules; the first-layer guidance module of the third feature extractor downsamples the intermediate feature map to obtain the output of the first-layer guidance module of the third feature extractor; the output of each guidance module layer serves as the input of the next layer, to obtain the output of the N-th-layer guidance module of the third feature extractor; and the output of the first-layer guidance module of the third feature extractor and the output of the N-th-layer guidance module of the third feature extractor are combined to obtain the high-level feature map; wherein N is a positive integer.
PCT/CN2018/114006 2018-11-05 2018-11-05 Scene segmentation method and system based on contextual information guidance WO2020093210A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114006 WO2020093210A1 (zh) 2018-11-05 2018-11-05 Scene segmentation method and system based on contextual information guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114006 WO2020093210A1 (zh) 2018-11-05 2018-11-05 Scene segmentation method and system based on contextual information guidance

Publications (1)

Publication Number Publication Date
WO2020093210A1 true WO2020093210A1 (zh) 2020-05-14

Family

ID=70612325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114006 WO2020093210A1 (zh) 2018-11-05 2018-11-05 Scene segmentation method and system based on contextual information guidance

Country Status (1)

Country Link
WO (1) WO2020093210A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204062A1 (en) * 2015-06-03 2018-07-19 Hyperverge Inc. Systems and methods for image processing
CN106570467A (zh) * 2016-10-25 2017-04-19 南京南瑞集团公司 一种基于卷积神经网络的人员离岗检测方法
CN107992854A (zh) * 2017-12-22 2018-05-04 重庆邮电大学 基于机器视觉的林业生态环境人机交互方法
CN108399419A (zh) * 2018-01-25 2018-08-14 华南理工大学 基于二维递归网络的自然场景图像中中文文本识别方法
CN108664974A (zh) * 2018-04-03 2018-10-16 华南理工大学 一种基于rgbd图像与全残差网络的语义分割方法

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932553A (zh) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on a region-description self-attention mechanism
CN114092815A (zh) * 2021-11-29 2022-02-25 自然资源部国土卫星遥感应用中心 Intelligent remote-sensing extraction method for large-scale photovoltaic power generation facilities
CN114092815B (zh) 2021-11-29 2022-04-15 自然资源部国土卫星遥感应用中心 Intelligent remote-sensing extraction method for large-scale photovoltaic power generation facilities


Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18939450; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 18939450; Country of ref document: EP; Kind code of ref document: A1)

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 27.09.2021))