WO2020093211A1 - Kronecker convolution-based scene segmentation method and system - Google Patents

Kronecker convolution-based scene segmentation method and system

Info

Publication number
WO2020093211A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
kronecker
convolution
scene segmentation
standard
Prior art date
Application number
PCT/CN2018/114007
Other languages
French (fr)
Chinese (zh)
Inventor
唐胜
伍天意
李锦涛
Original Assignee
中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority date
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority to PCT/CN2018/114007
Publication of WO2020093211A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion


Abstract

A Kronecker convolution-based scene segmentation method, comprising: constructing a Kronecker convolution layer having a residual structure; constructing a feature extraction sub-network on the basis of the Kronecker convolution layer and standard convolution layers, taking an original image as an input, and outputting an abstract feature map by means of the feature extraction sub-network; constructing a tree-structured feature aggregation module on the basis of the Kronecker convolution layer, taking the abstract feature map as an input, and outputting an aggregated feature map by means of the tree-structured feature aggregation module; and constructing a scene segmentation sub-network on the basis of the Kronecker convolution layer, taking said aggregated feature map as an input, and outputting a scene segmentation result of the original image by means of the scene segmentation sub-network.

Description

Scene segmentation method and system based on Kronecker convolution

Technical Field
The method belongs to the fields of machine learning and computer vision, and particularly relates to a scene segmentation method and system based on Kronecker convolution and a tree-structured feature aggregation module.
Background Art
Scene segmentation is an important and highly challenging task in computer vision, with broad application value in production and daily life, for example in autonomous driving, robot navigation, and video editing. The goal of scene segmentation is to assign each pixel in a scene image to its category. Recently, scene segmentation methods based on fully convolutional networks have made significant progress. The current mainstream approach is to adapt a classification network by removing its max-pooling and fully connected layers and adding deconvolution layers to generate the segmentation result. However, there are substantial differences between classification and segmentation. For example, a classic classification network downsamples the original input by a factor of 32, which helps extract features better suited to classification but causes the network to discard position information; segmentation, on the contrary, requires very precise position information, down to the pixel level. Researchers have proposed dilated convolution, which alleviates this problem to some extent: it enlarges the receptive field of the filter while maintaining the resolution of the feature map, and has achieved fairly good segmentation performance. However, dilated convolution has a drawback: when the dilation factor is large, it loses a great deal of local detail. In particular, when the dilation factor approaches the size of the feature map, a 3×3 convolution degenerates into a 1×1 convolution.
In addition, for a scene segmentation network, objects in the scene often appear at multiple scales, and scenes have a hierarchical structure. For example, in the Cityscapes dataset, the car at the center of an image is usually a distant car at a small scale, whereas the regions on both sides of the image usually contain nearby cars at a large scale. To address these two problems, many existing methods apply dilated convolution in the basic feature extraction sub-network and then use cross-layer feature fusion to segment multi-scale objects. However, the local detail lost by dilated convolution and the simplicity of cross-layer fusion for segmenting multi-scale objects limit segmentation performance to a certain extent.
Disclosure of the Invention
In view of the above problems, the present invention proposes a scene segmentation method based on Kronecker convolution, comprising: constructing a Kronecker convolution layer having a residual structure; constructing a feature extraction sub-network from the Kronecker convolution layer and standard convolution layers; taking the original image as input and outputting an abstract feature map through the feature extraction sub-network; constructing a tree-structured feature aggregation module from the Kronecker convolution layer; taking the abstract feature map as input and outputting an aggregated feature map through the tree-structured feature aggregation module; constructing a scene segmentation sub-network from the Kronecker convolution layer; and taking the aggregated feature map as input and outputting the scene segmentation result of the original image through the scene segmentation sub-network.
Further, the Kronecker convolution layer is formalized as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

where $K(c_1, c_2)$ is a standard convolution kernel; $c_1$ and $c_2$ are the channel indices of the Kronecker convolution layer, with $c_1 \in [1, C_A]$ and $c_2 \in [1, C_B]$; $C_A$ is the number of channels of the feature map input to $K(c_1, c_2)$; $C_B$ is the number of channels of the feature map output by $K(c_1, c_2)$; and $F$ is a two-dimensional expansion matrix such that when $K(c_1, c_2)$ is of size $k \times k$, $K_1(c_1, c_2)$ is expanded to size $(2k+1) r_1 \times (2k+1) r_1$. Here $k$ is the kernel size of the standard convolution, $r_1$ is the dilation factor of the Kronecker convolution layer, $r_2$ is the sharing factor of the Kronecker convolution layer, and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers.
Further, the feature extraction sub-network comprises five stages. Stage 1 comprises three cascaded 3×3 standard convolution layers; stage 2 comprises multiple cascaded first bottleneck modules; stage 3 comprises multiple cascaded first bottleneck modules; stage 4 comprises multiple cascaded second bottleneck modules; and stage 5 comprises multiple cascaded second bottleneck modules. The first bottleneck module comprises, in cascade, a 1×1 standard convolution layer, a 3×3 standard convolution layer, and a 1×1 standard convolution layer; the second bottleneck module comprises, in cascade, a 1×1 standard convolution layer, a Kronecker convolution layer, and a 1×1 standard convolution layer.
Further, the tree-structured feature aggregation module comprises cascaded aggregation layers. Each aggregation layer comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function, and the output of each aggregation layer serves as the input of the next aggregation layer. The outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
Further, the scene segmentation sub-network comprises multiple cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
The present invention also discloses a scene segmentation system based on Kronecker convolution, comprising:
a Kronecker convolution layer construction module for constructing a Kronecker convolution layer having a residual structure;
a feature extraction sub-network for taking an original image as input and outputting an abstract feature map, wherein the feature extraction sub-network comprises the Kronecker convolution layer and standard convolution layers;
a tree-structured feature aggregation module for taking the abstract feature map as input and outputting an aggregated feature map, wherein the tree-structured feature aggregation module comprises multiple Kronecker convolution layers; and
a scene segmentation sub-network for taking the aggregated feature map as input and outputting the scene segmentation result of the original image, wherein the scene segmentation sub-network comprises multiple Kronecker convolution layers.
Brief Description of the Drawings
FIG. 1 is an overall framework diagram of the Kronecker convolution-based scene segmentation method of the present invention.
FIG. 2A is a schematic diagram of dilated convolution in the prior art;
FIG. 2B is a schematic diagram of the Kronecker convolution of the present invention;
FIG. 3 is a schematic structural diagram of the feature extraction sub-network proposed by the present invention;
FIG. 4 is a schematic diagram of the tree-structured feature aggregation module proposed by the present invention;
FIGS. 5 and 6 compare the performance of the scene segmentation method of the present invention with that of the prior art.
FIG. 7 shows experimental results of the scene segmentation method of the present invention on the PASCAL VOC 2012 dataset.
FIG. 8 shows experimental results of the scene segmentation method of the present invention on the Cityscapes dataset.
FIG. 9 shows experimental results of the scene segmentation method of the present invention on the PASCAL-Context dataset.
Best Mode for Carrying Out the Invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the Kronecker convolution-based scene segmentation method and system proposed by the present invention are described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here are intended only to explain the present invention and are not intended to limit it.
The Kronecker convolution-based scene segmentation method and system of the present invention perform feature learning on the original image using Kronecker convolution, feed the resulting features into a tree-structured feature aggregation module to learn hierarchical context information, and then feed the resulting features and context information into a scene segmentation sub-network to obtain the scene segmentation result of the original image. The present invention proposes a Kronecker convolution for feature extraction that enlarges the receptive field of the filter without adding extra parameters and captures local information, thereby achieving higher segmentation accuracy. In addition, the present invention proposes a tree-structured feature aggregation module to segment multi-scale objects and capture hierarchical context information, which greatly improves the performance of existing fully convolutional scene segmentation models.
The Kronecker product is a special form of the tensor product, namely an operation between two matrices of arbitrary size. The Kronecker convolution kernel is formally expressed as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

where $K(c_1, c_2)$ is a standard convolution kernel, $c_1 \in [1, C_A]$, $c_2 \in [1, C_B]$, and $C_A$ and $C_B$ are the numbers of channels of the convolution's input and output feature maps, respectively. The matrix $F$ combines an all-ones matrix of size $r_2 \times r_2$ in its upper-right corner with a zero matrix of size $(r_1 - r_2) \times (r_1 - r_2)$ in its lower-right corner. Assuming the standard convolution kernel is $k \times k$ (with taps indexed by $i, j \in [-k, k]$, consistent with the sampling formula below), the Kronecker convolution kernel is expanded to size $(2k+1) r_1 \times (2k+1) r_1$. Here $r_1$ and $r_2$ are the two hyperparameters of the Kronecker convolution layer proposed by the present invention: $r_1$ is the dilation factor of the Kronecker convolution layer and $r_2$ is its sharing factor; $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers, and $\otimes$ denotes the Kronecker product.
Assume the center coordinate of the convolution block on the input feature map corresponding to the standard convolution is $(p_t, q_t)$. The sampling points $(x_{ijuv}, y_{ijuv})$ on the input feature map $Y_t$ are then

$$x_{ijuv} = p_t + i r_1 + u, \qquad y_{ijuv} = q_t + j r_1 + v$$

where $i, j \in [-k, k] \cap \mathbb{Z}$ and $u, v \in [0, r_2 - 1] \cap \mathbb{Z}$.

The corresponding Kronecker convolution operation is formalized as

$$B_t = \sum_{i=-k}^{k} \sum_{j=-k}^{k} \sum_{u=0}^{r_2 - 1} \sum_{v=0}^{r_2 - 1} \mathbf{k}_{ij}^{\mathsf{T}} \, \mathbf{y}_{x_{ijuv},\, y_{ijuv}} + b$$

where $(p_t, q_t)$ is the spatial position index of the input feature map $Y_t$; $B_t$ is the output feature map; $\mathbf{y}_{x, y} \in \mathbb{R}^{C_A}$ is the feature vector of the input feature map $Y_t$ at position $(x, y)$; $\mathbf{k}_{ij}$ collects the Kronecker convolution kernel parameters at tap $(i, j)$, shared over the $r_2 \times r_2$ block indexed by $(u, v)$; $b$ is a bias vector; and $\mathbb{R}^{C_A}$ denotes the $C_A$-dimensional space.
FIG. 1 is an overall framework diagram of the Kronecker convolution-based scene segmentation method of the present invention. As shown in FIG. 1, the Kronecker convolution-based scene segmentation method of the present invention specifically comprises the following steps.
Step S1: constructing a Kronecker convolution layer.
The present invention proposes a new convolution, Kronecker convolution, which enlarges the receptive field of standard convolution without increasing its number of parameters. Moreover, the Kronecker convolution proposed by the present invention is compatible with the entire scene segmentation network: it can be inserted into the scene segmentation network to form a complete structure and be trained end to end. Here, "end to end" is a term of art meaning that, within the structure of the scene segmentation network, the mapping from the original image input to the final output can be realized by a single unified network, without needing to be split into multiple training stages.
FIG. 2A is a schematic diagram of dilated convolution in the prior art, and FIG. 2B is a schematic diagram of the Kronecker convolution of the present invention. FIG. 2A shows a 3×3 dilated convolution, where $f$ is the dilation factor of the dilated convolution. As shown in FIG. 2B, the Kronecker convolution kernel is formally expressed as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

with the standard kernel $K(c_1, c_2)$, the expansion matrix $F$, the channel counts $C_A$ and $C_B$, the kernel size $k$, the dilation factor $r_1$, and the sharing factor $r_2$ all defined as above, and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ positive integers.
Step S2: through the feature extraction sub-network, taking the original RGB image $I$ as input and outputting an abstract feature map $f_l$.
FIG. 3 is a schematic structural diagram of the feature extraction sub-network proposed by the present invention. As shown in FIG. 3, in the scene segmentation method of the present invention the feature extraction sub-network comprises five stages, each consisting of multiple standard convolution layers, or of multiple standard convolutions together with multiple Kronecker convolutions. Notably, in the later stages of the feature extraction sub-network the feature maps have very many channels: typically 1024 feature channels in stage 4 and 2048 in stage 5. If Kronecker convolution were applied directly to re-learn these features, the huge number of parameters would contain a great deal of redundancy, while also slowing down the segmentation speed of the whole scene segmentation network and increasing its computational complexity. To solve this problem, the present invention embeds the Kronecker convolution in a structure with a "bottleneck", which may be called a bottleneck module. The beginning and end of the bottleneck module are standard 1×1 convolution layers: the 1×1 convolution layer at the beginning reduces the number of channels of the input feature map, and the 1×1 convolution layer at the end restores the number of channels of the output feature map. The bottleneck module greatly reduces the number of parameters of the feature extraction sub-network.
In the scene segmentation network of the present invention, stage 1 of the feature extraction sub-network comprises three standard 3×3 convolution layers arranged in sequence. Stages 2 to 5 comprise multiple bottleneck modules: stages 2 and 3 use one kind of bottleneck module, called the first bottleneck module, which comprises two standard 1×1 convolution layers and one standard 3×3 convolution layer, while stages 4 and 5 use another kind, called the second bottleneck module, which comprises two standard 1×1 convolution layers and one Kronecker convolution layer. The original RGB image $I$ is taken as the input of stage 1, yielding feature map 1 as the output of stage 1; feature map 1 is taken as the input of stage 2, and so on: feature map 2 output by stage 2, feature map 3 output by stage 3, and feature map 4 output by stage 4 are each fed to the next stage, yielding feature map 3 from stage 3, feature map 4 from stage 4, and feature map 5 from stage 5. Feature map 5 is taken as the abstract feature map $f_l$.
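For illustration, the second bottleneck module described above might be sketched as follows, reusing the `KroneckerConv2d` sketch from step S1; the reduced channel width and the placement of batch normalization and ReLU are assumptions in the usual residual-network style.

```python
class SecondBottleneck(nn.Module):
    """Bottleneck module for stages 4 and 5: a 1x1 convolution reduces the
    channel count, a Kronecker convolution enlarges the receptive field,
    and a final 1x1 convolution restores the channel count; a residual
    connection adds the input back to the output."""

    def __init__(self, channels, reduced, r1, r2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            KroneckerConv2d(reduced, reduced, k=3, r1=r1, r2=r2, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual structure
```

The first bottleneck module used in stages 2 and 3 is identical except that the Kronecker convolution layer is replaced by a standard 3×3 convolution.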
Step S3: through the tree-structured feature aggregation module, taking the abstract feature map $f_l$ as input and outputting the aggregated feature map $f_c$.
Most current scene segmentation frameworks are based on fully convolutional networks, which mainly comprise two sub-networks in series: a feature extraction sub-network and a scene segmentation sub-network. Given an original scene image $I$, the scene segmentation network $N$ produces the scene segmentation result $J$ of $I$. The network $N$ can be decomposed into a feature extraction sub-network $N_{fea}$ and a scene segmentation sub-network $N_{seg}$, so that $J = N_{seg}(N_{fea}(I))$, where $N_{fea}(I)$ denotes the abstract feature maps $f_l$ obtained from the feature extraction sub-network; these feature maps contain the semantic concepts and spatial position information learned from the original scene image $I$.
The scene segmentation method of the present invention inserts a tree-structured feature aggregation module between the feature extraction sub-network and the scene segmentation sub-network. FIG. 4 is a schematic structural diagram of the tree-structured feature aggregation module proposed by the present invention. As shown in FIG. 4, the tree-structured feature aggregation module comprises multiple cascaded aggregation layers; each aggregation layer comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function, and the output of each aggregation layer serves as the input of the next aggregation layer. The outputs of all aggregation layers are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map. The tree-structured feature aggregation module thus comprises multiple Kronecker convolution layers applied in a cascaded, recursive manner. The tree-structured feature aggregation module of the present invention has the following expansion rules.
The first aggregation layer $f_1(\cdot)$ of the tree-structured feature aggregation module takes the feature map $x$ output by the preceding sub-network as input and outputs the context-information feature map $x_1 = f_1(x)$, where $f_1(\cdot)$ comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function. The second aggregation layer $f_2(\cdot)$ takes $x_1$ as input and outputs the context-information feature map $x_2 = f_2(x_1)$. Proceeding analogously, the context-information feature map $x_{n-1}$ output by the $(n-1)$-th aggregation layer $f_{n-1}(\cdot)$ is the input of the $n$-th aggregation layer $f_n(\cdot)$, which outputs the context-information feature map $x_n = f_n(x_{n-1})$. Taking $x, x_1, \ldots, x_n$ as inputs, the concatenation layer $g$ produces the final output $H_n(x)$ of the tree-structured feature aggregation module. Specifically, in the scene segmentation method of the present invention, the abstract feature map $f_l$ output by the feature extraction sub-network is taken as the input, and the tree-structured feature aggregation module finally outputs the aggregated feature map $f_c$, as sketched below.
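The expansion rules above can be sketched directly. In this illustration (the class name and defaults are ours), the per-layer factors follow the $(r_1, r_2)$ configurations mentioned in the experiments below, such as TFA_S $= \{(6,3), (10,7), (20,15)\}$.

```python
class TreeFeatureAggregation(nn.Module):
    """Tree-structured feature aggregation sketch: cascaded aggregation
    layers f_1..f_n (Kronecker convolution + batch norm + ReLU), where each
    layer consumes the previous layer's output; the concatenation layer g
    merges x, x_1, ..., x_n into the aggregated feature map H_n(x)."""

    def __init__(self, channels, factors=((6, 3), (10, 7), (20, 15))):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                KroneckerConv2d(channels, channels, k=3, r1=r1, r2=r2, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r1, r2 in factors
        )

    def forward(self, x):
        outputs = [x]
        for layer in self.layers:            # x_i = f_i(x_{i-1})
            outputs.append(layer(outputs[-1]))
        return torch.cat(outputs, dim=1)     # H_n(x) = g(x, x_1, ..., x_n)
```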
Step S4: through the scene segmentation sub-network, taking the aggregated feature map $f_c$ as input and obtaining the predicted scene segmentation result $J$ for the original input RGB image $I$.
The scene segmentation sub-network comprises multiple standard 3×3 convolution layers and one standard 1×1 convolution layer.
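Assembled end to end, steps S2 to S4 compose as $J = N_{seg}(f_c)$ with $f_c$ produced from $f_l = N_{fea}(I)$. A sketch of the segmentation sub-network and of the overall composition follows; the number of 3×3 layers, the channel widths, and the final bilinear upsampling to the input resolution are assumptions.

```python
class SegmentationHead(nn.Module):
    """Scene segmentation sub-network sketch: several standard 3x3
    convolution layers followed by one 1x1 convolution that produces
    per-pixel class scores."""

    def __init__(self, in_ch, mid_ch, num_classes, n_convs=2):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_convs):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1, bias=False),
                       nn.BatchNorm2d(mid_ch),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        layers.append(nn.Conv2d(ch, num_classes, kernel_size=1))
        self.head = nn.Sequential(*layers)

    def forward(self, x):
        return self.head(x)


def segment(backbone, tfa, head, image):
    """J = N_seg(TFA(N_fea(I))): run the three sub-networks in series and
    upsample the class scores to the input resolution."""
    f_l = backbone(image)   # abstract feature map, step S2
    f_c = tfa(f_l)          # aggregated feature map, step S3
    logits = head(f_c)      # step S4
    return nnf.interpolate(logits, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
```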
The present invention also discloses a scene segmentation system based on Kronecker convolution, comprising:
a Kronecker convolution layer construction module for constructing a Kronecker convolution layer having a residual structure;
a feature extraction sub-network for taking the original RGB image $I$ as input and outputting the abstract feature map $f_l$;
a tree-structured feature aggregation module for taking the abstract feature map $f_l$ as input and outputting the aggregated feature map $f_c$; and
a scene segmentation sub-network for taking the aggregated feature map $f_c$ as input and outputting the scene segmentation result $J$ of the original image $I$.
To set out the above features and effects of the present invention more clearly, the relevant experiments are described below to further illustrate the scene segmentation method of the present invention.
1. Datasets
The experiments of the present invention use the PASCAL VOC 2012 semantic segmentation dataset, the Cityscapes dataset, and the PASCAL-Context dataset.
The PASCAL VOC 2012 semantic segmentation dataset contains 20 foreground object classes and 1 background class. The original dataset contains 1464 training images, 1449 validation images, and 1456 test images; the extended training set is augmented to 10582 images. The present invention evaluates using the mean pixel-level intersection-over-union (mean IoU) over the 21 classes.
The Cityscapes dataset contains street scenes from 50 different cities. It is divided into three subsets: the training set comprises 2975 images, the validation set 500 images, and the test set 1525 images. The present invention uses the dataset's high-quality pixel-level annotations for 19 classes. Performance is measured by the mean IoU over all classes.
The PASCAL-Context dataset comprises a training set of 4998 images and a validation set of 5105 images, and provides detailed semantic annotations for entire scenes. The scene segmentation method of the present invention uses the 59 most common classes plus 1 background class.
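For reference, the mean pixel-level intersection-over-union used in these evaluations can be computed from a confusion matrix as in the following standard sketch (not code from the patent).

```python
import numpy as np


def mean_iou(conf):
    """conf[i, j] counts pixels of ground-truth class i predicted as class j.
    Per-class IoU = TP / (TP + FP + FN); mean IoU averages over the classes
    that actually occur in the ground truth or predictions."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1.0), np.nan)
    return float(np.nanmean(iou))
```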
2. Experimental validation of the effectiveness of Kronecker convolution
As shown in FIG. 5, the Kronecker convolution proposed by the present invention outperforms the corresponding dilated convolution by 0.8%, 1.7%, 0.7%, 1.5%, and 1.6%, respectively, for dilation factors from 4 to 12. These results show that the proposed Kronecker convolution performs better than dilated convolution.
3. Experimental validation of the effectiveness of the tree-structured feature aggregation module
TFA_S is a TFA configuration with relatively small factors $(r_1, r_2) = \{(6, 3), (10, 7), (20, 15)\}$.
TFA_L is a TFA configuration with relatively large factors $(r_1, r_2) = \{(10, 7), (20, 15), (30, 25)\}$.
As shown in FIG. 6, KC+TFA_S improves on the baseline model by 6.87% and on Baseline+TFA_S by 1.06%, while KC+TFA_L improves on the baseline model by 6.87% and on Baseline+TFA_L by 1.59%. This shows that both the proposed Kronecker convolution and the tree-structured feature aggregation module improve segmentation quality, and that the proposed tree-structured aggregation module has strong generalization ability.
4. Comparison with other methods
This part presents experimental results comparing the scene segmentation method of the present invention with other state-of-the-art methods.
FIG. 7 shows the experimental results of the scene segmentation method of the present invention on the PASCAL VOC 2012 dataset; FIG. 8 shows the results on the Cityscapes dataset; FIG. 9 shows the results on the PASCAL-Context dataset.
As shown in FIGS. 7, 8, and 9, the scene segmentation method of the present invention achieves very good performance on all three authoritative semantic segmentation datasets (PASCAL VOC 2012, Cityscapes, and PASCAL-Context), which further verifies the effectiveness of the present invention.
Industrial Applicability
The Kronecker convolution-based scene segmentation method and system of the present invention perform feature learning on the original image using Kronecker convolution, feed the resulting features into a tree-structured feature aggregation module to learn hierarchical context information, and then feed the resulting features and context information into the scene segmentation sub-network to obtain the scene segmentation result of the original image. The Kronecker convolution for feature extraction proposed by the present invention enlarges the receptive field of the filter without adding extra parameters and captures local information while achieving higher segmentation accuracy. In addition, the present invention proposes a tree-structured feature aggregation module to segment multi-scale objects and capture hierarchical context information, which greatly improves the performance of existing fully convolutional scene segmentation models.

Claims (10)

  1. A scene segmentation method based on Kronecker convolution, characterized by comprising:
    constructing a Kronecker convolution layer having a residual structure;
    constructing a feature extraction sub-network from the Kronecker convolution layer and standard convolution layers; taking an original image as input and outputting an abstract feature map through the feature extraction sub-network;
    constructing a tree-structured feature aggregation module from the Kronecker convolution layer; taking the abstract feature map as input and outputting an aggregated feature map through the tree-structured feature aggregation module; and
    taking the aggregated feature map as input and outputting a scene segmentation result of the original image through a scene segmentation sub-network.
  2. The scene segmentation method according to claim 1, characterized in that the Kronecker convolution layer is formalized as

    $$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

    where $K(c_1, c_2)$ is a standard convolution kernel; $c_1$ and $c_2$ are the channel indices of the Kronecker convolution layer, with $c_1 \in [1, C_A]$ and $c_2 \in [1, C_B]$; $C_A$ is the number of channels of the feature map input to $K(c_1, c_2)$; $C_B$ is the number of channels of the feature map output by $K(c_1, c_2)$; $F$ is a two-dimensional expansion matrix such that when $K(c_1, c_2)$ is of size $k \times k$, $K_1(c_1, c_2)$ is expanded to size $(2k+1) r_1 \times (2k+1) r_1$; $k$ is the kernel size of the standard convolution; $r_1$ is the dilation factor of the Kronecker convolution layer; $r_2$ is the sharing factor of the Kronecker convolution layer; and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers.
  3. The scene segmentation method according to claim 1, wherein the feature extraction sub-network comprises 5 stages: stage 1 comprises 3 cascaded 3×3 standard convolution layers, stage 2 comprises a plurality of cascaded first bottleneck modules, stage 3 comprises a plurality of cascaded first bottleneck modules, stage 4 comprises a plurality of cascaded second bottleneck modules, and stage 5 comprises a plurality of cascaded second bottleneck modules; wherein
    the first bottleneck module comprises a cascaded 1×1 standard convolution layer, 3×3 standard convolution layer and 1×1 standard convolution layer;
    the second bottleneck module comprises a cascaded 1×1 standard convolution layer, Kronecker convolution layer and 1×1 standard convolution layer.
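By way of illustration, minimal sketches of the two bottleneck modules of claim 3, assuming PyTorch; the channel widths are arbitrary, and the Kronecker convolution inside the second bottleneck is approximated by a dilated 3×3 convolution purely so the sketch runs.

```python
import torch.nn as nn

def bottleneck_first(c_in, c_mid, c_out):
    # First bottleneck module: 1x1 -> 3x3 -> 1x1 standard convolutions.
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1),
        nn.Conv2d(c_mid, c_mid, 3, padding=1),
        nn.Conv2d(c_mid, c_out, 1),
    )

def bottleneck_second(c_in, c_mid, c_out, r1=2):
    # Second bottleneck module: 1x1 -> Kronecker conv -> 1x1; a dilated
    # 3x3 convolution stands in for the Kronecker convolution layer here.
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1),
        nn.Conv2d(c_mid, c_mid, 3, padding=r1, dilation=r1),
        nn.Conv2d(c_mid, c_out, 1),
    )

# Stage layout per claim 3: stage 1 = three cascaded 3x3 standard convs;
# stages 2-3 = cascaded first bottleneck modules; stages 4-5 = cascaded
# second bottleneck modules.
```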
  4. The scene segmentation method according to claim 1, wherein the tree-structured feature aggregation module comprises cascaded aggregation layers, each aggregation layer comprising the Kronecker convolution layer, a batch normalization layer and a ReLU activation function, with the output of each aggregation layer serving as the input of the next aggregation layer; the outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
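For illustration, a minimal sketch of the tree-structured feature aggregation module of claim 4, assuming PyTorch; the number of aggregation layers and the dilation rates are arbitrary choices, and a dilated 3×3 convolution again stands in for the Kronecker convolution layer.

```python
import torch
import torch.nn as nn

class TreeAggregation(nn.Module):
    def __init__(self, channels, dilations=(2, 4, 8)):
        super().__init__()
        # Each aggregation layer: (Kronecker-style) conv -> BN -> ReLU.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        outs, h = [x], x
        for layer in self.layers:
            h = layer(h)    # the output of each aggregation layer is also...
            outs.append(h)  # ...the input of the next aggregation layer
        # Merge all aggregation-layer outputs with the abstract feature map.
        return torch.cat(outs, dim=1)
```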
  5. The scene segmentation method according to claim 1, wherein the scene segmentation sub-network comprises a plurality of cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
  6. A scene segmentation system based on Kronecker convolution, comprising:
    a Kronecker convolution layer construction module for constructing a Kronecker convolution layer with a residual structure;
    a feature extraction sub-network for receiving an original image and outputting an abstract feature map, wherein the feature extraction sub-network comprises the Kronecker convolution layer and standard convolution layers;
    a tree-structured feature aggregation module for receiving the abstract feature map and outputting an aggregated feature map, wherein the tree-structured feature aggregation module comprises multiple layers of the Kronecker convolution layer;
    a scene segmentation sub-network for receiving the aggregated feature map and outputting a scene segmentation result of the original image, wherein the scene segmentation sub-network comprises multiple layers of the Kronecker convolution layer.
  7. The scene segmentation system according to claim 6, wherein the Kronecker convolution layer is formalized as
    K₁(c₁, c₂) = K(c₁, c₂) ⊗ F
    wherein ⊗ denotes the Kronecker product; K(c₁, c₂) is a standard convolution kernel; c₁ and c₂ are the channel indices of the Kronecker convolution layer, with c₁ ∈ [1, C_A] and c₂ ∈ [1, C_B]; C_A is the number of channels of the feature map input to K(c₁, c₂), and C_B is the number of channels of the feature map output by K(c₁, c₂); F is a two-dimensional expansion matrix such that, when K(c₁, c₂) is of size k × k, K₁(c₁, c₂) is expanded to size (2k+1)r₁ × (2k+1)r₁; k is the kernel size of the standard convolution, r₁ is the dilation factor of the Kronecker convolution layer, and r₂ is the sharing factor of the Kronecker convolution layer; c₁, c₂, C_A, C_B, k, r₁ and r₂ are positive integers.
  8. The scene segmentation system according to claim 6, wherein the feature extraction sub-network comprises 5 sub-modules: sub-module 1 comprises 3 cascaded 3×3 standard convolution layers, sub-module 2 comprises a plurality of cascaded first bottleneck modules, sub-module 3 comprises a plurality of cascaded first bottleneck modules, sub-module 4 comprises a plurality of cascaded second bottleneck modules, and sub-module 5 comprises a plurality of cascaded second bottleneck modules; wherein
    the first bottleneck module comprises a cascaded 1×1 standard convolution layer, 3×3 standard convolution layer and 1×1 standard convolution layer;
    the second bottleneck module comprises a cascaded 1×1 standard convolution layer, Kronecker convolution layer and 1×1 standard convolution layer.
  9. The scene segmentation system according to claim 6, wherein the tree-structured feature aggregation module comprises cascaded aggregation layers, each aggregation layer comprising the Kronecker convolution layer, a batch normalization layer and a ReLU activation function, with the output of each aggregation layer serving as the input of the next aggregation layer; the outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
  10. The scene segmentation system according to claim 6, wherein the scene segmentation sub-network comprises a plurality of cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
PCT/CN2018/114007 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system WO2020093211A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Publications (1)

Publication Number Publication Date
WO2020093211A1 (en)

Family

ID=70610769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Country Status (1)

Country Link
WO (1) WO2020093211A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068644A (en) * 2015-07-24 2015-11-18 山东大学 Method for detecting P300 electroencephalogram based on convolutional neural network
CN105894045A (en) * 2016-05-06 2016-08-24 电子科技大学 Vehicle type recognition method with deep network model based on spatial pyramid pooling
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN106841216A (en) * 2017-02-28 2017-06-13 浙江工业大学 Tunnel defect automatic identification equipment based on panoramic picture CNN
CN107577737A (en) * 2017-08-25 2018-01-12 北京百度网讯科技有限公司 Method and apparatus for pushed information


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939207

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939207

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 27.09.2021)