WO2020093211A1 - Kronecker convolution-based scene segmentation method and system - Google Patents

Kronecker convolution-based scene segmentation method and system

Info

Publication number
WO2020093211A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
kronecker
convolution
scene segmentation
standard
Prior art date
Application number
PCT/CN2018/114007
Other languages
French (fr)
Chinese (zh)
Inventor
唐胜
伍天意
李锦涛
Original Assignee
中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority date
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority to PCT/CN2018/114007
Publication of WO2020093211A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion


Abstract

A Kronecker convolution-based scene segmentation method, comprising: constructing a Kronecker convolution layer having a residual structure; constructing a feature extraction sub-network on the basis of the Kronecker convolution layer and standard convolution layers, taking an original image as an input, and outputting an abstract feature map by means of the feature extraction sub-network; constructing a tree-structured feature aggregation module on the basis of the Kronecker convolution layer, taking the abstract feature map as an input, and outputting an aggregated feature map by means of the tree-structured feature aggregation module; and constructing a scene segmentation sub-network on the basis of the Kronecker convolution layer, taking said aggregated feature map as an input, and outputting a scene segmentation result of the original image by means of the scene segmentation sub-network.

Description

Scene segmentation method and system based on Kronecker convolution

Technical Field
The method belongs to the fields of machine learning and computer vision, and particularly relates to a scene segmentation method and system based on Kronecker convolution and a tree-structured feature aggregation module.
Background Art
Scene segmentation is an important and highly challenging task in computer vision, with broad application value in production and daily life, for example in autonomous driving, robot navigation, and video editing. The goal of scene segmentation is to assign each pixel in a scene image to its category. Recently, scene segmentation methods based on fully convolutional networks have made significant progress. The current mainstream approach is to adapt a classification network by removing its max-pooling and fully connected layers and adding deconvolution layers to generate the segmentation result. However, there are substantial differences between classification and segmentation. For example, a classic classification network downsamples the original input by a factor of 32, which helps extract features better suited to classification but causes the network to discard position information; segmentation, on the contrary, requires very precise position information, down to the pixel level. Researchers have proposed dilated convolution, which alleviates this problem to some extent: it enlarges the receptive field of the filter while maintaining the resolution of the feature map, and has achieved fairly good segmentation performance. However, dilated convolution has a drawback: when the dilation factor is large, it loses a great deal of local detail. In particular, when the dilation factor approaches the size of the feature map, a 3×3 convolution degenerates into a 1×1 convolution.
In addition, for a scene segmentation network, objects in the scene often appear at multiple scales, and scenes have a hierarchical structure. For example, in the Cityscapes dataset, the car at the center of an image is usually a distant car at a small scale, whereas the regions on both sides of the image usually contain nearby cars at a large scale. To address these two problems, many existing methods apply dilated convolution in the basic feature extraction sub-network and then use cross-layer feature fusion to segment multi-scale objects. However, the local detail lost by dilated convolution and the simplicity of cross-layer fusion for segmenting multi-scale objects limit segmentation performance to a certain extent.
Disclosure of the Invention
In view of the above problems, the present invention proposes a scene segmentation method based on Kronecker convolution, comprising: constructing a Kronecker convolution layer having a residual structure; constructing a feature extraction sub-network from the Kronecker convolution layer and standard convolution layers; taking the original image as input and outputting an abstract feature map through the feature extraction sub-network; constructing a tree-structured feature aggregation module from the Kronecker convolution layer; taking the abstract feature map as input and outputting an aggregated feature map through the tree-structured feature aggregation module; constructing a scene segmentation sub-network from the Kronecker convolution layer; and taking the aggregated feature map as input and outputting the scene segmentation result of the original image through the scene segmentation sub-network.
Further, the Kronecker convolution layer is formalized as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

where $K(c_1, c_2)$ is a standard convolution kernel; $c_1$ and $c_2$ are the channel indices of the Kronecker convolution layer, with $c_1 \in [1, C_A]$ and $c_2 \in [1, C_B]$; $C_A$ is the number of channels of the feature map input to $K(c_1, c_2)$; $C_B$ is the number of channels of the feature map output by $K(c_1, c_2)$; and $F$ is a two-dimensional expansion matrix such that when $K(c_1, c_2)$ is of size $k \times k$, $K_1(c_1, c_2)$ is expanded to size $(2k+1) r_1 \times (2k+1) r_1$. Here $k$ is the kernel size of the standard convolution, $r_1$ is the dilation factor of the Kronecker convolution layer, $r_2$ is the sharing factor of the Kronecker convolution layer, and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers.
Further, the feature extraction sub-network comprises five stages. Stage 1 comprises three cascaded 3×3 standard convolution layers; stage 2 comprises multiple cascaded first bottleneck modules; stage 3 comprises multiple cascaded first bottleneck modules; stage 4 comprises multiple cascaded second bottleneck modules; and stage 5 comprises multiple cascaded second bottleneck modules. The first bottleneck module comprises, in cascade, a 1×1 standard convolution layer, a 3×3 standard convolution layer, and a 1×1 standard convolution layer; the second bottleneck module comprises, in cascade, a 1×1 standard convolution layer, a Kronecker convolution layer, and a 1×1 standard convolution layer.
Further, the tree-structured feature aggregation module comprises cascaded aggregation layers. Each aggregation layer comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function, and the output of each aggregation layer serves as the input of the next aggregation layer. The outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
Further, the scene segmentation sub-network comprises multiple cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
The present invention also discloses a scene segmentation system based on Kronecker convolution, comprising:
a Kronecker convolution layer construction module for constructing a Kronecker convolution layer having a residual structure;
a feature extraction sub-network for taking an original image as input and outputting an abstract feature map, wherein the feature extraction sub-network comprises the Kronecker convolution layer and standard convolution layers;
a tree-structured feature aggregation module for taking the abstract feature map as input and outputting an aggregated feature map, wherein the tree-structured feature aggregation module comprises multiple Kronecker convolution layers; and
a scene segmentation sub-network for taking the aggregated feature map as input and outputting the scene segmentation result of the original image, wherein the scene segmentation sub-network comprises multiple Kronecker convolution layers.
Brief Description of the Drawings
FIG. 1 is an overall framework diagram of the Kronecker convolution-based scene segmentation method of the present invention.
FIG. 2A is a schematic diagram of dilated convolution in the prior art;
FIG. 2B is a schematic diagram of the Kronecker convolution of the present invention;
FIG. 3 is a schematic structural diagram of the feature extraction sub-network proposed by the present invention;
FIG. 4 is a schematic diagram of the tree-structured feature aggregation module proposed by the present invention;
FIGS. 5 and 6 compare the performance of the scene segmentation method of the present invention with that of the prior art.
FIG. 7 shows experimental results of the scene segmentation method of the present invention on the PASCAL VOC 2012 dataset.
FIG. 8 shows experimental results of the scene segmentation method of the present invention on the Cityscapes dataset.
FIG. 9 shows experimental results of the scene segmentation method of the present invention on the PASCAL-Context dataset.
Best Mode for Carrying Out the Invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the Kronecker convolution-based scene segmentation method and system proposed by the present invention are described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here are intended only to explain the present invention and are not intended to limit it.
The Kronecker convolution-based scene segmentation method and system of the present invention perform feature learning on the original image using Kronecker convolution, feed the resulting features into a tree-structured feature aggregation module to learn hierarchical context information, and then feed the resulting features and context information into a scene segmentation sub-network to obtain the scene segmentation result of the original image. The present invention proposes a Kronecker convolution for feature extraction that enlarges the receptive field of the filter without adding extra parameters and captures local information, thereby achieving higher segmentation accuracy. In addition, the present invention proposes a tree-structured feature aggregation module to segment multi-scale objects and capture hierarchical context information, which greatly improves the performance of existing fully convolutional scene segmentation models.
The Kronecker product is a special form of the tensor product, namely an operation between two matrices of arbitrary size. The Kronecker convolution kernel is formally expressed as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

where $K(c_1, c_2)$ is a standard convolution kernel, $c_1 \in [1, C_A]$, $c_2 \in [1, C_B]$, and $C_A$ and $C_B$ are the numbers of channels of the convolution's input and output feature maps, respectively. The matrix $F$ combines an all-ones matrix of size $r_2 \times r_2$ in its upper-right corner with a zero matrix of size $(r_1 - r_2) \times (r_1 - r_2)$ in its lower-right corner. Assuming the standard convolution kernel is $k \times k$ (with taps indexed by $i, j \in [-k, k]$, consistent with the sampling formula below), the Kronecker convolution kernel is expanded to size $(2k+1) r_1 \times (2k+1) r_1$. Here $r_1$ and $r_2$ are the two hyperparameters of the Kronecker convolution layer proposed by the present invention: $r_1$ is the dilation factor of the Kronecker convolution layer and $r_2$ is its sharing factor; $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers, and $\otimes$ denotes the Kronecker product.
Assume the center coordinate of the convolution block on the input feature map corresponding to the standard convolution is $(p_t, q_t)$. The sampling points $(x_{ijuv}, y_{ijuv})$ on the input feature map $Y_t$ are then

$$x_{ijuv} = p_t + i r_1 + u, \qquad y_{ijuv} = q_t + j r_1 + v$$

where $i, j \in [-k, k] \cap \mathbb{Z}$ and $u, v \in [0, r_2 - 1] \cap \mathbb{Z}$.

The corresponding Kronecker convolution operation is formalized as

$$B_t = \sum_{i=-k}^{k} \sum_{j=-k}^{k} \sum_{u=0}^{r_2 - 1} \sum_{v=0}^{r_2 - 1} \mathbf{k}_{ij}^{\mathsf{T}} \, \mathbf{y}_{x_{ijuv},\, y_{ijuv}} + b$$

where $(p_t, q_t)$ is the spatial position index of the input feature map $Y_t$; $B_t$ is the output feature map; $\mathbf{y}_{x, y} \in \mathbb{R}^{C_A}$ is the feature vector of the input feature map $Y_t$ at position $(x, y)$; $\mathbf{k}_{ij}$ collects the Kronecker convolution kernel parameters at tap $(i, j)$, shared over the $r_2 \times r_2$ block indexed by $(u, v)$; $b$ is a bias vector; and $\mathbb{R}^{C_A}$ denotes the $C_A$-dimensional space.
FIG. 1 is an overall framework diagram of the Kronecker convolution-based scene segmentation method of the present invention. As shown in FIG. 1, the Kronecker convolution-based scene segmentation method of the present invention specifically comprises the following steps.
Step S1: constructing a Kronecker convolution layer.
The present invention proposes a new convolution, Kronecker convolution, which enlarges the receptive field of standard convolution without increasing its number of parameters. Moreover, the Kronecker convolution proposed by the present invention is compatible with the entire scene segmentation network: it can be inserted into the scene segmentation network to form a complete structure and be trained end to end. Here, "end to end" is a term of art meaning that, within the structure of the scene segmentation network, the mapping from the original image input to the final output can be realized by a single unified network, without needing to be split into multiple training stages.
FIG. 2A is a schematic diagram of dilated convolution in the prior art, and FIG. 2B is a schematic diagram of the Kronecker convolution of the present invention. FIG. 2A shows a 3×3 dilated convolution, where $f$ is the dilation factor of the dilated convolution. As shown in FIG. 2B, the Kronecker convolution kernel is formally expressed as

$$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

with the standard kernel $K(c_1, c_2)$, the expansion matrix $F$, the channel counts $C_A$ and $C_B$, the kernel size $k$, the dilation factor $r_1$, and the sharing factor $r_2$ all defined as above, and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ positive integers.
Step S2: through the feature extraction sub-network, taking the original RGB image $I$ as input and outputting an abstract feature map $f_l$.
FIG. 3 is a schematic structural diagram of the feature extraction sub-network proposed by the present invention. As shown in FIG. 3, in the scene segmentation method of the present invention the feature extraction sub-network comprises five stages, each consisting of multiple standard convolution layers, or of multiple standard convolutions together with multiple Kronecker convolutions. Notably, in the later stages of the feature extraction sub-network the feature maps have very many channels: typically 1024 feature channels in stage 4 and 2048 in stage 5. If Kronecker convolution were applied directly to re-learn these features, the huge number of parameters would contain a great deal of redundancy, while also slowing down the segmentation speed of the whole scene segmentation network and increasing its computational complexity. To solve this problem, the present invention embeds the Kronecker convolution in a structure with a "bottleneck", which may be called a bottleneck module. The beginning and end of the bottleneck module are standard 1×1 convolution layers: the 1×1 convolution layer at the beginning reduces the number of channels of the input feature map, and the 1×1 convolution layer at the end restores the number of channels of the output feature map. The bottleneck module greatly reduces the number of parameters of the feature extraction sub-network.
In the scene segmentation network of the present invention, stage 1 of the feature extraction sub-network comprises three standard 3×3 convolution layers arranged in sequence. Stages 2 to 5 comprise multiple bottleneck modules: stages 2 and 3 use one kind of bottleneck module, called the first bottleneck module, which comprises two standard 1×1 convolution layers and one standard 3×3 convolution layer, while stages 4 and 5 use another kind, called the second bottleneck module, which comprises two standard 1×1 convolution layers and one Kronecker convolution layer. The original RGB image $I$ is taken as the input of stage 1, yielding feature map 1 as the output of stage 1; feature map 1 is taken as the input of stage 2, and so on: feature map 2 output by stage 2, feature map 3 output by stage 3, and feature map 4 output by stage 4 are each fed to the next stage, yielding feature map 3 from stage 3, feature map 4 from stage 4, and feature map 5 from stage 5. Feature map 5 is taken as the abstract feature map $f_l$.
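For illustration, the second bottleneck module described above might be sketched as follows, reusing the `KroneckerConv2d` sketch from step S1; the reduced channel width and the placement of batch normalization and ReLU are assumptions in the usual residual-network style.

```python
class SecondBottleneck(nn.Module):
    """Bottleneck module for stages 4 and 5: a 1x1 convolution reduces the
    channel count, a Kronecker convolution enlarges the receptive field,
    and a final 1x1 convolution restores the channel count; a residual
    connection adds the input back to the output."""

    def __init__(self, channels, reduced, r1, r2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            KroneckerConv2d(reduced, reduced, k=3, r1=r1, r2=r2, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual structure
```

The first bottleneck module used in stages 2 and 3 is identical except that the Kronecker convolution layer is replaced by a standard 3×3 convolution.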
Step S3: through the tree-structured feature aggregation module, taking the abstract feature map $f_l$ as input and outputting the aggregated feature map $f_c$.
Most current scene segmentation frameworks are based on fully convolutional networks, which mainly comprise two sub-networks in series: a feature extraction sub-network and a scene segmentation sub-network. Given an original scene image $I$, the scene segmentation network $N$ produces the scene segmentation result $J$ of $I$. The network $N$ can be decomposed into a feature extraction sub-network $N_{fea}$ and a scene segmentation sub-network $N_{seg}$, so that $J = N_{seg}(N_{fea}(I))$, where $N_{fea}(I)$ denotes the abstract feature maps $f_l$ obtained from the feature extraction sub-network; these feature maps contain the semantic concepts and spatial position information learned from the original scene image $I$.
The scene segmentation method of the present invention inserts a tree-structured feature aggregation module between the feature extraction sub-network and the scene segmentation sub-network. FIG. 4 is a schematic structural diagram of the tree-structured feature aggregation module proposed by the present invention. As shown in FIG. 4, the tree-structured feature aggregation module comprises multiple cascaded aggregation layers; each aggregation layer comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function, and the output of each aggregation layer serves as the input of the next aggregation layer. The outputs of all aggregation layers are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map. The tree-structured feature aggregation module thus comprises multiple Kronecker convolution layers applied in a cascaded, recursive manner. The tree-structured feature aggregation module of the present invention has the following expansion rules.
The first aggregation layer $f_1(\cdot)$ of the tree-structured feature aggregation module takes the feature map $x$ output by the preceding sub-network as input and outputs the context-information feature map $x_1 = f_1(x)$, where $f_1(\cdot)$ comprises a Kronecker convolution layer, a batch normalization layer, and a ReLU activation function. The second aggregation layer $f_2(\cdot)$ takes $x_1$ as input and outputs the context-information feature map $x_2 = f_2(x_1)$. Proceeding analogously, the context-information feature map $x_{n-1}$ output by the $(n-1)$-th aggregation layer $f_{n-1}(\cdot)$ is the input of the $n$-th aggregation layer $f_n(\cdot)$, which outputs the context-information feature map $x_n = f_n(x_{n-1})$. Taking $x, x_1, \ldots, x_n$ as inputs, the concatenation layer $g$ produces the final output $H_n(x)$ of the tree-structured feature aggregation module. Specifically, in the scene segmentation method of the present invention, the abstract feature map $f_l$ output by the feature extraction sub-network is taken as the input, and the tree-structured feature aggregation module finally outputs the aggregated feature map $f_c$, as sketched below.
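The expansion rules above can be sketched directly. In this illustration (the class name and defaults are ours), the per-layer factors follow the $(r_1, r_2)$ configurations mentioned in the experiments below, such as TFA_S $= \{(6,3), (10,7), (20,15)\}$.

```python
class TreeFeatureAggregation(nn.Module):
    """Tree-structured feature aggregation sketch: cascaded aggregation
    layers f_1..f_n (Kronecker convolution + batch norm + ReLU), where each
    layer consumes the previous layer's output; the concatenation layer g
    merges x, x_1, ..., x_n into the aggregated feature map H_n(x)."""

    def __init__(self, channels, factors=((6, 3), (10, 7), (20, 15))):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                KroneckerConv2d(channels, channels, k=3, r1=r1, r2=r2, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r1, r2 in factors
        )

    def forward(self, x):
        outputs = [x]
        for layer in self.layers:            # x_i = f_i(x_{i-1})
            outputs.append(layer(outputs[-1]))
        return torch.cat(outputs, dim=1)     # H_n(x) = g(x, x_1, ..., x_n)
```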
Step S4: through the scene segmentation sub-network, taking the aggregated feature map $f_c$ as input and obtaining the predicted scene segmentation result $J$ for the original input RGB image $I$.
The scene segmentation sub-network comprises multiple standard 3×3 convolution layers and one standard 1×1 convolution layer.
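Assembled end to end, steps S2 to S4 compose as $J = N_{seg}(f_c)$ with $f_c$ produced from $f_l = N_{fea}(I)$. A sketch of the segmentation sub-network and of the overall composition follows; the number of 3×3 layers, the channel widths, and the final bilinear upsampling to the input resolution are assumptions.

```python
class SegmentationHead(nn.Module):
    """Scene segmentation sub-network sketch: several standard 3x3
    convolution layers followed by one 1x1 convolution that produces
    per-pixel class scores."""

    def __init__(self, in_ch, mid_ch, num_classes, n_convs=2):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_convs):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1, bias=False),
                       nn.BatchNorm2d(mid_ch),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        layers.append(nn.Conv2d(ch, num_classes, kernel_size=1))
        self.head = nn.Sequential(*layers)

    def forward(self, x):
        return self.head(x)


def segment(backbone, tfa, head, image):
    """J = N_seg(TFA(N_fea(I))): run the three sub-networks in series and
    upsample the class scores to the input resolution."""
    f_l = backbone(image)   # abstract feature map, step S2
    f_c = tfa(f_l)          # aggregated feature map, step S3
    logits = head(f_c)      # step S4
    return nnf.interpolate(logits, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
```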
The present invention also discloses a scene segmentation system based on Kronecker convolution, comprising:
a Kronecker convolution layer construction module for constructing a Kronecker convolution layer having a residual structure;
a feature extraction sub-network for taking the original RGB image $I$ as input and outputting the abstract feature map $f_l$;
a tree-structured feature aggregation module for taking the abstract feature map $f_l$ as input and outputting the aggregated feature map $f_c$; and
a scene segmentation sub-network for taking the aggregated feature map $f_c$ as input and outputting the scene segmentation result $J$ of the original image $I$.
To set out the above features and effects of the present invention more clearly, the relevant experiments are described below to further illustrate the scene segmentation method of the present invention.
1. Datasets
The experiments of the present invention use the PASCAL VOC 2012 semantic segmentation dataset, the Cityscapes dataset, and the PASCAL-Context dataset.
The PASCAL VOC 2012 semantic segmentation dataset contains 20 foreground object classes and 1 background class. The original dataset contains 1464 training images, 1449 validation images, and 1456 test images; the extended training set is augmented to 10582 images. The present invention evaluates using the mean pixel-level intersection-over-union (mean IoU) over the 21 classes.
The Cityscapes dataset contains street scenes from 50 different cities. It is divided into three subsets: the training set comprises 2975 images, the validation set 500 images, and the test set 1525 images. The present invention uses the dataset's high-quality pixel-level annotations for 19 classes. Performance is measured by the mean IoU over all classes.
The PASCAL-Context dataset comprises a training set of 4998 images and a validation set of 5105 images, and provides detailed semantic annotations for entire scenes. The scene segmentation method of the present invention uses the 59 most common classes plus 1 background class.
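For reference, the mean pixel-level intersection-over-union used in these evaluations can be computed from a confusion matrix as in the following standard sketch (not code from the patent).

```python
import numpy as np


def mean_iou(conf):
    """conf[i, j] counts pixels of ground-truth class i predicted as class j.
    Per-class IoU = TP / (TP + FP + FN); mean IoU averages over the classes
    that actually occur in the ground truth or predictions."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1.0), np.nan)
    return float(np.nanmean(iou))
```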
2. Experimental validation of the effectiveness of Kronecker convolution
As shown in FIG. 5, the Kronecker convolution proposed by the present invention outperforms the corresponding dilated convolution by 0.8%, 1.7%, 0.7%, 1.5%, and 1.6%, respectively, for dilation factors from 4 to 12. These results show that the proposed Kronecker convolution performs better than dilated convolution.
3. Experimental validation of the effectiveness of the tree-structured feature aggregation module
TFA_S is a TFA configuration with relatively small factors $(r_1, r_2) = \{(6, 3), (10, 7), (20, 15)\}$.
TFA_L is a TFA configuration with relatively large factors $(r_1, r_2) = \{(10, 7), (20, 15), (30, 25)\}$.
As shown in FIG. 6, KC+TFA_S improves on the baseline model by 6.87% and on Baseline+TFA_S by 1.06%, while KC+TFA_L improves on the baseline model by 6.87% and on Baseline+TFA_L by 1.59%. This shows that both the proposed Kronecker convolution and the tree-structured feature aggregation module improve segmentation quality, and that the proposed tree-structured aggregation module has strong generalization ability.
4. Comparison with other methods
This part presents experimental results comparing the scene segmentation method of the present invention with other state-of-the-art methods.
FIG. 7 shows the experimental results of the scene segmentation method of the present invention on the PASCAL VOC 2012 dataset; FIG. 8 shows the results on the Cityscapes dataset; FIG. 9 shows the results on the PASCAL-Context dataset.
As shown in FIGS. 7, 8, and 9, the scene segmentation method of the present invention achieves very good performance on all three authoritative semantic segmentation datasets (PASCAL VOC 2012, Cityscapes, and PASCAL-Context), which further verifies the effectiveness of the present invention.
Industrial Applicability
The Kronecker convolution-based scene segmentation method and system of the present invention perform feature learning on the original image using Kronecker convolution, feed the resulting features into a tree-structured feature aggregation module to learn hierarchical context information, and then feed the resulting features and context information into the scene segmentation sub-network to obtain the scene segmentation result of the original image. The Kronecker convolution for feature extraction proposed by the present invention enlarges the receptive field of the filter without adding extra parameters and captures local information while achieving higher segmentation accuracy. In addition, the present invention proposes a tree-structured feature aggregation module to segment multi-scale objects and capture hierarchical context information, which greatly improves the performance of existing fully convolutional scene segmentation models.

Claims (10)

  1. A scene segmentation method based on Kronecker convolution, characterized by comprising:
    constructing a Kronecker convolution layer having a residual structure;
    constructing a feature extraction sub-network from the Kronecker convolution layer and standard convolution layers; taking an original image as input and outputting an abstract feature map through the feature extraction sub-network;
    constructing a tree-structured feature aggregation module from the Kronecker convolution layer; taking the abstract feature map as input and outputting an aggregated feature map through the tree-structured feature aggregation module; and
    taking the aggregated feature map as input and outputting a scene segmentation result of the original image through a scene segmentation sub-network.
  2. The scene segmentation method according to claim 1, characterized in that the Kronecker convolution layer is formalized as

    $$K_1(c_1, c_2) = K(c_1, c_2) \otimes F$$

    where $K(c_1, c_2)$ is a standard convolution kernel; $c_1$ and $c_2$ are the channel indices of the Kronecker convolution layer, with $c_1 \in [1, C_A]$ and $c_2 \in [1, C_B]$; $C_A$ is the number of channels of the feature map input to $K(c_1, c_2)$; $C_B$ is the number of channels of the feature map output by $K(c_1, c_2)$; $F$ is a two-dimensional expansion matrix such that when $K(c_1, c_2)$ is of size $k \times k$, $K_1(c_1, c_2)$ is expanded to size $(2k+1) r_1 \times (2k+1) r_1$; $k$ is the kernel size of the standard convolution; $r_1$ is the dilation factor of the Kronecker convolution layer; $r_2$ is the sharing factor of the Kronecker convolution layer; and $c_1$, $c_2$, $C_A$, $C_B$, $k$, $r_1$, $r_2$ are positive integers.
  3. The scene segmentation method according to claim 1, wherein the feature extraction sub-network comprises 5 stages: stage 1 comprises 3 cascaded 3×3 standard convolution layers, stage 2 comprises a plurality of cascaded first bottleneck modules, stage 3 comprises a plurality of cascaded first bottleneck modules, stage 4 comprises a plurality of cascaded second bottleneck modules, and stage 5 comprises a plurality of cascaded second bottleneck modules; wherein
    the first bottleneck module comprises a cascaded 1×1 standard convolution layer, 3×3 standard convolution layer and 1×1 standard convolution layer;
    the second bottleneck module comprises a cascaded 1×1 standard convolution layer, Kronecker convolution layer and 1×1 standard convolution layer.
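By way of illustration, minimal sketches of the two bottleneck modules of claim 3, assuming PyTorch; the channel widths are arbitrary, and the Kronecker convolution inside the second bottleneck is approximated by a dilated 3×3 convolution purely so the sketch runs.

```python
import torch.nn as nn

def bottleneck_first(c_in, c_mid, c_out):
    # First bottleneck module: 1x1 -> 3x3 -> 1x1 standard convolutions.
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1),
        nn.Conv2d(c_mid, c_mid, 3, padding=1),
        nn.Conv2d(c_mid, c_out, 1),
    )

def bottleneck_second(c_in, c_mid, c_out, r1=2):
    # Second bottleneck module: 1x1 -> Kronecker conv -> 1x1; a dilated
    # 3x3 convolution stands in for the Kronecker convolution layer here.
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1),
        nn.Conv2d(c_mid, c_mid, 3, padding=r1, dilation=r1),
        nn.Conv2d(c_mid, c_out, 1),
    )

# Stage layout per claim 3: stage 1 = three cascaded 3x3 standard convs;
# stages 2-3 = cascaded first bottleneck modules; stages 4-5 = cascaded
# second bottleneck modules.
```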
  4. The scene segmentation method according to claim 1, wherein the tree-structured feature aggregation module comprises cascaded aggregation layers, each aggregation layer comprising the Kronecker convolution layer, a batch normalization layer and a ReLU activation function, with the output of each aggregation layer serving as the input of the next aggregation layer; the outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
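For illustration, a minimal sketch of the tree-structured feature aggregation module of claim 4, assuming PyTorch; the number of aggregation layers and the dilation rates are arbitrary choices, and a dilated 3×3 convolution again stands in for the Kronecker convolution layer.

```python
import torch
import torch.nn as nn

class TreeAggregation(nn.Module):
    def __init__(self, channels, dilations=(2, 4, 8)):
        super().__init__()
        # Each aggregation layer: (Kronecker-style) conv -> BN -> ReLU.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        outs, h = [x], x
        for layer in self.layers:
            h = layer(h)    # the output of each aggregation layer is also...
            outs.append(h)  # ...the input of the next aggregation layer
        # Merge all aggregation-layer outputs with the abstract feature map.
        return torch.cat(outs, dim=1)
```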
  5. The scene segmentation method according to claim 1, wherein the scene segmentation sub-network comprises a plurality of cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
  6. A scene segmentation system based on Kronecker convolution, comprising:
    a Kronecker convolution layer construction module for constructing a Kronecker convolution layer with a residual structure;
    a feature extraction sub-network for receiving an original image and outputting an abstract feature map, wherein the feature extraction sub-network comprises the Kronecker convolution layer and standard convolution layers;
    a tree-structured feature aggregation module for receiving the abstract feature map and outputting an aggregated feature map, wherein the tree-structured feature aggregation module comprises multiple layers of the Kronecker convolution layer;
    a scene segmentation sub-network for receiving the aggregated feature map and outputting a scene segmentation result of the original image, wherein the scene segmentation sub-network comprises multiple layers of the Kronecker convolution layer.
  7. The scene segmentation system according to claim 6, wherein the Kronecker convolution layer is formalized as
    K₁(c₁, c₂) = K(c₁, c₂) ⊗ F
    wherein ⊗ denotes the Kronecker product; K(c₁, c₂) is a standard convolution kernel; c₁ and c₂ are the channel indices of the Kronecker convolution layer, with c₁ ∈ [1, C_A] and c₂ ∈ [1, C_B]; C_A is the number of channels of the feature map input to K(c₁, c₂), and C_B is the number of channels of the feature map output by K(c₁, c₂); F is a two-dimensional expansion matrix such that, when K(c₁, c₂) is of size k × k, K₁(c₁, c₂) is expanded to size (2k+1)r₁ × (2k+1)r₁; k is the kernel size of the standard convolution, r₁ is the dilation factor of the Kronecker convolution layer, and r₂ is the sharing factor of the Kronecker convolution layer; c₁, c₂, C_A, C_B, k, r₁ and r₂ are positive integers.
  8. The scene segmentation system according to claim 6, wherein the feature extraction sub-network comprises 5 sub-modules: sub-module 1 comprises 3 cascaded 3×3 standard convolution layers, sub-module 2 comprises a plurality of cascaded first bottleneck modules, sub-module 3 comprises a plurality of cascaded first bottleneck modules, sub-module 4 comprises a plurality of cascaded second bottleneck modules, and sub-module 5 comprises a plurality of cascaded second bottleneck modules; wherein
    the first bottleneck module comprises a cascaded 1×1 standard convolution layer, 3×3 standard convolution layer and 1×1 standard convolution layer;
    the second bottleneck module comprises a cascaded 1×1 standard convolution layer, Kronecker convolution layer and 1×1 standard convolution layer.
  9. The scene segmentation system according to claim 6, wherein the tree-structured feature aggregation module comprises cascaded aggregation layers, each aggregation layer comprising the Kronecker convolution layer, a batch normalization layer and a ReLU activation function, with the output of each aggregation layer serving as the input of the next aggregation layer; the outputs of all aggregation layers in the tree-structured feature aggregation module are merged with the abstract feature map through a concatenation layer to obtain the aggregated feature map.
  10. The scene segmentation system according to claim 6, wherein the scene segmentation sub-network comprises a plurality of cascaded 3×3 standard convolution layers and one 1×1 standard convolution layer.
PCT/CN2018/114007 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system WO2020093211A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Publications (1)

Publication Number Publication Date
WO2020093211A1 (en)

Family

ID=70610769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114007 WO2020093211A1 (en) 2018-11-05 2018-11-05 Kronecker convolution-based scene segmentation method and system

Country Status (1)

Country Link
WO (1) WO2020093211A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068644A (en) * 2015-07-24 2015-11-18 山东大学 Method for detecting P300 electroencephalogram based on convolutional neural network
CN105894045A (en) * 2016-05-06 2016-08-24 电子科技大学 Vehicle type recognition method with deep network model based on spatial pyramid pooling
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN106841216A (en) * 2017-02-28 2017-06-13 浙江工业大学 Tunnel defect automatic identification equipment based on panoramic picture CNN
CN107577737A (en) * 2017-08-25 2018-01-12 北京百度网讯科技有限公司 Method and apparatus for pushed information


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939207

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939207

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 27.09.2021)