CN111091130A - Real-time image semantic segmentation method and system based on lightweight convolutional neural network - Google Patents

Real-time image semantic segmentation method and system based on lightweight convolutional neural network Download PDF

Info

Publication number
CN111091130A
CN111091130A CN201911280783.2A CN201911280783A CN 111091130 A
Authority
CN
China
Prior art keywords
feature map
convolution
image
real
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911280783.2A
Other languages
Chinese (zh)
Inventor
周全
刘嘉
王杰
李圣华
强勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201911280783.2A priority Critical patent/CN111091130A/en
Publication of CN111091130A publication Critical patent/CN111091130A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a real-time image semantic segmentation method and system based on a lightweight convolutional neural network. The system comprises a downsampling unit, an upsampling unit and an extremely efficient residual module; the downsampling unit reduces the resolution and changes the number of channels; the upsampling unit increases the resolution and changes the number of channels; the extremely efficient residual module is composed of 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates, and is used for extracting features. The whole network architecture is an efficient asymmetric multi-branch encoder-decoder structure that uses no additional post-processing strategy and no pre-training model. Compared with the most advanced lightweight network models, the network architecture and segmentation method provided by the invention achieve the optimal balance between segmentation accuracy and implementation efficiency, and constitute an effective solution to the real-time image semantic segmentation task.

Description

Real-time image semantic segmentation method and system based on lightweight convolutional neural network
Technical Field
The invention belongs to the field of image semantic segmentation, and particularly relates to a real-time image semantic segmentation method based on a lightweight convolutional neural network.
Background
Semantic segmentation has long been an important area of computer vision, and with the rise of deep learning the semantic segmentation task has made great progress. As a core problem of computer vision, scene understanding is increasingly important, because more and more real-world applications need to infer knowledge or semantics from images (i.e., a process from the concrete to the abstract); these applications include autonomous driving, human-computer interaction, computational photography, image search engines, augmented reality, and so on. Image semantic segmentation based on deep learning is one of the best solutions: it is robust enough when analyzing complex environments, divides the captured image into several regions, and identifies the class (object) of each pixel, so it can be regarded as pixel-level classification. Unlike the object detection and image classification tasks, image semantic segmentation not only identifies the object classes in an image but also finds the locations of the objects; in addition, it provides accurate object boundary information. Especially in the field of autonomous driving, stable and reliable analysis of the surrounding scene is crucial for a safe driving environment.
In recent years, convolutional neural networks, particularly fully convolutional networks and encoder-decoder networks, have become the mainstream approach to semantic segmentation. While dramatic advances have been made by designing deeper and larger networks (e.g., VGGNet and ResNet), such networks typically consume large amounts of resources and are not suitable for mobile devices such as cell phones, robots and unmanned aircraft, which have limited memory and insufficient computing power. In order to meet real-world scenarios that require real-time prediction and decision-making, recent research tends to construct lightweight networks with shallow architectures, and the design ideas of such networks can be roughly divided into three categories. (1) Network-compression-based methods remove the redundancy of a pre-trained model through pruning techniques to improve efficiency. (2) Low-bit-based methods, such as XNOR-Net, use quantization techniques to improve efficiency, where the learned model weights are represented by a few bits rather than high-precision floating-point numbers. Unlike network compression methods, these models generally do not change the network structure, but their segmentation performance is generally poor. (3) Methods based on lightweight convolutional neural networks use computationally inexpensive operations to reduce the model size and improve efficiency. For example, ShuffleNet and MobileNet use depthwise separable convolutions to save computation: a standard convolution is decomposed into a depthwise convolution and a 1x1 pointwise convolution; convolution is first performed channel by channel, and the 1x1 pointwise convolution then learns a linear combination of the input channels to recover the channel dependencies, making the network more efficient (a parameter-count sketch is given below). ERFNet decomposes a 2D convolution (e.g., 3x3) into two 1D factorized convolutions (e.g., 3x1 and 1x3), effectively reducing the network parameters but noticeably weakening the feature extraction capability. Yet another way to make a network lighter is to use group convolution, where the input channels and the convolution kernels are divided into groups accordingly and each group is convolved independently. Although these approaches all achieve good results, most previous networks tend to adopt shallow architectures in order to reduce model complexity, which weakens the expressive power over the data and degrades segmentation performance. Therefore, pursuing the optimal balance between segmentation accuracy and implementation efficiency remains an open research problem for the real-time image semantic segmentation task, and is also the problem to be solved in current lightweight network design.
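As a rough illustration of the parameter savings described above (not taken from the patent; the layer sizes are chosen purely for illustration), the following Python sketch compares the weight count of a standard 3x3 convolution with that of its depthwise + 1x1 pointwise factorization:

```python
# Hypothetical illustration: parameter counts (ignoring biases) for a standard
# 3x3 convolution versus a depthwise separable convolution (depthwise 3x3
# followed by a 1x1 pointwise convolution).
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    return c_in * c_out * k * k

def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    depthwise = c_in * k * k          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1x1 convolution recombines the channels
    return depthwise + pointwise

# Example with 128 input and 128 output channels:
#   standard 3x3:            128 * 128 * 9       = 147456 parameters
#   depthwise separable 3x3: 128 * 9 + 128 * 128 =  17536 parameters (~8.4x fewer)
print(standard_conv_params(128, 128), depthwise_separable_params(128, 128))
```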
Disclosure of Invention
Purpose of the invention: in view of the above-mentioned drawbacks and deficiencies of the prior art, an object of the present invention is to provide a lightweight convolutional neural network architecture for real-time image semantic segmentation that achieves an optimal balance between segmentation accuracy and efficiency.
Summary of the invention: the invention relates to a real-time image semantic segmentation method based on a lightweight convolutional neural network, which comprises the following steps:
(1) preprocessing an input original image to obtain a down-sampled image, which serves as the input image of an encoder;
(2) performing 2x downsampling on the down-sampled image with a downsampling unit to obtain a first feature map with 16 channels;
(3) performing 2x downsampling on the first feature map with a downsampling unit to obtain a second feature map with 64 channels;
(4) performing repeated convolution operations on the second feature map with an extremely efficient residual module to obtain a third feature map; the extremely efficient residual module is composed of 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates;
(5) performing 2x downsampling on the third feature map with a downsampling unit to obtain a fourth feature map with 128 channels;
(6) performing repeated convolution operations on the fourth feature map with the extremely efficient residual module to obtain a fifth feature map, which is the output of the encoder;
(7) performing 2x upsampling on the fifth feature map with an upsampling unit to obtain a sixth feature map with 64 channels;
(8) performing a convolution operation on the third feature map with an extremely efficient residual module to obtain a first branch feature map with 64 channels, and fusing the information of the first branch feature map with that of the sixth feature map to form a new seventh feature map;
(9) performing repeated convolution operations on the seventh feature map with the extremely efficient residual module to obtain an eighth feature map with 64 channels;
(10) performing 2x upsampling on the eighth feature map with an upsampling unit to obtain a ninth feature map with 16 channels;
(11) performing 2x upsampling on the third feature map with an upsampling unit to obtain a second branch feature map with 16 channels;
(12) performing a convolution operation on the second branch feature map with an extremely efficient residual module to obtain a new second branch feature map with 16 channels, and fusing the information of the new second branch feature map with that of the ninth feature map to form a tenth feature map;
(13) performing repeated convolution operations on the tenth feature map with an extremely efficient residual module to obtain an eleventh feature map with 16 channels;
(14) performing 2x upsampling on the eleventh feature map with an upsampling unit and mapping the upsampled feature map to the segmentation classes, obtaining a feature map whose number of channels equals the number of segmentation classes C; this feature map is the output of the decoder and serves as the final segmentation result map of the whole encoder-decoder network, and its resolution is consistent with the input image of the encoder.
Further, the image preprocessing in step (1) is as follows: the original image is scaled to half of its original size, the scaled image is flipped horizontally, then randomly translated, and an image half the size of the original image is cropped from the translated image.
Further, the resolution and the number of feature channels of the third feature map in the step (4) are the same as those of the second feature map.
Further, the upsampling unit in step (7) is formed by sequentially stacking a deconvolution layer, an activation layer and a batch normalization layer.
Further, the upsampling unit in step (14) is composed of a deconvolution layer.
The invention also provides a real-time image semantic segmentation system based on a lightweight convolutional neural network, which comprises a downsampling unit, an upsampling unit and an extremely efficient residual module; the downsampling unit reduces the resolution and changes the number of channels; the upsampling unit increases the resolution and changes the number of channels; the extremely efficient residual module is composed of 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates, and is used for extracting features; the system also includes two skip-connection branches that fuse deep and shallow features.
Advantageous effects: compared with the prior art, the beneficial effects of the invention are as follows. The efficient asymmetric network architecture is composed of a series of Extremely Efficient Residual Modules (EERMs). On the one hand, the EERM adopts widely used 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates in its residual layers; the 1D factorized convolution and the depthwise separable convolution effectively reduce the computational complexity of the network, so that the network keeps very few model parameters and achieves a fast inference speed, while the dilated convolutions enlarge the receptive field (the size of the region on the original image to which each pixel of the feature map output by a layer of the convolutional neural network is mapped) without increasing the computational burden, thereby improving the effectiveness of feature extraction. On the other hand, the whole network adopts a deeper structure and adds two skip-connection branches that fuse deep and shallow features and collect more context information, which further improves the feature representation capability of the network and ensures the optimal balance between accuracy and efficiency.
Drawings
FIG. 1 is a diagram of a real-time image semantic segmentation system architecture based on a lightweight convolutional neural network according to the present invention;
FIG. 2 is a diagram of EERM versus other residual modules;
FIG. 3 is a graph comparing the qualitative segmentation results of the present invention and several lightweight networks on the Cityscapes benchmark;
FIG. 4 is a diagram of the qualitative segmentation results of the system and method of the present invention on the CamVid benchmark.
Detailed Description
In order to make the objects, technical solutions and innovations of the present invention more clear, the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 shows that the real-time image semantic segmentation system based on the lightweight convolutional neural network provided by the invention is composed of three basic components: a downsampling unit, an upsampling unit and an extremely efficient residual module. We refer to the entire framework in FIG. 1 as FDDWNet, a lightweight convolutional neural network for real-time image semantic segmentation.
The network structure of the system is a high-efficiency asymmetric multi-branch coding and decoding architecture, namely, an encoder extracts features and gradually reduces the image resolution, a corresponding decoder performs up-sampling on a deep low-resolution feature map to match the input image resolution, restores the spatial information of the image and maps the spatial information to segmentation categories, and finally generates a semantic segmentation result map with the same resolution as the input image of the encoder.
The core component of the whole network structure is a novel residual module, the extremely efficient residual module (EERM), which adopts widely used 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates. The 1D factorized convolution and the depthwise separable convolution effectively reduce the computational complexity of the network, so that the network keeps very few model parameters and achieves a fast inference speed, while the dilated convolutions enlarge the receptive field (the size of the region on the original image to which each pixel of the feature map output by a layer of the convolutional neural network is mapped) without increasing the computational burden, thereby improving the effectiveness of feature extraction. Moreover, the whole network adopts a deeper structure and adds two skip-connection branches that fuse deep and shallow features and collect more context information, which further improves the feature representation capability of the network and ensures the best compromise between accuracy and efficiency.
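To make the description above concrete, the following is a minimal sketch of how an EERM-style block could be assembled from the three named ingredients (1D factorized convolution, depthwise separable convolution, and dilated convolution with rate r). The exact layer ordering, the placement of batch normalization and activation, and other internal details of the patented EERM are not specified here and are assumptions; the sketch only preserves the stated properties that the block keeps the resolution and channel count unchanged and ends with a residual addition:

```python
# A hypothetical EERM-style residual block, NOT the exact patented module.
import torch
import torch.nn as nn

class EERM(nn.Module):
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        c, r = channels, dilation
        self.body = nn.Sequential(
            # 1D-factorized depthwise convolutions: 3x1 then 1x3, one filter per channel
            nn.Conv2d(c, c, (3, 1), padding=(1, 0), groups=c, bias=False),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1), groups=c, bias=False),
            nn.BatchNorm2d(c),
            nn.ReLU(inplace=True),
            # dilated 1D-factorized depthwise convolutions with dilation rate r
            nn.Conv2d(c, c, (3, 1), padding=(r, 0), dilation=(r, 1), groups=c, bias=False),
            nn.Conv2d(c, c, (1, 3), padding=(0, r), dilation=(1, r), groups=c, bias=False),
            nn.BatchNorm2d(c),
            nn.ReLU(inplace=True),
            # 1x1 pointwise convolution restores the mixing between channels
            nn.Conv2d(c, c, kernel_size=1, bias=False),
            nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual (skip) connection: output keeps the input resolution and channel count
        return self.relu(x + self.body(x))
```

Because every layer in such a block preserves the spatial size and the channel count, blocks of this kind can be stacked freely, which is what the encoder and decoder stages described below do.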
By stacking the three basic components, the efficient asymmetric coding and decoding network architecture is constructed. The encoder is used for extracting features to obtain a low-resolution deep semantic feature map, the decoder gradually upsamples the low-resolution deep semantic feature map to the size of an input image of the encoder, and finally a C-dimensional segmentation prediction result map is output (C is the total number of classes contained in the training data set, and generally comprises a foreground class and a background class).
The overall network framework according to fig. 1 can efficiently extract semantic information of deep layers of images, and can be trained end to generate semantic segmentation results with the same input resolution. Compared with the recent mainstream lightweight network, the designed network architecture realizes the optimal balance between the segmentation accuracy and the implementation efficiency.
With reference to fig. 1, the invention provides a real-time image semantic segmentation method based on a lightweight convolutional neural network, which specifically includes the following steps:
S1, preprocessing the input original image to obtain a down-sampled image with half the resolution of the original image, which serves as the input image of the encoder. Specifically, the preprocessing first scales the original image to half of its original size, then flips the scaled image horizontally, applies a random translation (0-2 pixels), and finally crops an image half the size of the original image from the translated image; this down-sampled image serves as the input image of the encoder.
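A minimal sketch of this preprocessing, assuming a PIL image as input; the translation range of 0-2 pixels follows the text, while the interpolation mode, the flip probability of 0.5 and the zero padding of out-of-bounds pixels are assumptions (the same geometric transform would also have to be applied to the label map):

```python
import random
from PIL import Image

def preprocess(img: Image.Image) -> Image.Image:
    w, h = img.size
    # 1) scale the original image to half of its size
    img = img.resize((w // 2, h // 2), Image.BILINEAR)
    # 2) random horizontal flip
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    # 3) random translation by 0-2 pixels, implemented here by shifting the crop window;
    # 4) crop an image of half the original size (PIL pads out-of-bounds pixels with zeros)
    dx, dy = random.randint(0, 2), random.randint(0, 2)
    return img.crop((dx, dy, dx + w // 2, dy + h // 2))
```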
S2, performing 2x downsampling on the input image of the encoder in S1 with a Downsampling Unit. The Downsampling Unit is composed of two parallel branches: one branch uses 3x3 convolution kernels with stride 2, and the number of convolution kernels is 16-3=13, so the output feature map of this branch has 13 channels; the other branch applies max-pooling (Max-Pooling) to the input feature map of the Downsampling Unit, which has 3 channels. The feature maps obtained by the two branches are then concatenated along the channel dimension, i.e., channel stacking, and output as the result of the Downsampling Unit, yielding a first feature map with 16 channels (a sketch is given below);
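A sketch of such a downsampling unit in the spirit of the ENet initial block, matching the description above: a strided 3x3 convolution branch with (output channels - input channels) kernels runs in parallel with a max-pooling branch, and the two results are concatenated along the channel dimension. The batch normalization and ReLU after the concatenation are assumptions:

```python
import torch
import torch.nn as nn

class DownsamplingUnit(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # e.g. in_ch=3, out_ch=16: the convolution branch produces 16-3=13 channels
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = torch.cat([self.conv(x), self.pool(x)], dim=1)  # channel stacking
        return self.relu(self.bn(x))

# Halves the resolution and raises the channel count, e.g. 3 -> 16 channels:
out = DownsamplingUnit(3, 16)(torch.randn(1, 3, 512, 1024))
print(out.shape)  # torch.Size([1, 16, 256, 512])
```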
S3, performing 2x downsampling on the first feature map obtained in S2 with a Downsampling Unit to obtain a second feature map with 64 channels;
S4, convolving the second feature map obtained in S3 with an extremely efficient residual module (EERM, n=3). The convolution operation is repeated 5 times with the same dilation rate r=1 each time and 64 convolution kernels in each convolution, finally obtaining a third feature map with 64 channels, whose resolution and number of feature channels are the same as those of the second feature map;
S5, performing 2x downsampling on the third feature map obtained in S4 with a Downsampling Unit to obtain a fourth feature map with 128 channels;
S6, convolving the fourth feature map obtained in S5 with an extremely efficient residual module (EERM, n=3). The convolution operation is repeated 16 times with a different dilation rate each time, r being set to 1, 2, 5, 9, 1, 2, 5, 9, 2, 5, 9, 17, 2, 5, 9 and 17 in sequence, and with 128 convolution kernels in each convolution, finally obtaining a fifth feature map with 128 channels, i.e., the output of the encoder;
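Using the hypothetical EERM sketch given earlier, the encoder stage of S6 can be written as a plain stack of 16 modules following the dilation-rate schedule listed above; the schedule and the channel width of 128 come from the text, everything else is an assumption:

```python
import torch.nn as nn

rates = [1, 2, 5, 9, 1, 2, 5, 9, 2, 5, 9, 17, 2, 5, 9, 17]   # 16 EERMs, as in S6
encoder_stage = nn.Sequential(*[EERM(128, dilation=r) for r in rates])
# fourth feature map (N, 128, H/8, W/8) -> fifth feature map with the same shape
```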
S7, performing 2x upsampling on the fifth feature map obtained in S6 with an Upsampling Unit to obtain a sixth feature map with 64 channels. Specifically, the upsampling unit is formed by sequentially stacking a deconvolution layer, an activation layer (ReLU) and a batch normalization layer (BN);
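A sketch of this upsampling unit (a transposed convolution followed by ReLU and batch normalization, in the order stated above); the kernel size, stride, padding and output padding are assumptions chosen so that the spatial resolution is exactly doubled:

```python
import torch.nn as nn

class UpsamplingUnit(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                               padding=1, output_padding=1, bias=False),  # deconvolution layer
            nn.ReLU(inplace=True),    # activation layer
            nn.BatchNorm2d(out_ch),   # batch normalization layer
        )

    def forward(self, x):
        return self.block(x)  # e.g. (N, 128, 64, 128) -> (N, 64, 128, 256)
```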
S8, performing a convolution operation on the third feature map obtained in S4 with an extremely efficient residual module (EERM, n=3, r=1) to obtain a first branch feature map with 64 channels, and fusing the information of the first branch feature map with that of the sixth feature map obtained in S7 to form a new seventh feature map;
S9, convolving the seventh feature map obtained in S8 with an extremely efficient residual module (EERM, n=3, r=1). The convolution operation is repeated twice with 64 convolution kernels in each convolution, finally obtaining an eighth feature map with 64 channels, whose resolution and number of feature channels are the same as those of the seventh feature map;
S10, performing 2x upsampling on the eighth feature map obtained in S9 with an Upsampling Unit to obtain a ninth feature map with 16 channels. Specifically, the upsampling unit is formed by sequentially stacking a deconvolution layer, an activation layer and a batch normalization layer;
S11, performing 2x upsampling on the third feature map obtained in S4 with an Upsampling Unit to obtain a second branch feature map with 16 channels. Specifically, the upsampling unit is formed by sequentially stacking a deconvolution layer, an activation layer and a batch normalization layer;
S12, performing a convolution operation on the second branch feature map obtained in S11 with an extremely efficient residual module (EERM, n=3, r=1) to obtain a new second branch feature map with 16 channels, and fusing the information of the new second branch feature map with that of the ninth feature map obtained in S10 to form a tenth feature map;
S13, convolving the tenth feature map obtained in S12 with an extremely efficient residual module (EERM, n=3, r=1). The convolution operation is repeated twice with 16 convolution kernels in each convolution, finally obtaining an eleventh feature map with 16 channels, whose resolution and number of feature channels are the same as those of the tenth feature map;
S14, performing 2x upsampling on the eleventh feature map obtained in S13 with an Upsampling Unit and mapping the upsampled feature map to the segmentation classes, obtaining a feature map whose number of channels equals the number of segmentation classes C, i.e., the output of the decoder; this is the final segmentation result map of the entire encoder-decoder network, and its resolution is consistent with the input image of the encoder. It should be noted that this last upsampling unit consists of a deconvolution layer only, without an activation layer or a batch normalization layer.
A specific note: the above steps S1-S14 are illustrated in the overall network structure diagram of FIG. 1, and the steps and FIG. 1 can be cross-checked against each other. It should be noted that the overall network designed by the invention is tested on two common image semantic segmentation benchmarks, Cityscapes and CamVid, so the feature-map sizes labeled in FIG. 1 (width x height x number of channels) may change accordingly for different data sets; FIG. 1 takes the Cityscapes data set as an example.
Fig. 2 is a comparison diagram of the core module EERM of the entire network with other residual modules, as follows:
fig. 2(a) is a common building block in the residual network ResNet, called the bottleneck residual block (Bottleneck Block), which is mainly used by ENet. Its main branch is a three-layer structure of 1x1, nxn and 1x1 convolutional layers, where the two 1x1 convolutional layers (1x1 convolution is also commonly called pointwise convolution) are used to reduce and then restore the channel dimension, with activation functions and batch normalization between the convolutional layers. The side branch adopts a shortcut connection and is added pixel by pixel to the output of the main branch. Such a building block is called a bottleneck residual block because the main branch is hourglass- or bottleneck-shaped. The advantage of this residual module is that the 1x1 layers reduce and restore the dimensionality, which reduces the model parameters and builds a more compact structure, so the depth of the network can be further increased while keeping the computation low;
fig. 2(b) is the core module of MobileNet, the depthwise separable convolution. A standard convolution is decomposed into a depthwise convolution (Depth-wise Convolution) and a 1x1 pointwise convolution (Point-wise Convolution): convolution is first performed channel by channel, and the 1x1 pointwise convolution then learns a linear combination of the input channels to recover the channel dependencies, which greatly reduces the network parameters and improves implementation efficiency;
fig. 2(c) is the one-dimensional non-bottleneck residual block (Non-bottleneck-1D), in which an ordinary 3x3 convolution is split into 3x1 and 1x3 convolutions by the principle of convolution factorization. Its characteristic is that the module parameters are greatly reduced, especially when a deep network is built from this residual module. The underlying reason for this design is that a deep analysis of how residual blocks are used reveals a large amount of redundant channel information in the network, which provides a basis for compressing the network parameters; this design therefore splits an ordinary convolution by the factorization principle, reducing the 2D convolution kernel to 1D convolution kernels. Although the computational burden is reduced, the local receptive field of the convolution kernel is limited and the feature representation capability is insufficient;
FIG. 2(d) is the core module of ShuffleNet, the Channel Shuffle module Channel Shuffle. Because a large number of 1x1 convolutions consume a lot of computing resources, the ShuffleNet proposes Point-wise group convolution (Point-wise group convolution) to help reduce the computing complexity and averagely divide the channels of the features into different groups; however, the use of point-by-point group convolution has an amplitude effect, and Channel Shuffle (Channel Shuffle) is proposed to help information circulation. Thus, on the premise of given computational complexity budget, ShuffleNet allows more feature mapping channels to be used, which is beneficial to coding more information on a small network;
fig. 2(e) is the core component of the network architecture designed by the invention, referred to as the EERM. The EERM adopts widely used 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates in its residual layers. The 1D factorized convolution and the depthwise separable convolution effectively reduce the computational complexity of the network, so that the network keeps very few model parameters and achieves a fast inference speed, while the dilated convolutions enlarge the receptive field (i.e., the size of the region on the original image to which each pixel of the feature map output by a layer of the convolutional neural network is mapped) without increasing the computational burden, thereby improving the effectiveness of feature extraction. Supplementary explanation: the dilated convolution, also commonly called atrous or hole convolution, is intended to enlarge the receptive field of the neurons. Note that when the dilation rate is 1, a dilated convolution is no different from an ordinary convolution; when the dilation rate is greater than 1, the dilated convolution introduces intervals between the values of an ordinary convolution kernel, i.e., inserts (dilation rate - 1) zeros between two adjacent kernel weights, so that under the same computational complexity the dilated convolution provides a larger receptive field, as illustrated below.
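The following small sketch only illustrates the receptive-field arithmetic of this supplementary explanation (standard convolution arithmetic, not specific to the patent): with kernel size k and dilation rate r, the effective kernel extent is k + (k - 1)(r - 1), so the dilation rates used in the encoder enlarge the window seen by a 3x3 kernel without adding any parameters:

```python
import torch.nn as nn

def effective_kernel(k: int, r: int) -> int:
    return k + (k - 1) * (r - 1)   # e.g. k=3: r=1 -> 3, r=2 -> 5, r=9 -> 19, r=17 -> 35

for r in (1, 2, 5, 9, 17):
    # padding=r keeps the output the same size as the input for a 3x3 kernel
    conv = nn.Conv2d(128, 128, kernel_size=3, dilation=r, padding=r)
    print(f"r={r}: a 3x3 kernel covers a {effective_kernel(3, r)}x{effective_kernel(3, r)} window")
```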
FIG. 3 compares the qualitative segmentation results of the designed network with those of several lightweight networks on the Cityscapes benchmark. In order to verify the accuracy and implementation efficiency of the network designed by the invention, the model is trained, evaluated and tested on the widely used Cityscapes data set. Cityscapes contains a finely annotated subset (gtFine, ground truth fine annotation), whose training/validation/test sets contain 2975/500/1525 images respectively, and a coarsely annotated subset (gtCoarse, ground truth coarse annotation) containing 20k coarsely annotated images. The number of segmentation classes is set to 20 during training, namely 19 target classes and 1 background class. After training, the segmentation results output by several lightweight networks are compared qualitatively; 6 state-of-the-art lightweight networks are selected as baselines in FIG. 3, including DABNet, DSNet, Fast-SCNN, ESPNetv2, ERFNet and CGNet. To evaluate segmentation performance, the evaluation metric is the standard Jaccard index, also commonly referred to as the PASCAL VOC intersection-over-union (IoU) metric, defined by the following formula:
IoU = TP / (TP + FP + FN)
wherein TP, FP and FN respectively denote the numbers of true-positive, false-positive and false-negative pixels determined over the whole Cityscapes test set. When trained with only the fine annotations, the FDDWNet model designed by the invention achieves a class IoU of 71.5% and a category IoU of 88.2% on the test set, obtains the best score in 13 of the 19 categories, and runs at 60 FPS measured with an input picture size of 1024x512. Experimental results show that the designed efficient asymmetric network achieves the optimal balance between segmentation accuracy and implementation efficiency, and its performance surpasses many advanced models to a great extent. It can be seen from FIG. 3 that FDDWNet performs best in terms of segmentation accuracy compared with DABNet, DSNet, Fast-SCNN, ESPNetv2, ERFNet and CGNet. Compared with DSNet (0.91M, 37 FPS, 69.3%), the invention (0.80M, 60 FPS, 71.5%) not only has a smaller model and about 2% higher accuracy but also runs 1.6 times faster; although Fast-SCNN (1.11M, 124 FPS, 68.0%) runs twice as fast, its model size is 1.4 times that of FDDWNet and its accuracy is about 3% lower.
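A minimal sketch of how such per-class IoU values can be computed from pixel-wise predictions and ground-truth label maps; NumPy arrays of class indices are an assumed input format, and the small epsilon that avoids division by zero is likewise an assumption:

```python
import numpy as np

def class_iou(pred: np.ndarray, gt: np.ndarray, cls: int) -> float:
    tp = np.sum((pred == cls) & (gt == cls))   # true positives
    fp = np.sum((pred == cls) & (gt != cls))   # false positives
    fn = np.sum((pred != cls) & (gt == cls))   # false negatives
    return float(tp / (tp + fp + fn + 1e-10))

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    # classes absent from both prediction and ground truth would normally be
    # excluded from the average; omitted here for brevity
    return float(np.mean([class_iou(pred, gt, c) for c in range(num_classes)]))
```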
FIG. 4 shows the qualitative segmentation results of the network designed by the invention on the CamVid benchmark. Its class IoU on the CamVid test set is 66.9%, and a speed of 79 FPS was measured at an input picture size of 480x360. Experimental results show that the FDDWNet designed by the invention not only classifies objects of different scales correctly at the pixel level, but also produces consistent qualitative results for all classes. Both the quantitative and the qualitative comparisons fully demonstrate the superiority of the proposed lightweight convolutional neural network in the real-time image semantic segmentation task.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the specific embodiments, and modifications and equivalents within the scope of the claims may be made by those skilled in the art and are included in the scope of the present invention.

Claims (7)

1. A real-time image semantic segmentation method based on a lightweight convolutional neural network is characterized by comprising the following steps:
(1) preprocessing an input original image to obtain a down-sampled image, which serves as the input image of an encoder;
(2) performing 2x downsampling on the down-sampled image with a downsampling unit to obtain a first feature map with 16 channels;
(3) performing 2x downsampling on the first feature map with a downsampling unit to obtain a second feature map with 64 channels;
(4) performing repeated convolution operations on the second feature map with an extremely efficient residual module to obtain a third feature map; the extremely efficient residual module is composed of 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates;
(5) performing 2x downsampling on the third feature map with a downsampling unit to obtain a fourth feature map with 128 channels;
(6) performing repeated convolution operations on the fourth feature map with the extremely efficient residual module to obtain a fifth feature map, which is the output of the encoder;
(7) performing 2x upsampling on the fifth feature map with an upsampling unit to obtain a sixth feature map with 64 channels;
(8) performing a convolution operation on the third feature map with an extremely efficient residual module to obtain a first branch feature map with 64 channels, and fusing the information of the first branch feature map with that of the sixth feature map to form a new seventh feature map;
(9) performing repeated convolution operations on the seventh feature map with the extremely efficient residual module to obtain an eighth feature map with 64 channels;
(10) performing 2x upsampling on the eighth feature map with an upsampling unit to obtain a ninth feature map with 16 channels;
(11) performing 2x upsampling on the third feature map with an upsampling unit to obtain a second branch feature map with 16 channels;
(12) performing a convolution operation on the second branch feature map with an extremely efficient residual module to obtain a new second branch feature map with 16 channels, and fusing the information of the new second branch feature map with that of the ninth feature map to form a tenth feature map;
(13) performing repeated convolution operations on the tenth feature map with an extremely efficient residual module to obtain an eleventh feature map with 16 channels;
(14) performing 2x upsampling on the eleventh feature map with an upsampling unit and mapping the upsampled feature map to the segmentation classes, obtaining a feature map whose number of channels equals the number of segmentation classes C; this feature map is the output of the decoder and serves as the final segmentation result map of the whole encoder-decoder network, and its resolution is consistent with the input image of the encoder.
2. The method for semantically segmenting the real-time image based on the lightweight convolutional neural network as claimed in claim 1, wherein the image preprocessing in step (1) is as follows: the original image is scaled to half of its original size, the scaled image is flipped horizontally, then randomly translated, and an image half the size of the original image is cropped from the translated image.
3. The method for semantically segmenting the real-time image based on the lightweight convolutional neural network as claimed in claim 1, wherein the resolution and the number of feature channels of the third feature map in the step (4) are the same as those of the second feature map.
4. The method for real-time image semantic segmentation based on the lightweight convolutional neural network as claimed in claim 1, wherein the upsampling unit in step (7) is formed by sequentially stacking a deconvolution layer, an activation layer and a batch normalization layer.
5. The method for semantically segmenting the real-time image based on the lightweight convolutional neural network as claimed in claim 1, wherein the upsampling unit of step (14) is composed of a deconvolution layer.
6. A real-time image semantic segmentation system based on a lightweight convolutional neural network using the method of claim 1, comprising a downsampling unit, an upsampling unit and an extremely efficient residual module; the downsampling unit reduces the resolution and changes the number of channels; the upsampling unit increases the resolution and changes the number of channels; the extremely efficient residual module is composed of 1D factorized convolutions, efficient depthwise separable convolutions and dilated convolutions with different dilation rates, and is used for extracting features.
7. The system for real-time image semantic segmentation based on the lightweight convolutional neural network as claimed in claim 6, further comprising two skip-connection branches for fusing deep and shallow features.
CN201911280783.2A 2019-12-13 2019-12-13 Real-time image semantic segmentation method and system based on lightweight convolutional neural network Pending CN111091130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911280783.2A CN111091130A (en) 2019-12-13 2019-12-13 Real-time image semantic segmentation method and system based on lightweight convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911280783.2A CN111091130A (en) 2019-12-13 2019-12-13 Real-time image semantic segmentation method and system based on lightweight convolutional neural network

Publications (1)

Publication Number Publication Date
CN111091130A true CN111091130A (en) 2020-05-01

Family

ID=70395074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911280783.2A Pending CN111091130A (en) 2019-12-13 2019-12-13 Real-time image semantic segmentation method and system based on lightweight convolutional neural network

Country Status (1)

Country Link
CN (1) CN111091130A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111612008A (en) * 2020-05-21 2020-09-01 苏州大学 Image segmentation method based on convolution network
CN112001214A (en) * 2020-05-18 2020-11-27 天津大学 Land classification method based on high-resolution remote sensing image
CN112164035A (en) * 2020-09-15 2021-01-01 郑州金惠计算机系统工程有限公司 Image-based defect detection method and device, electronic equipment and storage medium
CN112381846A (en) * 2020-12-11 2021-02-19 江南大学 Ultrasonic thyroid nodule segmentation method based on asymmetric network
CN112529064A (en) * 2020-12-03 2021-03-19 燕山大学 Efficient real-time semantic segmentation method
CN112634279A (en) * 2020-12-02 2021-04-09 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
CN112651421A (en) * 2020-09-04 2021-04-13 江苏濠汉信息技术有限公司 Infrared thermal imaging power transmission line external damage prevention monitoring system and modeling method thereof
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism
CN113066089A (en) * 2021-04-06 2021-07-02 南京邮电大学 Real-time image semantic segmentation network based on attention guide mechanism
CN113158984A (en) * 2021-05-18 2021-07-23 石家庄铁道大学 Bearing fault diagnosis method based on complex Morlet wavelet and lightweight convolution network
CN113192018A (en) * 2021-04-23 2021-07-30 北京化工大学 Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network
CN113393476A (en) * 2021-07-07 2021-09-14 山东大学 Lightweight multi-path mesh image segmentation method and system and electronic equipment
CN113486898A (en) * 2021-07-08 2021-10-08 西安电子科技大学 Radar signal RD image interference identification method and system based on improved ShuffleNet
US20210366080A1 (en) * 2020-05-20 2021-11-25 Industry-Academic Cooperation Foundation, Yonsei University Method for optimizing hardware structure of convolutional neural networks
WO2022133874A1 (en) * 2020-12-24 2022-06-30 京东方科技集团股份有限公司 Image processing method and device and computer-readable storage medium
WO2023084484A1 (en) * 2021-11-15 2023-05-19 Ramot At Tel-Aviv University Ltd. Dynamic vision sensor color camera
CN116188509A (en) * 2023-04-23 2023-05-30 电子科技大学 High-efficiency three-dimensional image segmentation method
CN117521742A (en) * 2023-10-12 2024-02-06 汕头大学 Lightweight deployment image processing method based on deep neural network model
CN117521742B (en) * 2023-10-12 2024-04-30 汕头大学 Lightweight deployment image processing method based on deep neural network model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask Scene Semantics based on new neural network understand model and its application
CN108053027A (en) * 2017-12-18 2018-05-18 中山大学 A kind of method and device for accelerating deep neural network
CN109523558A (en) * 2018-10-16 2019-03-26 清华大学 A kind of portrait dividing method and system
WO2019210431A1 (en) * 2018-05-03 2019-11-07 The Governing Council Of The University Of Toronto Method and system for optimizing depth imaging

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask Scene Semantics based on new neural network understand model and its application
CN108053027A (en) * 2017-12-18 2018-05-18 中山大学 A kind of method and device for accelerating deep neural network
WO2019210431A1 (en) * 2018-05-03 2019-11-07 The Governing Council Of The University Of Toronto Method and system for optimizing depth imaging
CN109523558A (en) * 2018-10-16 2019-03-26 清华大学 A kind of portrait dividing method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIA LIU ET AL.: "FDDWNET: A LIGHTWEIGHT CONVOLUTIONAL NEURAL NETWORK FOR REAL-TIME SEMANTIC SEGMENTATION", 《COMPUTER VISION AND PATTERN RECOGNITION (CS.CV)》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001214A (en) * 2020-05-18 2020-11-27 天津大学 Land classification method based on high-resolution remote sensing image
US20210366080A1 (en) * 2020-05-20 2021-11-25 Industry-Academic Cooperation Foundation, Yonsei University Method for optimizing hardware structure of convolutional neural networks
US11587203B2 (en) * 2020-05-20 2023-02-21 Industry-Academic Cooperation Foundation, Yonsei University Method for optimizing hardware structure of convolutional neural networks
CN111612008B (en) * 2020-05-21 2024-02-09 苏州大学 Image segmentation method based on convolution network
CN111612008A (en) * 2020-05-21 2020-09-01 苏州大学 Image segmentation method based on convolution network
CN112651421A (en) * 2020-09-04 2021-04-13 江苏濠汉信息技术有限公司 Infrared thermal imaging power transmission line external damage prevention monitoring system and modeling method thereof
CN112164035A (en) * 2020-09-15 2021-01-01 郑州金惠计算机系统工程有限公司 Image-based defect detection method and device, electronic equipment and storage medium
CN112634279A (en) * 2020-12-02 2021-04-09 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
CN112529064A (en) * 2020-12-03 2021-03-19 燕山大学 Efficient real-time semantic segmentation method
CN112381846B (en) * 2020-12-11 2024-03-08 江南大学 Ultrasonic thyroid nodule segmentation method based on asymmetric network
CN112381846A (en) * 2020-12-11 2021-02-19 江南大学 Ultrasonic thyroid nodule segmentation method based on asymmetric network
WO2022133874A1 (en) * 2020-12-24 2022-06-30 京东方科技集团股份有限公司 Image processing method and device and computer-readable storage medium
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism
CN113066089A (en) * 2021-04-06 2021-07-02 南京邮电大学 Real-time image semantic segmentation network based on attention guide mechanism
CN113192018A (en) * 2021-04-23 2021-07-30 北京化工大学 Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network
CN113192018B (en) * 2021-04-23 2023-11-24 北京化工大学 Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network
CN113158984B (en) * 2021-05-18 2022-06-17 石家庄铁道大学 Bearing fault diagnosis method based on complex Morlet wavelet and lightweight convolution network
CN113158984A (en) * 2021-05-18 2021-07-23 石家庄铁道大学 Bearing fault diagnosis method based on complex Morlet wavelet and lightweight convolution network
CN113393476B (en) * 2021-07-07 2022-03-11 山东大学 Lightweight multi-path mesh image segmentation method and system and electronic equipment
CN113393476A (en) * 2021-07-07 2021-09-14 山东大学 Lightweight multi-path mesh image segmentation method and system and electronic equipment
CN113486898A (en) * 2021-07-08 2021-10-08 西安电子科技大学 Radar signal RD image interference identification method and system based on improved ShuffleNet
WO2023084484A1 (en) * 2021-11-15 2023-05-19 Ramot At Tel-Aviv University Ltd. Dynamic vision sensor color camera
CN116188509A (en) * 2023-04-23 2023-05-30 电子科技大学 High-efficiency three-dimensional image segmentation method
CN117521742A (en) * 2023-10-12 2024-02-06 汕头大学 Lightweight deployment image processing method based on deep neural network model
CN117521742B (en) * 2023-10-12 2024-04-30 汕头大学 Lightweight deployment image processing method based on deep neural network model

Similar Documents

Publication Publication Date Title
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN110188768B (en) Real-time image semantic segmentation method and system
Liu et al. FDDWNet: a lightweight convolutional neural network for real-time semantic segmentation
CN111626300B (en) Image segmentation method and modeling method of image semantic segmentation model based on context perception
CN108171701B (en) Significance detection method based on U network and counterstudy
CN110084274B (en) Real-time image semantic segmentation method and system, readable storage medium and terminal
CN112329800A (en) Salient object detection method based on global information guiding residual attention
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN110222718B (en) Image processing method and device
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
Liu et al. Image retrieval using CNN and low-level feature fusion for crime scene investigation image database
Cheng et al. A survey on image semantic segmentation using deep learning techniques
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
Obeso et al. Introduction of explicit visual saliency in training of deep cnns: Application to architectural styles classification
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN112529064B (en) Efficient real-time semantic segmentation method
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200501