CN113313721A - Real-time semantic segmentation method based on multi-scale structure - Google Patents
- Publication number
- CN113313721A (application No. CN202110867844.6A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G06T7/11—Region-based segmentation
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06T2207/10016—Video; Image sequence
- G06T2207/20076—Probabilistic image processing
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a real-time semantic segmentation method based on a multi-scale structure, which comprises: first, extracting the high-dimensional features of the semantic information branch; then establishing a context semantic branch and a spatial branch; and finally inputting the semantic features and spatial features into a feature fusion module for feature fusion and outputting a corresponding prediction map to realize the semantic segmentation task. The multiple convolutions embedded in the parallel semantic information branch of the invention aggregate the corresponding features of different stages and generate a strong global context feature representation at a low computational cost. Compared with BiSeNet, the method is faster with comparable performance: with a ResNet18 backbone on the Cityscapes dataset it reaches 195.7 on the FPS index, far exceeding BiSeNet in semantic segmentation inference speed, which amounts to computing about 45 more images of 512 × 1024 resolution per second.
Description
Technical Field
The invention belongs to the technical field of intelligent processing of image and video information, and particularly relates to a real-time semantic segmentation method based on a multi-scale structure.
Background
Semantic segmentation is one of the important visual tasks in the field of deep learning and a challenging technique for image understanding and scene analysis, with a very wide range of applications; the video-processing field in particular requires semantic segmentation networks to achieve fast inference (prediction) and real-time response from deep convolutional neural networks. Many excellent algorithms have emerged during the development of semantic segmentation, steadily raising the accuracy of segmentation models. Today, however, accuracy gains are slowing, and improving model inference speed makes models far more practical.
Existing real-time semantic segmentation algorithms optimize the convolutional neural network structure and adopt model-reduction methods such as model compression, knowledge distillation, and model pruning, usually following a strategy of trading accuracy for speed. Notably, improving the network structure is the most direct and feasible research direction for realizing the real-time semantic segmentation task.
In the classical real-time semantic segmentation network BiSeNet, the semantic branch adopts a U-shape-like cascade structure; however, restoring the high-dimensional features to the original size through this structure still introduces considerable computation and slows the inference speed of the whole model. The approach is limited by the fact that deep semantic feature maps have a high channel count, so operations such as convolution on them inevitably cause the computation to surge.
Chinese patent application No. 202011137108.7 discloses a real-time semantic segmentation method based on spatial-information guidance, which uses shallow spatial detail information to continuously guide deep global context features as they propagate to their neighborhoods, effectively reconstructing the spatial information lost in the global context features with a single-stream segmentation approach; the network is a typical encoder-decoder structure. The encoder encodes the input picture to obtain a more abstract, more semantic feature expression; in the decoder, a lightweight bidirectional network decodes the encoded features, with spatial detail information introduced as guidance during decoding. This patent adopts a single-stream segmentation scheme with a large parameter count and low speed, and is not suitable for rapidly processing large numbers of pictures.
Disclosure of Invention
The technical problem solved by the invention is to provide a real-time semantic segmentation method based on a multi-scale structure that has fewer parameters and a higher processing speed.
Technical scheme: to solve the above technical problem, the invention adopts the following technical scheme:
a real-time semantic segmentation method based on a multi-scale structure: high-dimensional features are extracted for the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and spatial features are input into a feature fusion module for feature fusion, finally outputting a corresponding prediction map to realize the semantic segmentation task.
The real-time semantic segmentation method based on the multi-scale structure specifically comprises the following steps:
step 1: first, extract the high-dimensional feature map of the semantic branch using a residual network;
step 2: construct the spatial branch: pass the down-sampled 1/4-size feature map through a pooling layer and combine it with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
step 3: construct the semantic branch: pass the feature maps corresponding to the 4 Bottlenecks of ResNet through convolution layers to normalize each to a 128-dimensional feature map in the channel dimension, then merge them and compress to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module;
step 4: perform feature fusion on the spatial features and semantic features obtained in step 2 and step 3, and finally output the corresponding prediction map to realize the semantic segmentation task.
Preferably, in step 1, high-dimensional features are extracted with a ResNet18 shallow convolutional neural network as the backbone model: semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size.
Preferably, in step 2, the spatial branch is generated by combining a selected layer of the ResNet feature extraction in the semantic branch with an up-sampling operation, serving as a feature map that supplements spatial detail information.
Preferably, in step 3 the semantic branch is constructed by applying 4 different types of convolution layers, combined with up-sampling, to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size.
Preferably, all feature maps are merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged map is reduced to 128 dimensions by a channel convolution for input to the feature fusion module.
Preferably, the 4 parallel convolution layers include standard convolution and dilated convolution to handle the large variation in receptive field across feature maps of different sizes, and the dilated convolution is used to enlarge semantic information while appropriately shrinking the feature map.
Beneficial effects: compared with the prior art, the invention has the following advantages:
The invention focuses on improving the real-time semantic segmentation network model. It constructs a new fast real-time semantic segmentation architecture called SPCCNet (Spatial and Parallel Context Combined Network), which starts from the existing bilateral segmentation network and offers a more advantageous segmentation inference speed, using a new context semantic branch and spatial branch structure. The semantic information branch is encoded stage by stage along the ResNet18 trunk and provides context information for the feature fusion module: features of earlier stages typically contain rich low-level details, while later stages provide high-level semantics. Multiple convolutions embedded in the semantic information branch aggregate the corresponding features of different stages and generate a strong global context feature representation at a low computational cost. The spatial branch consists of a pooling layer, an up-sampling operator, and a projection convolution layer; this concise component provides the network with more spatial details. The invention extracts spatial information in a two-stream manner and has the characteristics of a smaller parameter count and higher speed. Compared with BiSeNet, the algorithm achieves a higher speed with comparable performance: with a ResNet18 trunk on the Cityscapes dataset, the mIoU accuracy is 72.34% and the FPS is 195.7.
Drawings
FIG. 1 is the overall structure diagram of the SPCCNet network of the present invention;
FIG. 2 shows the semantic branch in SPCCNet of the present invention;
FIG. 3 shows the spatial branch in SPCCNet of the present invention.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are implemented on the premise of the technical scheme of the present invention; it should be understood that these examples are only intended to illustrate the present invention and not to limit its scope.
The real-time semantic segmentation method based on the multi-scale structure is developed on the BiSeNet real-time semantic segmentation network architecture. The model first uses a ResNet18 residual network to extract the high-dimensional feature map (512 × h × w) of the semantic branch. For the spatial branch, the down-sampled 1/4-size feature map is passed through a pooling layer and combined with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, one of the inputs to the feature fusion module. For the semantic branch, the feature maps corresponding to the 4 Bottlenecks of ResNet each pass through a convolution layer and are normalized to 128 channels, then merged and compressed to the channel dimension expected by the feature fusion module; after an up-sampling module they enter the Feature Fusion Module (FFM), which finally outputs the corresponding prediction map to realize the semantic segmentation task. The method specifically comprises the following steps:
Step 1, extracting high-dimensional features: first, the high-dimensional feature map (512 × h × w) of the semantic branch is extracted with a ResNet18 residual network;
The extraction of high-dimensional features takes a shallow convolutional neural network, ResNet18, as the backbone model: semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size, so the high-dimensional features are extracted at reasonable cost.
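The following is a minimal sketch, not the patent's own code, of extracting the four residual-stage feature maps from a torchvision ResNet18 backbone (the patent calls these stages Bottlenecks; torchvision's ResNet18 builds them from BasicBlocks). An H × W input yields a 512-channel map at 1/32 resolution, as stated above.

```python
# Hedged sketch: stage-wise feature extraction from a ResNet18 backbone.
import torch
import torchvision

class ResNet18Backbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        # layer1..layer4 are the four residual stages the patent refers to
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)              # 64 channels at 1/4 resolution
        for stage in self.stages:
            x = stage(x)
            feats.append(x)           # 64@1/4, 128@1/8, 256@1/16, 512@1/32
        return feats

feats = ResNet18Backbone()(torch.randn(1, 3, 512, 1024))
print([f.shape for f in feats])       # last map: [1, 512, 16, 32]
```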
Step 2, spatial branch improvement: the down-sampled 1/4-size feature map passes through a pooling layer and is combined with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
The main flow of the spatial branch improvement is as follows: in the spatial branch of the original BiSeNet network, one convolution layer with a large kernel followed by two 3 × 3 convolution layers extracts spatial information and encodes the detail of the original image, finally mapping the image to a 1/8-size feature map that supplements the spatial information missing from the semantic branch. This is a speed-up strategy that preserves accuracy, whereas other algorithms typically accelerate by reducing the input image resolution. Borrowing the idea of weight sharing, the present algorithm removes the three convolution layers used for extracting spatial information in the original network and instead generates the spatial branch by combining a selected layer of the ResNet feature extraction in the semantic branch (the penultimate stage) with an up-sampling operation, serving as the feature map that supplements spatial detail. On the one hand, because ResNet does not completely discard spatial information while extracting features, a reasonable up-sampling operation can produce the required feature map at a much smaller computational cost; at the same time, because this map is taken from a middle layer of ResNet, it still carries rich deep semantic information, which helps segmentation accuracy. In addition, to fuse more of the original spatial detail, the algorithm also obtains a shallow feature map at 1/8 of the original size from the output of the first Bottleneck in ResNet, using a computationally cheap pooling operation.
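A hedged PyTorch sketch of this spatial branch follows: the 1/4-size map from the first residual stage is pooled to 1/8 size, the 1/16-size map from the penultimate stage is up-sampled to 1/8 size, and the two are combined; the pooling type, channel widths, and the 1 × 1 projection are illustrative assumptions, not the patent's exact design.

```python
# Hedged sketch of the spatial branch: pool the 1/4 map, upsample the 1/16 map,
# concatenate at 1/8 resolution, and project to a common channel width.
import torch
import torch.nn.functional as F

class SpatialBranch(torch.nn.Module):
    def __init__(self, c_low=64, c_high=256, c_out=128):
        super().__init__()
        self.pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)   # 1/4 -> 1/8
        self.proj = torch.nn.Conv2d(c_low + c_high, c_out, kernel_size=1)

    def forward(self, f_quarter, f_sixteenth):
        low = self.pool(f_quarter)                                # shallow detail, 1/8
        high = F.interpolate(f_sixteenth, scale_factor=2,         # 1/16 -> 1/8
                             mode='bilinear', align_corners=False)
        return self.proj(torch.cat([low, high], dim=1))           # spatial features
```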
Step 3, semantic branch improvement: the feature maps corresponding to the 4 Bottlenecks of ResNet each pass through a convolution layer and are normalized to 128-dimensional feature maps in the channel dimension, then merged and compressed to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module.
The main flow of the semantic branch improvement is as follows: the cascade structure of the decoder stage is removed, and the ARM and refinement modules used for feature enhancement in the original network are discarded, leaving only the backbone structure of the ResNet residual network. Following the way semantic and spatial information change across the four ResNet convolution stages, 4 different types of convolution layers (kernel sizes 5 × 5, 3 × 3, 1 × 1, and 1 × 1, respectively) combined with up-sampling are applied to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size. In this process, the feature maps output at different stages contain information at different levels: the 64-channel shallow map processed by the dilated convolution contains rich spatial information plus a small amount of semantic information, and its spatial information is transferred into the channel dimension; the 512-dimensional map, after up-sampling and standard convolution, contains rich semantic information plus a small amount of spatial information, and its semantic information is transferred to the spatial scale.
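A minimal sketch of the four parallel convolutions follows, also previewing the Concat-and-compress step detailed in the next paragraph. The kernel sizes (5 × 5, 3 × 3, 1 × 1, 1 × 1) and the dilated convolution on the first stage follow the text; the strides, padding, and dilation rate used to reach the common 1/16 size are illustrative assumptions.

```python
# Hedged sketch: each residual-stage map is projected to 128 channels and
# brought to 1/16 of the input size, then concatenated and compressed.
import torch
import torch.nn.functional as F

class SemanticBranch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # stage 1: 64ch @1/4 -> dilated 5x5; stride 4 assumed to reach 1/16
        self.conv1 = torch.nn.Conv2d(64, 128, 5, stride=4, padding=4, dilation=2)
        # stage 2: 128ch @1/8 -> 3x3, stride 2
        self.conv2 = torch.nn.Conv2d(128, 128, 3, stride=2, padding=1)
        # stage 3: 256ch @1/16 -> 1x1
        self.conv3 = torch.nn.Conv2d(256, 128, 1)
        # stage 4: 512ch @1/32 -> 1x1, then 2x upsample
        self.conv4 = torch.nn.Conv2d(512, 128, 1)
        # compress the concatenated 4x128 channels back to 128 ("channel conv")
        self.compress = torch.nn.Conv2d(4 * 128, 128, 1)

    def forward(self, f1, f2, f3, f4):
        y1 = self.conv1(f1)
        y2 = self.conv2(f2)
        y3 = self.conv3(f3)
        y4 = F.interpolate(self.conv4(f4), scale_factor=2,
                           mode='bilinear', align_corners=False)
        merged = torch.cat([y1, y2, y3, y4], dim=1)   # Concat layer
        return self.compress(merged)                  # 128ch @ 1/16
```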
All feature maps are then merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information. A channel convolution then reduces the merged map to 128 dimensions for input to the feature fusion module; since the fusion module requires two feature maps of the same size, a final up-sampling operation is applied. The 4 parallel convolution layers include standard convolution and dilated convolution to handle the large receptive-field variation across feature maps of different sizes; the convolution operation and the receptive-field growth are defined as follows:
$$y[i,j]=\sum_{k=1}^{K} x\big[i+r\cdot k,\; j+r\cdot k\big]\cdot w[k], \qquad r_n=r_{n-1}+(k_n-1)\prod_{i=1}^{n-1}s_i$$

where y is the output of the dilated convolution and y[i, j] is its output value at abscissa i and ordinate j; r is the dilation rate of the dilated convolution (r = 1 degenerates to standard convolution); x is the input feature map; K is the convolution kernel size and k runs from 1 to K; w[k] is the corresponding convolution-kernel weight; n indexes the dilated convolution layers, r_n is the receptive-field size at the n-th convolution layer, k_n is the kernel size of the n-th convolution operation, and s_i is the stride of the i-th convolution layer. Dilated convolution is applied only to the feature map of the first ResNet stage; all other stages use standard convolution. Because the feature-extraction process of the first stage contains only a limited number of convolution operations, its feature map retains rich detail but still has a small receptive field. Verification experiments that propagated the loss mainly through this stage's feature map performed poorly in practice, because the shallower layers of the network are severely short of semantic information; the algorithm therefore uses dilated convolution to enlarge the semantic information while appropriately shrinking the feature map.
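A small worked example of the receptive-field recurrence above, where a dilated kernel of size k with rate r acts like an effective kernel of size r(k − 1) + 1; the layer configurations are illustrative.

```python
# Worked example: receptive-field growth r_n = r_{n-1} + (k_eff - 1) * jump.
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples."""
    r, jump = 1, 1                      # current RF and cumulative stride
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1         # effective kernel of a dilated conv
        r += (k_eff - 1) * jump
        jump *= s
    return r

# three 3x3 convs, stride 1: RF grows 3 -> 5 -> 7
print(receptive_field([(3, 1, 1)] * 3))                     # 7
# replacing the last conv with dilation 2 enlarges the RF at no extra cost
print(receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 2)]))   # 9
```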
Further, the indices used to evaluate segmentation quality in the model include mIoU, mAcc, and allAcc. mIoU (Mean Intersection over Union) is the most classical evaluation index: IoU is the ratio of the intersection to the union of the ground-truth set and the prediction set, and mIoU computes IoU for each prediction category and then averages:

$$\mathrm{mIoU}=\frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k}p_{ij}+\sum_{j=1}^{k}p_{ji}-p_{ii}}$$

The proportion of correctly classified pixels among all pixels is also used to evaluate the segmentation effect:

$$\mathrm{allAcc}=\frac{\sum_{i=1}^{k}p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k}p_{ij}}$$

An mIoU_back (mIoU with no background) index is also used, representing the mean intersection-over-union after removing background classes from the accuracy evaluation. In the above, k is the total number of prediction classes; p_ii is the number of pixels predicted as class i whose true class is also i, i.e., True Positives (TP); p_ij is the number predicted as class i whose true class is j, i.e., False Positives (FP); and p_ji is the number predicted as class j whose true class is i, i.e., False Negatives (FN).
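A numpy sketch of the mIoU and allAcc definitions above, following the text's convention that p[i, j] counts pixels predicted as class i whose true class is j; the confusion matrix shown is illustrative.

```python
# Metrics from a confusion matrix: IoU_i = TP / (TP + FP + FN) per class.
import numpy as np

def miou_allacc(p):
    """p: (k, k) confusion matrix with p[i, j] = #pixels predicted i, truth j."""
    tp = np.diag(p).astype(float)          # p_ii
    fp = p.sum(axis=1) - tp                # sum_j p_ij - p_ii
    fn = p.sum(axis=0) - tp                # sum_j p_ji - p_ii
    iou = tp / (tp + fp + fn)
    return iou.mean(), tp.sum() / p.sum()  # mIoU, allAcc

p = np.array([[50, 2, 1],
              [3, 40, 2],
              [1, 1, 30]])
miou, allacc = miou_allacc(p)
print(f"mIoU={miou:.4f}, allAcc={allacc:.4f}")
```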
Since this work focuses on the real-time semantic segmentation model, the segmentation inference speed of the model is also evaluated, using FPS (Frames Per Second), the number of frames predicted per second:

$$\mathrm{FPS}=\frac{N}{\sum_{i=1}^{N} t_i}$$

where N is the number of test pictures and t_i is the prediction time for segmenting the i-th picture.
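A minimal sketch of the FPS measurement defined above, assuming a `model` and a list of preprocessed `images` (both illustrative names); CUDA timing requires synchronization for accurate per-picture times.

```python
# FPS = N / sum(t_i): time each forward pass and average over the test set.
import time
import torch

@torch.no_grad()
def measure_fps(model, images):
    model.eval()
    total = 0.0
    for img in images:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        total += time.perf_counter() - t0   # t_i for picture i
    return len(images) / total              # N / sum(t_i)
```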
Step 4: the spatial features and semantic features obtained in step 2 and step 3 are fused, and the corresponding prediction map is finally output to realize the semantic segmentation task.
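The patent does not detail the internals of the feature fusion module; the sketch below assumes a BiSeNet-style FFM applied to the two same-size branch outputs (the semantic branch is up-sampled to match, as described above): the branches are concatenated, projected by a conv-BN-ReLU block, and reweighted channel-wise. Channel widths are assumptions.

```python
# Hedged sketch of a BiSeNet-style Feature Fusion Module (FFM).
import torch

class FeatureFusionModule(torch.nn.Module):
    def __init__(self, c_in=256, c_out=128):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(c_in, c_out, 1),
            torch.nn.BatchNorm2d(c_out),
            torch.nn.ReLU(inplace=True))
        self.attn = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Conv2d(c_out, c_out, 1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(c_out, c_out, 1),
            torch.nn.Sigmoid())

    def forward(self, spatial, semantic):
        x = self.conv(torch.cat([spatial, semantic], dim=1))
        w = self.attn(x)                 # channel attention weights
        return x + x * w                 # reweighted fusion
```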
To verify the effectiveness of the model, the same training strategy is adopted throughout the experiments to ensure fairness. SPCCNet extracts features with ResNet18 as the backbone network; the initial learning rate is base_lr = 1e-2 with power = 0.9, and the learning rate follows the poly policy base_lr × (1 − iter/total_iter)^power; weight_decay and momentum are set to 5e-4 and 0.9, respectively.
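A sketch of the poly learning-rate policy stated above; `total` uses the 80 epochs × 1000 iterations schedule described further below.

```python
# Poly schedule: lr = base_lr * (1 - iter/total_iter) ** power.
def poly_lr(iteration, total_iters, base_lr=1e-2, power=0.9):
    return base_lr * (1 - iteration / total_iters) ** power

total = 80 * 1000                    # 80 epochs x 1000 iterations per epoch
print(poly_lr(0, total))             # 0.01 at the start
print(poly_lr(total // 2, total))    # ~0.0054 halfway through
```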
Structurally, the algorithm uses the OHEM cross-entropy loss function to balance the uneven distribution of samples. The joint loss is defined as follows:

$$L=l_m+\alpha\sum_{i=1}^{A} l_i(X_i)$$

where L is the joint loss; l_m is the main loss function on the output prediction map; l_i is the auxiliary loss function of the i-th stage; X_i is the feature map of the i-th stage of the model; α balances the ratio of the main and auxiliary loss functions; and A is the number of auxiliary branches, with A = α = 1. The main loss l_m and the auxiliary losses l_i of the different stages are computed as follows:
$$l=-\frac{1}{N}\sum W_{k}\log\frac{\exp(p_{k})}{\sum_{j}\exp(p_{j})}$$

where N is the size of the training batch, W_k is the loss weight of the k-th class (the ground-truth class of the pixel), p_k is the score that the pixel belongs to class k, and the sum over j normalizes across all classes.
The auxiliary prediction branches are an effective training-enhancement strategy: adding them strengthens the network's learning of features at different scales and accelerates network convergence, while the auxiliary branches add nothing to the computation or inference speed of the prediction stage. In the actual training stage, simple auxiliary prediction branches consisting of three convolution layers are added at different positions, and α adjusts the proportion of the different branches to guide the training process of the network. As can be seen from the loss function, class weights W_k are applied to the different prediction classes to balance the differences introduced by the samples; in addition, when the total loss is computed, weights are preset for all categories and the loss is solved with a weighted-average strategy.
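A hedged sketch of the joint loss L = l_m + α Σ l_i using class-weighted cross entropy; a full OHEM variant would additionally keep only the hardest pixels per batch. The function names, the number of auxiliary heads, and the ignore_index value are illustrative assumptions.

```python
# Joint loss: class-weighted cross entropy on the main head plus alpha-scaled
# auxiliary heads; the auxiliary heads exist only during training.
import torch
import torch.nn.functional as F

def joint_loss(main_logits, aux_logits_list, target, class_weights, alpha=1.0):
    # main loss l_m on the output prediction map
    loss = F.cross_entropy(main_logits, target, weight=class_weights,
                           ignore_index=255)
    # auxiliary losses l_i on the intermediate-stage predictions
    for aux in aux_logits_list:
        loss = loss + alpha * F.cross_entropy(aux, target, weight=class_weights,
                                              ignore_index=255)
    return loss
```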
For data enhancement, images are randomly scaled by a factor from {0.75, 1, 1.25, 1.5, 1.75, 2.0}, randomly horizontally flipped with 50% probability, randomly rotated by −10 to 10 degrees, and perturbed with random Gaussian noise; in addition, since the resolution of Cityscapes pictures is very high, pictures are randomly cropped and finally normalized to 768 × 1536 to prevent memory overflow.
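A hedged torchvision sketch of this augmentation policy; the patent's actual transforms are not given, and a real segmentation pipeline must apply the same geometric operations to the label map (omitted here for brevity).

```python
# Illustrative augmentation: random scale, flip, rotation, crop, Gaussian noise.
import random
import torch
import torchvision.transforms as T

class TrainAugment:
    """Per-image augmentation; label-map handling is omitted."""
    def __call__(self, img):
        scale = random.choice([0.75, 1.0, 1.25, 1.5, 1.75, 2.0])  # random resize
        t = T.Compose([
            T.Resize(int(768 * scale)),
            T.RandomHorizontalFlip(p=0.5),       # 50% horizontal flip
            T.RandomRotation(degrees=10),        # rotate within [-10, 10] degrees
            T.RandomCrop((768, 1536), pad_if_needed=True),
            T.ToTensor(),
        ])
        x = t(img)
        return x + 0.01 * torch.randn_like(x)    # random Gaussian noise
```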
For training on the Cityscapes dataset, all 19 classes are trained; the number of training epochs is set to 80, with 1000 iterations per epoch at a batch size of batch_size = 8, ensuring that all training samples are used in every epoch.
For the speed test of the model, the segmentation task is run on 5000 pictures and an averaging strategy is taken to ensure the accuracy of the measured speed.
Besides the classical mIoU and allAcc semantic segmentation precision indicators, the experiments also report the mIoU_back (mIoU with no background) index, the mean intersection-over-union after removing background classes from the accuracy evaluation, and the FPS inference-speed index defined above.
TABLE 1 Comprehensive performance comparison of semantic segmentation models

Models | Backbone | #Param | mIoU (%) | allAcc (%) | FPS
---|---|---|---|---|---
BiSeNet | ResNet18 | 101 MB | 73.01 | 95.42 | 150.6
Ours | ResNet18 | 91 MB | 72.34 | 95.22 | 195.7
Table 1 compares the performance of the SPCCNet segmentation model (Ours), with its improved semantic and spatial branches, against the BiSeNet network. With ResNet18 as the backbone network, the SPCCNet model introduces fewer parameters and is more streamlined; at an input image resolution of 512 × 1024, it far exceeds BiSeNet on the semantic segmentation inference-speed index FPS, computing roughly 45 more images of 512 × 1024 resolution per second, a substantial efficiency gain.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A real-time semantic segmentation method based on a multi-scale structure, characterized in that: high-dimensional feature extraction is carried out on the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and spatial features are input into a feature fusion module for feature fusion, finally outputting a corresponding prediction map to realize the semantic segmentation task.
2. The method for real-time semantic segmentation based on multi-scale structures according to claim 1, characterized in that: the method comprises the following steps:
step 1: first, extract the high-dimensional feature map of the semantic branch using a residual network;
step 2: construct the spatial branch: pass the down-sampled 1/4-size feature map through a pooling layer and combine it with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
step 3: construct the semantic branch: pass the feature maps corresponding to the 4 Bottlenecks of ResNet through convolution layers to normalize each to a 128-dimensional feature map in the channel dimension, then merge them and compress to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module;
step 4: perform feature fusion on the spatial features and semantic features obtained in step 2 and step 3, and finally output the corresponding prediction map to realize the semantic segmentation task.
3. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 1, high-dimensional features are extracted with a ResNet18 shallow convolutional neural network as the backbone model; semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size.
4. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 2, the spatial branch is generated by combining a selected layer of the ResNet feature extraction in the semantic branch with an up-sampling operation, serving as a feature map that supplements spatial detail information.
5. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 3, the semantic branch is constructed by applying 4 different types of convolution layers, combined with up-sampling, to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size.
6. The method for real-time semantic segmentation based on multi-scale structures according to claim 5, characterized in that: all feature maps are merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged feature map is reduced to 128 dimensions by a channel convolution for input to the feature fusion module.
7. The method for real-time semantic segmentation based on multi-scale structures according to claim 5, characterized in that: the 4 parallel convolution layers include standard convolution and dilated convolution to handle the large variation in receptive field across feature maps of different sizes, and the dilated convolution is used to enlarge semantic information while appropriately shrinking the feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110867844.6A CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110867844.6A CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313721A true | 2021-08-27
CN113313721B | 2021-11-19
Family
ID=77382422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110867844.6A Active CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313721B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111127470A (en) * | 2019-12-24 | 2020-05-08 | 江西理工大学 | Image semantic segmentation method based on context and shallow space coding and decoding network |
US20200151497A1 (en) * | 2018-11-12 | 2020-05-14 | Sony Corporation | Semantic segmentation with soft cross-entropy loss |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151497A1 (en) * | 2018-11-12 | 2020-05-14 | Sony Corporation | Semantic segmentation with soft cross-entropy loss |
CN111127470A (en) * | 2019-12-24 | 2020-05-08 | 江西理工大学 | Image semantic segmentation method based on context and shallow space coding and decoding network |
Non-Patent Citations (3)
Title |
---|
MINGYUAN FAN et al.: "Rethinking BiSeNet For Real-time Semantic Segmentation", arXiv:2104.13188v1 *
REN Tianci et al.: "Semantic segmentation algorithm with a global bilateral network", Computer Science (计算机科学) *
QIN Feiwei et al.: "Real-time semantic segmentation method for scenes in autonomous driving", Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115995002A (en) * | 2023-03-24 | 2023-04-21 | 南京信息工程大学 | Network construction method and urban scene real-time semantic segmentation method |
CN115995002B (en) * | 2023-03-24 | 2023-06-16 | 南京信息工程大学 | Network construction method and urban scene real-time semantic segmentation method |
CN118397259A (en) * | 2024-03-04 | 2024-07-26 | 中国科学院空天信息创新研究院 | Semantic segmentation method, device, equipment and storage medium for SAR image |
Also Published As
Publication number | Publication date |
---|---|
CN113313721B (en) | 2021-11-19 |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant