CN113313721A - Real-time semantic segmentation method based on multi-scale structure - Google Patents
- Publication number
- CN113313721A (application No. CN202110867844.6A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G06T7/11—Region-based segmentation
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06T2207/10016—Video; Image sequence
- G06T2207/20076—Probabilistic image processing
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a real-time semantic segmentation method based on a multi-scale structure, which comprises: first, extracting the high-dimensional features of the semantic information branch; then establishing a context semantic branch and a spatial branch; and finally inputting the semantic features and spatial features into a feature fusion module for feature fusion and outputting a corresponding prediction map to realize the semantic segmentation task. The multiple convolutions embedded in the parallel semantic information branch of the invention aggregate the corresponding features of different stages and generate a strong global context feature representation at a low computational cost. Compared with BiSeNet, the method is faster with comparable performance: with a ResNet18 backbone on the Cityscapes dataset it reaches 195.7 on the FPS index, far exceeding BiSeNet in semantic segmentation inference speed, which amounts to computing about 45 more images of 512 × 1024 resolution per second.
Description
Technical Field
The invention belongs to the technical field of intelligent processing of image and video information, and particularly relates to a real-time semantic segmentation method based on a multi-scale structure.
Background
Semantic segmentation is one of the important visual tasks in the field of deep learning and a challenging technique for image understanding and scene analysis, with a very wide range of applications; the video-processing field in particular requires semantic segmentation networks to achieve fast inference (prediction) and real-time response from deep convolutional neural networks. Many excellent algorithms have emerged during the development of semantic segmentation, steadily raising the accuracy of segmentation models. Today, however, accuracy gains are slowing, and improving model inference speed makes models far more practical.
Existing real-time semantic segmentation algorithms optimize the convolutional neural network structure and adopt model-reduction methods such as model compression, knowledge distillation, and model pruning, usually following a strategy of trading accuracy for speed. Notably, improving the network structure is the most direct and feasible research direction for realizing the real-time semantic segmentation task.
In the classical real-time semantic segmentation network BiSeNet, the semantic branch adopts a U-shape-like cascade structure; however, restoring the high-dimensional features to the original size through this structure still introduces considerable computation and slows the inference speed of the whole model. The approach is limited by the fact that deep semantic feature maps have a high channel count, so operations such as convolution on them inevitably cause the computation to surge.
Chinese patent application No. 202011137108.7 discloses a real-time semantic segmentation method based on spatial-information guidance, which uses shallow spatial detail information to continuously guide deep global context features as they propagate to their neighborhoods, effectively reconstructing the spatial information lost in the global context features with a single-stream segmentation approach; the network is a typical encoder-decoder structure. The encoder encodes the input picture to obtain a more abstract, more semantic feature expression; in the decoder, a lightweight bidirectional network decodes the encoded features, with spatial detail information introduced as guidance during decoding. This patent adopts a single-stream segmentation scheme with a large parameter count and low speed, and is not suitable for rapidly processing large numbers of pictures.
Disclosure of Invention
The technical problem solved by the invention is to provide a real-time semantic segmentation method based on a multi-scale structure that has fewer parameters and a higher processing speed.
Technical scheme: to solve the above technical problem, the invention adopts the following technical scheme:
a real-time semantic segmentation method based on a multi-scale structure: high-dimensional features are extracted for the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and spatial features are input into a feature fusion module for feature fusion, finally outputting a corresponding prediction map to realize the semantic segmentation task.
The real-time semantic segmentation method based on the multi-scale structure specifically comprises the following steps:
step 1: first, extract the high-dimensional feature map of the semantic branch using a residual network;
step 2: construct the spatial branch: pass the down-sampled 1/4-size feature map through a pooling layer and combine it with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
step 3: construct the semantic branch: pass the feature maps corresponding to the 4 Bottlenecks of ResNet through convolution layers to normalize each to a 128-dimensional feature map in the channel dimension, then merge them and compress to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module;
step 4: perform feature fusion on the spatial features and semantic features obtained in step 2 and step 3, and finally output the corresponding prediction map to realize the semantic segmentation task.
Preferably, in step 1, high-dimensional features are extracted with a ResNet18 shallow convolutional neural network as the backbone model: semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size.
Preferably, in step 2, the spatial branch is generated by combining a selected layer of the ResNet feature extraction in the semantic branch with an up-sampling operation, serving as a feature map that supplements spatial detail information.
Preferably, in step 3 the semantic branch is constructed by applying 4 different types of convolution layers, combined with up-sampling, to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size.
Preferably, all feature maps are merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged map is reduced to 128 dimensions by a channel convolution for input to the feature fusion module.
Preferably, the 4 parallel convolution layers include standard convolution and dilated convolution to handle the large variation in receptive field across feature maps of different sizes, and the dilated convolution is used to enlarge semantic information while appropriately shrinking the feature map.
Beneficial effects: compared with the prior art, the invention has the following advantages:
The invention focuses on improving the real-time semantic segmentation network model. It constructs a new fast real-time semantic segmentation architecture called SPCCNet (Spatial and Parallel Context Combined Network), which starts from the existing bilateral segmentation network and offers a more advantageous segmentation inference speed, using a new context semantic branch and spatial branch structure. The semantic information branch is encoded stage by stage along the ResNet18 trunk and provides context information for the feature fusion module: features of earlier stages typically contain rich low-level details, while later stages provide high-level semantics. Multiple convolutions embedded in the semantic information branch aggregate the corresponding features of different stages and generate a strong global context feature representation at a low computational cost. The spatial branch consists of a pooling layer, an up-sampling operator, and a projection convolution layer; this concise component provides the network with more spatial details. The invention extracts spatial information in a two-stream manner and has the characteristics of a smaller parameter count and higher speed. Compared with BiSeNet, the algorithm achieves a higher speed with comparable performance: with a ResNet18 trunk on the Cityscapes dataset, the mIoU accuracy is 72.34% and the FPS is 195.7.
Drawings
FIG. 1 is the overall structure diagram of the SPCCNet network of the present invention;
FIG. 2 shows the semantic branch in SPCCNet of the present invention;
FIG. 3 shows the spatial branch in SPCCNet of the present invention.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are implemented on the premise of the technical scheme of the present invention; it should be understood that these examples are only intended to illustrate the present invention and not to limit its scope.
The real-time semantic segmentation method based on the multi-scale structure is developed on the BiSeNet real-time semantic segmentation network architecture. The model first uses a ResNet18 residual network to extract the high-dimensional feature map (512 × h × w) of the semantic branch. For the spatial branch, the down-sampled 1/4-size feature map is passed through a pooling layer and combined with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, one of the inputs to the feature fusion module. For the semantic branch, the feature maps corresponding to the 4 Bottlenecks of ResNet each pass through a convolution layer and are normalized to 128 channels, then merged and compressed to the channel dimension expected by the feature fusion module; after an up-sampling module they enter the Feature Fusion Module (FFM), which finally outputs the corresponding prediction map to realize the semantic segmentation task. The method specifically comprises the following steps:
Step 1, extracting high-dimensional features: first, the high-dimensional feature map (512 × h × w) of the semantic branch is extracted with a ResNet18 residual network;
The extraction of high-dimensional features takes a shallow convolutional neural network, ResNet18, as the backbone model: semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size, so the high-dimensional features are extracted at reasonable cost.
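The following is a minimal sketch, not the patent's own code, of extracting the four residual-stage feature maps from a torchvision ResNet18 backbone (the patent calls these stages Bottlenecks; torchvision's ResNet18 builds them from BasicBlocks). An H × W input yields a 512-channel map at 1/32 resolution, as stated above.

```python
# Hedged sketch: stage-wise feature extraction from a ResNet18 backbone.
import torch
import torchvision

class ResNet18Backbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        # layer1..layer4 are the four residual stages the patent refers to
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)              # 64 channels at 1/4 resolution
        for stage in self.stages:
            x = stage(x)
            feats.append(x)           # 64@1/4, 128@1/8, 256@1/16, 512@1/32
        return feats

feats = ResNet18Backbone()(torch.randn(1, 3, 512, 1024))
print([f.shape for f in feats])       # last map: [1, 512, 16, 32]
```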
Step 2, spatial branch improvement: the down-sampled 1/4-size feature map passes through a pooling layer and is combined with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
The main flow of the spatial branch improvement is as follows: in the spatial branch of the original BiSeNet network, one convolution layer with a large kernel followed by two 3 × 3 convolution layers extracts spatial information and encodes the detail of the original image, finally mapping the image to a 1/8-size feature map that supplements the spatial information missing from the semantic branch. This is a speed-up strategy that preserves accuracy, whereas other algorithms typically accelerate by reducing the input image resolution. Borrowing the idea of weight sharing, the present algorithm removes the three convolution layers used for extracting spatial information in the original network and instead generates the spatial branch by combining a selected layer of the ResNet feature extraction in the semantic branch (the penultimate stage) with an up-sampling operation, serving as the feature map that supplements spatial detail. On the one hand, because ResNet does not completely discard spatial information while extracting features, a reasonable up-sampling operation can produce the required feature map at a much smaller computational cost; at the same time, because this map is taken from a middle layer of ResNet, it still carries rich deep semantic information, which helps segmentation accuracy. In addition, to fuse more of the original spatial detail, the algorithm also obtains a shallow feature map at 1/8 of the original size from the output of the first Bottleneck in ResNet, using a computationally cheap pooling operation.
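A hedged PyTorch sketch of this spatial branch follows: the 1/4-size map from the first residual stage is pooled to 1/8 size, the 1/16-size map from the penultimate stage is up-sampled to 1/8 size, and the two are combined; the pooling type, channel widths, and the 1 × 1 projection are illustrative assumptions, not the patent's exact design.

```python
# Hedged sketch of the spatial branch: pool the 1/4 map, upsample the 1/16 map,
# concatenate at 1/8 resolution, and project to a common channel width.
import torch
import torch.nn.functional as F

class SpatialBranch(torch.nn.Module):
    def __init__(self, c_low=64, c_high=256, c_out=128):
        super().__init__()
        self.pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)   # 1/4 -> 1/8
        self.proj = torch.nn.Conv2d(c_low + c_high, c_out, kernel_size=1)

    def forward(self, f_quarter, f_sixteenth):
        low = self.pool(f_quarter)                                # shallow detail, 1/8
        high = F.interpolate(f_sixteenth, scale_factor=2,         # 1/16 -> 1/8
                             mode='bilinear', align_corners=False)
        return self.proj(torch.cat([low, high], dim=1))           # spatial features
```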
Step 3, semantic branch improvement: the feature maps corresponding to the 4 Bottlenecks of ResNet each pass through a convolution layer and are normalized to 128-dimensional feature maps in the channel dimension, then merged and compressed to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module.
The main flow of the semantic branch improvement is as follows: the cascade structure of the decoder stage is removed, and the ARM and refinement modules used for feature enhancement in the original network are discarded, leaving only the backbone structure of the ResNet residual network. Following the way semantic and spatial information change across the four ResNet convolution stages, 4 different types of convolution layers (kernel sizes 5 × 5, 3 × 3, 1 × 1, and 1 × 1, respectively) combined with up-sampling are applied to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size. In this process, the feature maps output at different stages contain information at different levels: the 64-channel shallow map processed by the dilated convolution contains rich spatial information plus a small amount of semantic information, and its spatial information is transferred into the channel dimension; the 512-dimensional map, after up-sampling and standard convolution, contains rich semantic information plus a small amount of spatial information, and its semantic information is transferred to the spatial scale.
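A minimal sketch of the four parallel convolutions follows, also previewing the Concat-and-compress step detailed in the next paragraph. The kernel sizes (5 × 5, 3 × 3, 1 × 1, 1 × 1) and the dilated convolution on the first stage follow the text; the strides, padding, and dilation rate used to reach the common 1/16 size are illustrative assumptions.

```python
# Hedged sketch: each residual-stage map is projected to 128 channels and
# brought to 1/16 of the input size, then concatenated and compressed.
import torch
import torch.nn.functional as F

class SemanticBranch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # stage 1: 64ch @1/4 -> dilated 5x5; stride 4 assumed to reach 1/16
        self.conv1 = torch.nn.Conv2d(64, 128, 5, stride=4, padding=4, dilation=2)
        # stage 2: 128ch @1/8 -> 3x3, stride 2
        self.conv2 = torch.nn.Conv2d(128, 128, 3, stride=2, padding=1)
        # stage 3: 256ch @1/16 -> 1x1
        self.conv3 = torch.nn.Conv2d(256, 128, 1)
        # stage 4: 512ch @1/32 -> 1x1, then 2x upsample
        self.conv4 = torch.nn.Conv2d(512, 128, 1)
        # compress the concatenated 4x128 channels back to 128 ("channel conv")
        self.compress = torch.nn.Conv2d(4 * 128, 128, 1)

    def forward(self, f1, f2, f3, f4):
        y1 = self.conv1(f1)
        y2 = self.conv2(f2)
        y3 = self.conv3(f3)
        y4 = F.interpolate(self.conv4(f4), scale_factor=2,
                           mode='bilinear', align_corners=False)
        merged = torch.cat([y1, y2, y3, y4], dim=1)   # Concat layer
        return self.compress(merged)                  # 128ch @ 1/16
```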
All feature maps are then merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information. A channel convolution then reduces the merged map to 128 dimensions for input to the feature fusion module; since the fusion module requires two feature maps of the same size, a final up-sampling operation is applied. The 4 parallel convolution layers include standard convolution and dilated convolution to handle the large receptive-field variation across feature maps of different sizes; the convolution operation and the receptive-field growth are defined as follows:
$$y[i,j]=\sum_{k=1}^{K} x\big[i+r\cdot k,\; j+r\cdot k\big]\cdot w[k], \qquad r_n=r_{n-1}+(k_n-1)\prod_{i=1}^{n-1}s_i$$

where y is the output of the dilated convolution and y[i, j] is its output value at abscissa i and ordinate j; r is the dilation rate of the dilated convolution (r = 1 degenerates to standard convolution); x is the input feature map; K is the convolution kernel size and k runs from 1 to K; w[k] is the corresponding convolution-kernel weight; n indexes the dilated convolution layers, r_n is the receptive-field size at the n-th convolution layer, k_n is the kernel size of the n-th convolution operation, and s_i is the stride of the i-th convolution layer. Dilated convolution is applied only to the feature map of the first ResNet stage; all other stages use standard convolution. Because the feature-extraction process of the first stage contains only a limited number of convolution operations, its feature map retains rich detail but still has a small receptive field. Verification experiments that propagated the loss mainly through this stage's feature map performed poorly in practice, because the shallower layers of the network are severely short of semantic information; the algorithm therefore uses dilated convolution to enlarge the semantic information while appropriately shrinking the feature map.
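A small worked example of the receptive-field recurrence above, where a dilated kernel of size k with rate r acts like an effective kernel of size r(k − 1) + 1; the layer configurations are illustrative.

```python
# Worked example: receptive-field growth r_n = r_{n-1} + (k_eff - 1) * jump.
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples."""
    r, jump = 1, 1                      # current RF and cumulative stride
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1         # effective kernel of a dilated conv
        r += (k_eff - 1) * jump
        jump *= s
    return r

# three 3x3 convs, stride 1: RF grows 3 -> 5 -> 7
print(receptive_field([(3, 1, 1)] * 3))                     # 7
# replacing the last conv with dilation 2 enlarges the RF at no extra cost
print(receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 2)]))   # 9
```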
Further, the indices used to evaluate segmentation quality in the model include mIoU, mAcc, and allAcc. mIoU (Mean Intersection over Union) is the most classical evaluation index: IoU is the ratio of the intersection to the union of the ground-truth set and the prediction set, and mIoU computes IoU for each prediction category and then averages:

$$\mathrm{mIoU}=\frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k}p_{ij}+\sum_{j=1}^{k}p_{ji}-p_{ii}}$$

The proportion of correctly classified pixels among all pixels is also used to evaluate the segmentation effect:

$$\mathrm{allAcc}=\frac{\sum_{i=1}^{k}p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k}p_{ij}}$$

An mIoU_back (mIoU with no background) index is also used, representing the mean intersection-over-union after removing background classes from the accuracy evaluation. In the above, k is the total number of prediction classes; p_ii is the number of pixels predicted as class i whose true class is also i, i.e., True Positives (TP); p_ij is the number predicted as class i whose true class is j, i.e., False Positives (FP); and p_ji is the number predicted as class j whose true class is i, i.e., False Negatives (FN).
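A numpy sketch of the mIoU and allAcc definitions above, following the text's convention that p[i, j] counts pixels predicted as class i whose true class is j; the confusion matrix shown is illustrative.

```python
# Metrics from a confusion matrix: IoU_i = TP / (TP + FP + FN) per class.
import numpy as np

def miou_allacc(p):
    """p: (k, k) confusion matrix with p[i, j] = #pixels predicted i, truth j."""
    tp = np.diag(p).astype(float)          # p_ii
    fp = p.sum(axis=1) - tp                # sum_j p_ij - p_ii
    fn = p.sum(axis=0) - tp                # sum_j p_ji - p_ii
    iou = tp / (tp + fp + fn)
    return iou.mean(), tp.sum() / p.sum()  # mIoU, allAcc

p = np.array([[50, 2, 1],
              [3, 40, 2],
              [1, 1, 30]])
miou, allacc = miou_allacc(p)
print(f"mIoU={miou:.4f}, allAcc={allacc:.4f}")
```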
Since this work focuses on the real-time semantic segmentation model, the segmentation inference speed of the model is also evaluated, using FPS (Frames Per Second), the number of frames predicted per second:

$$\mathrm{FPS}=\frac{N}{\sum_{i=1}^{N} t_i}$$

where N is the number of test pictures and t_i is the prediction time for segmenting the i-th picture.
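A minimal sketch of the FPS measurement defined above, assuming a `model` and a list of preprocessed `images` (both illustrative names); CUDA timing requires synchronization for accurate per-picture times.

```python
# FPS = N / sum(t_i): time each forward pass and average over the test set.
import time
import torch

@torch.no_grad()
def measure_fps(model, images):
    model.eval()
    total = 0.0
    for img in images:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        total += time.perf_counter() - t0   # t_i for picture i
    return len(images) / total              # N / sum(t_i)
```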
Step 4: the spatial features and semantic features obtained in step 2 and step 3 are fused, and the corresponding prediction map is finally output to realize the semantic segmentation task.
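The patent does not detail the internals of the feature fusion module; the sketch below assumes a BiSeNet-style FFM applied to the two same-size branch outputs (the semantic branch is up-sampled to match, as described above): the branches are concatenated, projected by a conv-BN-ReLU block, and reweighted channel-wise. Channel widths are assumptions.

```python
# Hedged sketch of a BiSeNet-style Feature Fusion Module (FFM).
import torch

class FeatureFusionModule(torch.nn.Module):
    def __init__(self, c_in=256, c_out=128):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(c_in, c_out, 1),
            torch.nn.BatchNorm2d(c_out),
            torch.nn.ReLU(inplace=True))
        self.attn = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Conv2d(c_out, c_out, 1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(c_out, c_out, 1),
            torch.nn.Sigmoid())

    def forward(self, spatial, semantic):
        x = self.conv(torch.cat([spatial, semantic], dim=1))
        w = self.attn(x)                 # channel attention weights
        return x + x * w                 # reweighted fusion
```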
To verify the effectiveness of the model, the same training strategy is adopted throughout the experiments to ensure fairness. SPCCNet extracts features with ResNet18 as the backbone network; the initial learning rate is base_lr = 1e-2 with power = 0.9, and the learning rate follows the poly policy base_lr × (1 − iter/total_iter)^power; weight_decay and momentum are set to 5e-4 and 0.9, respectively.
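A sketch of the poly learning-rate policy stated above; `total` uses the 80 epochs × 1000 iterations schedule described further below.

```python
# Poly schedule: lr = base_lr * (1 - iter/total_iter) ** power.
def poly_lr(iteration, total_iters, base_lr=1e-2, power=0.9):
    return base_lr * (1 - iteration / total_iters) ** power

total = 80 * 1000                    # 80 epochs x 1000 iterations per epoch
print(poly_lr(0, total))             # 0.01 at the start
print(poly_lr(total // 2, total))    # ~0.0054 halfway through
```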
Structurally, the algorithm uses the OHEM cross-entropy loss function to balance the uneven distribution of samples. The joint loss is defined as follows:

$$L=l_m+\alpha\sum_{i=1}^{A} l_i(X_i)$$

where L is the joint loss; l_m is the main loss function on the output prediction map; l_i is the auxiliary loss function of the i-th stage; X_i is the feature map of the i-th stage of the model; α balances the ratio of the main and auxiliary loss functions; and A is the number of auxiliary branches, with A = α = 1. The main loss l_m and the auxiliary losses l_i of the different stages are computed as follows:
$$l=-\frac{1}{N}\sum W_{k}\log\frac{\exp(p_{k})}{\sum_{j}\exp(p_{j})}$$

where N is the size of the training batch, W_k is the loss weight of the k-th class (the ground-truth class of the pixel), p_k is the score that the pixel belongs to class k, and the sum over j normalizes across all classes.
The auxiliary prediction branches are an effective training-enhancement strategy: adding them strengthens the network's learning of features at different scales and accelerates network convergence, while the auxiliary branches add nothing to the computation or inference speed of the prediction stage. In the actual training stage, simple auxiliary prediction branches consisting of three convolution layers are added at different positions, and α adjusts the proportion of the different branches to guide the training process of the network. As can be seen from the loss function, class weights W_k are applied to the different prediction classes to balance the differences introduced by the samples; in addition, when the total loss is computed, weights are preset for all categories and the loss is solved with a weighted-average strategy.
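A hedged sketch of the joint loss L = l_m + α Σ l_i using class-weighted cross entropy; a full OHEM variant would additionally keep only the hardest pixels per batch. The function names, the number of auxiliary heads, and the ignore_index value are illustrative assumptions.

```python
# Joint loss: class-weighted cross entropy on the main head plus alpha-scaled
# auxiliary heads; the auxiliary heads exist only during training.
import torch
import torch.nn.functional as F

def joint_loss(main_logits, aux_logits_list, target, class_weights, alpha=1.0):
    # main loss l_m on the output prediction map
    loss = F.cross_entropy(main_logits, target, weight=class_weights,
                           ignore_index=255)
    # auxiliary losses l_i on the intermediate-stage predictions
    for aux in aux_logits_list:
        loss = loss + alpha * F.cross_entropy(aux, target, weight=class_weights,
                                              ignore_index=255)
    return loss
```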
For data enhancement, images are randomly scaled by a factor from {0.75, 1, 1.25, 1.5, 1.75, 2.0}, randomly horizontally flipped with 50% probability, randomly rotated by −10 to 10 degrees, and perturbed with random Gaussian noise; in addition, since the resolution of Cityscapes pictures is very high, pictures are randomly cropped and finally normalized to 768 × 1536 to prevent memory overflow.
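A hedged torchvision sketch of this augmentation policy; the patent's actual transforms are not given, and a real segmentation pipeline must apply the same geometric operations to the label map (omitted here for brevity).

```python
# Illustrative augmentation: random scale, flip, rotation, crop, Gaussian noise.
import random
import torch
import torchvision.transforms as T

class TrainAugment:
    """Per-image augmentation; label-map handling is omitted."""
    def __call__(self, img):
        scale = random.choice([0.75, 1.0, 1.25, 1.5, 1.75, 2.0])  # random resize
        t = T.Compose([
            T.Resize(int(768 * scale)),
            T.RandomHorizontalFlip(p=0.5),       # 50% horizontal flip
            T.RandomRotation(degrees=10),        # rotate within [-10, 10] degrees
            T.RandomCrop((768, 1536), pad_if_needed=True),
            T.ToTensor(),
        ])
        x = t(img)
        return x + 0.01 * torch.randn_like(x)    # random Gaussian noise
```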
For training on the Cityscapes dataset, all 19 classes are trained; the number of training epochs is set to 80, with 1000 iterations per epoch at a batch size of batch_size = 8, ensuring that all training samples are used in every epoch.
For the speed test of the model, the segmentation task is run on 5000 pictures and an averaging strategy is taken to ensure the accuracy of the measured speed.
Besides the classical mIoU and allAcc semantic segmentation precision indicators, the experiments also report the mIoU_back (mIoU with no background) index, the mean intersection-over-union after removing background classes from the accuracy evaluation, and the FPS inference-speed index defined above.
TABLE 1 Comprehensive performance comparison of semantic segmentation models

Models | Backbone | #Param | mIoU (%) | allAcc (%) | FPS
---|---|---|---|---|---
BiSeNet | ResNet18 | 101 MB | 73.01 | 95.42 | 150.6
Ours | ResNet18 | 91 MB | 72.34 | 95.22 | 195.7
Table 1 compares the performance of the SPCCNet segmentation model (Ours), with its improved semantic and spatial branches, against the BiSeNet network. With ResNet18 as the backbone network, the SPCCNet model introduces fewer parameters and is more streamlined; at an input image resolution of 512 × 1024, it far exceeds BiSeNet on the semantic segmentation inference-speed index FPS, computing roughly 45 more images of 512 × 1024 resolution per second, a substantial efficiency gain.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A real-time semantic segmentation method based on a multi-scale structure, characterized in that: high-dimensional feature extraction is carried out on the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and spatial features are input into a feature fusion module for feature fusion, finally outputting a corresponding prediction map to realize the semantic segmentation task.
2. The method for real-time semantic segmentation based on multi-scale structures according to claim 1, characterized in that: the method comprises the following steps:
step 1: first, extract the high-dimensional feature map of the semantic branch using a residual network;
step 2: construct the spatial branch: pass the down-sampled 1/4-size feature map through a pooling layer and combine it with the up-sampled result of the down-sampled 1/16-size feature map to form the output feature map of the spatial branch, which serves as one input of the feature fusion module;
step 3: construct the semantic branch: pass the feature maps corresponding to the 4 Bottlenecks of ResNet through convolution layers to normalize each to a 128-dimensional feature map in the channel dimension, then merge them and compress to the channel dimension expected by the feature fusion module; after an up-sampling module, the result enters the feature fusion module;
step 4: perform feature fusion on the spatial features and semantic features obtained in step 2 and step 3, and finally output the corresponding prediction map to realize the semantic segmentation task.
3. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 1, high-dimensional features are extracted with a ResNet18 shallow convolutional neural network as the backbone model; semantic features are extracted layer by layer from the input image through predefined convolution blocks in the network, finally mapping the image to a 512-dimensional feature map at 1/32 of the original image size.
4. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 2, the spatial branch is generated by combining a selected layer of the ResNet feature extraction in the semantic branch with an up-sampling operation, serving as a feature map that supplements spatial detail information.
5. The method for real-time semantic segmentation based on multi-scale structures according to claim 2, characterized in that: in step 3, the semantic branch is constructed by applying 4 different types of convolution layers, combined with up-sampling, to the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original spatial size.
6. The method for real-time semantic segmentation based on multi-scale structures according to claim 5, characterized in that: all feature maps are merged by a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged feature map is reduced to 128 dimensions by a channel convolution for input to the feature fusion module.
7. The method for real-time semantic segmentation based on multi-scale structures according to claim 5, characterized in that: the 4 parallel convolution layers include standard convolution and dilated convolution to handle the large variation in receptive field across feature maps of different sizes, and the dilated convolution is used to enlarge semantic information while appropriately shrinking the feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110867844.6A CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110867844.6A CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313721A true | 2021-08-27
CN113313721B | 2021-11-19
Family
ID=77382422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110867844.6A Active CN113313721B (en) | 2021-07-30 | 2021-07-30 | Real-time semantic segmentation method based on multi-scale structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313721B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111127470A (en) * | 2019-12-24 | 2020-05-08 | 江西理工大学 | Image semantic segmentation method based on context and shallow space coding and decoding network |
US20200151497A1 (en) * | 2018-11-12 | 2020-05-14 | Sony Corporation | Semantic segmentation with soft cross-entropy loss |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151497A1 (en) * | 2018-11-12 | 2020-05-14 | Sony Corporation | Semantic segmentation with soft cross-entropy loss |
CN111127470A (en) * | 2019-12-24 | 2020-05-08 | 江西理工大学 | Image semantic segmentation method based on context and shallow space coding and decoding network |
Non-Patent Citations (3)
Title |
---|
MINGYUAN FAN et al.: "Rethinking BiSeNet For Real-time Semantic Segmentation", arXiv:2104.13188v1 *
REN Tianci et al.: "Semantic segmentation algorithm with a global bilateral network", Computer Science (计算机科学) *
QIN Feiwei et al.: "Real-time semantic segmentation method for scenes in autonomous driving", Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115995002A (en) * | 2023-03-24 | 2023-04-21 | 南京信息工程大学 | Network construction method and urban scene real-time semantic segmentation method |
CN115995002B (en) * | 2023-03-24 | 2023-06-16 | 南京信息工程大学 | Network construction method and urban scene real-time semantic segmentation method |
CN118397259A (en) * | 2024-03-04 | 2024-07-26 | 中国科学院空天信息创新研究院 | Semantic segmentation method, device, equipment and storage medium for SAR image |
Also Published As
Publication number | Publication date |
---|---|
CN113313721B (en) | 2021-11-19 |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant