CN113313721A - Real-time semantic segmentation method based on multi-scale structure - Google Patents

Real-time semantic segmentation method based on multi-scale structure

Info

Publication number
CN113313721A
Authority
CN
China
Prior art keywords
semantic
feature
features
semantic segmentation
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110867844.6A
Other languages
Chinese (zh)
Other versions
CN113313721B (en)
Inventor
练智超
贾稀贝
刘悦
陶叔银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN202110867844.6A
Publication of CN113313721A
Application granted
Publication of CN113313721B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time semantic segmentation method based on a multi-scale structure, which comprises: first, extracting high-dimensional features for the semantic information branch; then establishing a context semantic branch and a spatial branch; and finally inputting the semantic features and the spatial features into a feature fusion module for feature fusion and outputting the corresponding prediction map to realize the semantic segmentation task. The multiple convolutions embedded in the parallel semantic information branch integrate the corresponding features of different stages and generate a strong global context feature representation at low computational cost. Compared with BiSeNet, the method is faster with comparable accuracy: on the Cityscapes dataset with a ResNet18 backbone it reaches 195.7 FPS, far exceeding BiSeNet in semantic segmentation inference speed, which corresponds to about 45 more images at 512 × 1024 resolution processed per second.

Description

Real-time semantic segmentation method based on multi-scale structure
Technical Field
The invention belongs to the technical field of intelligent processing of image video information, and particularly relates to a real-time semantic segmentation method based on a multi-scale structure.
Background
Semantic segmentation is one of the important visual tasks in deep learning and a challenging technique for image understanding and scene analysis, with a very wide range of applications. In video processing in particular, the semantic segmentation task places high demands on fast inference (prediction) and real-time response of deep convolutional neural networks. Many excellent algorithms have emerged during the development of semantic segmentation and have raised the accuracy of segmentation models. Today, accuracy gains in semantic segmentation are slowing, and improving inference speed makes models more widely applicable.
Existing real-time semantic segmentation algorithms optimize the convolutional neural network structure and adopt model reduction methods such as model compression, knowledge distillation, and model pruning, usually trading accuracy for speed. Notably, improving the network structure is the most direct and feasible research direction for realizing real-time semantic segmentation.
In BiSeNet, a classical real-time semantic segmentation network, the semantic branch adopts a U-shape-like cascade structure. Although this structure restores the high-dimensional features to the original size, it still introduces considerable computation and slows the inference of the whole model. Because deep semantic feature maps have a large number of channels, operations such as convolution on them inevitably cause a sharp increase in computation.
Chinese patent application No. 202011137108.7 discloses a real-time semantic segmentation method based on spatial information guidance, which uses shallow spatial detail information to continuously guide deep global context features to propagate to their neighborhood, thereby effectively reconstructing the spatial information lost in the global context features. It adopts a single-stream segmentation method, and its network has a typical encoder-decoder structure. The encoder encodes the input picture to obtain a more abstract, more semantic feature representation. In the decoder, a lightweight bidirectional network is designed to decode the encoded features, and guidance from spatial detail information is introduced during decoding. Because that patent adopts a single-stream segmentation mode, it has a large number of parameters and low speed, and is not suitable for rapidly processing a large number of pictures.
Disclosure of Invention
The technical problem solved by the invention is to provide a real-time semantic segmentation method based on a multi-scale structure that has fewer parameters and a higher processing speed.
Technical scheme: to solve the above technical problem, the invention adopts the following technical scheme:
A real-time semantic segmentation method based on a multi-scale structure, in which high-dimensional features are extracted for the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and the spatial features are input into a feature fusion module for feature fusion, which finally outputs the corresponding prediction map to realize the semantic segmentation task.
The real-time semantic segmentation method based on the multi-scale structure specifically comprises the following steps:
Step 1: extract the high-dimensional feature map of the semantic branch using a residual network;
Step 2: construct the spatial branch: the feature map down-sampled to 1/4 size is passed through a pooling layer and combined with the up-sampled result of the feature map down-sampled to 1/16 size to form the output feature map of the spatial branch, which serves as one of the inputs of the feature fusion module;
Step 3: construct the semantic branch: the feature maps corresponding to the 4 Bottlenecks of ResNet are each passed through a convolution layer and normalized in the channel dimension to 128-dimensional feature maps, which are then merged and compressed to a channel dimension suitable as input to the feature fusion module, and enter the feature fusion module after passing through an up-sampling module;
Step 4: perform feature fusion on the spatial features and semantic features obtained in Step 2 and Step 3, and finally output the corresponding prediction map to realize the semantic segmentation task.
Preferably, in Step 1, high-dimensional features are extracted with the ResNet18 shallow convolutional neural network as the backbone model: semantic features are extracted from the input image layer by layer through the predefined convolutional blocks of the network, and the image is finally mapped to a 512-dimensional feature map at 1/32 of the original image size, so that the high-dimensional features are extracted effectively.
Preferably, in Step 2, the spatial branch is generated by combining one layer of the ResNet feature extractor in the semantic branch with an up-sampling operation, and serves as a feature map that supplements spatial detail information.
Preferably, in Step 3, the semantic branch is constructed by using 4 convolution layers of different types, combined with up-sampling, to convolve the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original image size.
Preferably, all feature maps are merged using a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged feature map is reduced to 128 dimensions by channel convolution for input to the feature fusion module.
Preferably, the 4 parallel convolution layers comprise standard convolution and dilated (expansion) convolution, to handle the large variation in receptive field among feature maps of different sizes; the dilated convolution is used to appropriately reduce the feature map size while increasing semantic information.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The invention mainly concerns improving real-time semantic segmentation network models. Starting from the existing bilateral segmentation network, it constructs a new fast real-time semantic segmentation architecture called SPCCNet (Spatial and Parallel Context Combined Network), a real-time semantic segmentation model with a greater advantage in segmentation inference speed. It uses new context semantic branch and spatial branch structures: the semantic information branch takes the step-by-step encoded outputs of the ResNet18 backbone as input and provides context information to the feature fusion module. Features from earlier stages typically contain rich low-level details, while later stages provide high-level semantics. Multiple convolutions embedded in the semantic information branch gather the corresponding features of the different stages and generate a strong global context feature representation at low computational cost. The spatial branch consists of a pooling layer, an up-sampling operator, and a projection convolution layer; these concise components provide more spatial detail to the network. The invention extracts spatial information in a two-stream manner and has fewer parameters and higher speed. Compared with BiSeNet, the algorithm achieves higher speed and comparable accuracy: on the Cityscapes dataset with a ResNet18 backbone, the mIoU is 72.34% and the FPS is 195.7.
Drawings
FIG. 1 is the overall structure diagram of the SPCCNet network of the present invention;
FIG. 2 shows the semantic branch in the SPCCNet of the present invention;
FIG. 3 shows the spatial branch in the SPCCNet of the present invention.
Detailed Description
The present invention will be further illustrated by the following specific examples, which are carried out on the premise of the technical scheme of the present invention, and it should be understood that these examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The real-time semantic segmentation method based on the multi-scale structure is developed on the BiSeNet real-time semantic segmentation network architecture. The model first uses the ResNet18 residual network to extract the high-dimensional feature map (512 × h × w) of the semantic branch. The feature map down-sampled to 1/4 size is passed through a pooling layer and combined with the up-sampled result of the feature map down-sampled to 1/16 size to form the output feature map of the spatial branch, which is one of the inputs of the feature fusion module. For the semantic branch, the feature maps corresponding to the 4 Bottlenecks of ResNet are each passed through a convolution layer and normalized in the channel dimension to 128-dimensional feature maps; these are merged and compressed to a channel dimension suitable as input to the feature fusion module, pass through an up-sampling module, and enter the Feature Fusion Module (FFM), which finally outputs the corresponding prediction map to realize the semantic segmentation task. The method specifically comprises the following steps:
Step 1, extracting high-dimensional features: first, the high-dimensional feature map (512 × h × w) of the semantic branch is extracted using the ResNet18 residual network.
High-dimensional feature extraction takes a shallow convolutional neural network such as ResNet18 as the backbone model. Semantic features are extracted from the input image layer by layer through the predefined convolutional blocks of the network, and the image is finally mapped to a 512-dimensional feature map at 1/32 of the original image size, so that the high-dimensional features are extracted effectively.
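As a quick check of the shapes involved in Step 1, the following Python sketch lists the output shape of each ResNet18 residual stage for a given input size. The stage channel widths (64, 128, 256, 512) and down-sampling factors (4, 8, 16, 32) are those of the standard ResNet18 architecture rather than values stated in this text, and the function and constant names are illustrative only.

```python
# Standard ResNet18 layout (assumption: not spelled out in the text):
# (output channels, cumulative down-sampling factor) of the four residual stages.
RESNET18_STAGES = [(64, 4), (128, 8), (256, 16), (512, 32)]


def stage_shapes(height, width):
    """Return the (channels, h, w) shape of each residual stage.

    Assumes height and width are divisible by 32, as is the case
    for the 512 x 1024 inputs used in the experiments.
    """
    shapes = []
    for channels, stride in RESNET18_STAGES:
        shapes.append((channels, height // stride, width // stride))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(stage_shapes(512, 1024), start=1):
        print("stage", index, shape)
```

For a 512 × 1024 input, the last stage yields a 512 × 16 × 32 map, which matches the 512-dimensional feature map at 1/32 of the original size described in Step 1.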
Step 2, improving the spatial branch: the feature map down-sampled to 1/4 size is passed through a pooling layer and combined with the up-sampled result of the feature map down-sampled to 1/16 size to form the output feature map of the spatial branch, which is one of the inputs of the feature fusion module.
The main flow of the spatial branch improvement is as follows. In the spatial branch, the original BiSeNet network uses one convolution with a large kernel followed by two convolution layers with 3 × 3 kernels to extract spatial information and encode the detail of the original image, finally mapping the image to a feature map of 1/8 size to supplement the spatial information missing from the semantic branch. This is a strategy for accelerating the network while preserving accuracy, whereas other algorithms typically accelerate by reducing the input image resolution. Following the idea of "weight sharing", the present algorithm removes the three convolution layers that the original network uses to extract spatial information, and instead generates the spatial branch by combining one layer of the ResNet feature extractor in the semantic branch (the penultimate layer) with an up-sampling operation, as a feature map that supplements spatial detail information. Because ResNet does not completely lose spatial information during feature extraction, a reasonable up-sampling operation can generate the required feature map with less computation; and because the feature map is taken from an intermediate ResNet layer, it is still rich in deep semantic information, which helps improve segmentation accuracy. In addition, to fuse more of the original spatial detail, the algorithm also obtains a shallow feature map at 1/8 of the original size from the output of the first Bottleneck in ResNet, using a computationally cheap pooling operation.
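The data flow of the spatial branch can be illustrated with a minimal pure-Python sketch on nested lists. It assumes 2 × 2 average pooling for the 1/4-size map, nearest-neighbour ×2 up-sampling for the 1/16-size map, and channel-wise concatenation as the "combining" operation; the text does not specify these choices, and the projection convolution mentioned elsewhere in the description is omitted. All function names are hypothetical.

```python
def avg_pool_2x2(channel):
    """2x2 average pooling with stride 2 on one channel (a list of rows).

    Assumes even height and width.
    """
    height, width = len(channel), len(channel[0])
    return [
        [
            (channel[y][x] + channel[y][x + 1]
             + channel[y + 1][x] + channel[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def upsample_nearest_2x(channel):
    """Nearest-neighbour up-sampling by a factor of 2 on one channel."""
    out = []
    for row in channel:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def spatial_branch(shallow_channels, deep_channels):
    """Bring a 1/4-scale map and a 1/16-scale map to a common size and concatenate.

    shallow_channels: channels of the 1/4-size feature map (pooled by 2).
    deep_channels: channels of the 1/16-size feature map (up-sampled by 2).
    Returns the concatenated list of channels.
    """
    pooled = [avg_pool_2x2(c) for c in shallow_channels]
    upsampled = [upsample_nearest_2x(c) for c in deep_channels]
    return pooled + upsampled
```

With a 16 × 16 toy image, a 4 × 4 shallow channel and a 1 × 1 deep channel both end up as 2 × 2 maps, so they can be stacked along the channel axis before entering the feature fusion module.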
Step 3, improving the semantic branch: the feature maps corresponding to the 4 Bottlenecks of ResNet are each passed through a convolution layer and normalized in the channel dimension to 128-dimensional feature maps, which are merged and compressed to a channel dimension suitable as input to the feature fusion module and enter the feature fusion module after passing through an up-sampling module.
The main flow of the semantic branch improvement is as follows. The cascade structure of the decoder stage is removed, together with the ARM and Refine modules that the original network uses to enhance features, so that only the backbone of the ResNet residual network is retained. Based on how semantic and spatial information change across the four convolution stages of ResNet, 4 convolution layers of different types (with kernel sizes of 5 × 5, 3 × 3, 1 × 1, and 1 × 1 respectively) are designed and combined with up-sampling to convolve the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original image size. The feature maps output at different stages carry different levels of information. For example, the low-channel 64-dimensional feature map obtained through the dilated (expansion) convolution contains rich spatial information and a small amount of semantic information, and its spatial information is transferred to the channel dimension; the 512-dimensional feature map, after up-sampling and standard convolution, contains rich semantic information and a small amount of spatial information, and its semantic information is transferred to the spatial scale.
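The bookkeeping of the parallel semantic branch can be sketched in Python as follows. The sketch only tracks shapes: it assumes the standard ResNet18 stage layout, assumes the listed kernel sizes apply to stages 1 to 4 in that order, and leaves open how each resampling is realized (stride, dilation, or interpolation). The names are illustrative and not taken from the patent.

```python
TARGET_STRIDE = 16            # every stage is brought to 1/16 of the input size
PROJ_CHANNELS = 128           # channel width after each per-stage convolution
KERNEL_SIZES = [5, 3, 1, 1]   # kernel sizes from the text, assumed in stage order
# Standard ResNet18 stages as (channels, down-sampling factor); an assumption.
STAGES = [(64, 4), (128, 8), (256, 16), (512, 32)]


def semantic_branch_plan(height, width):
    """Describe what each parallel convolution must do to reach 128 x H/16 x W/16."""
    plan = []
    for (in_channels, stride), kernel in zip(STAGES, KERNEL_SIZES):
        if stride < TARGET_STRIDE:
            resample = "downsample x%d" % (TARGET_STRIDE // stride)
        elif stride > TARGET_STRIDE:
            resample = "upsample x%d" % (stride // TARGET_STRIDE)
        else:
            resample = "keep size"
        plan.append({
            "in_channels": in_channels,
            "kernel": kernel,
            "resample": resample,
            "out_shape": (PROJ_CHANNELS,
                          height // TARGET_STRIDE,
                          width // TARGET_STRIDE),
        })
    return plan


def concat_channels(plan):
    """Channel count after concatenating all per-stage outputs."""
    return sum(step["out_shape"][0] for step in plan)


if __name__ == "__main__":
    steps = semantic_branch_plan(512, 1024)
    for step in steps:
        print(step)
    print("channels after Concat:", concat_channels(steps))
```

Under these assumptions the four 128-channel maps concatenate to 512 channels, which is why a further channel convolution is needed to return to 128 dimensions before the feature fusion module.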
All the feature maps are then merged using a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information. Channel convolution then reduces the merged feature map to 128 dimensions for input to the feature fusion module; because the feature fusion module requires two input feature maps of the same size, the feature map is finally up-sampled. The 4 parallel convolution layers comprise standard convolution and dilated (expansion) convolution, to handle the large variation in receptive field among feature maps of different sizes. The convolution operation and the change in receptive field are defined as follows:
[Equation image: definitions of the dilated convolution output y[i, j] and of the receptive field r_n]
where y is the output of the dilated convolution and y[i, j] is its output value at the point with abscissa i and ordinate j; r is the dilation rate (the operation degenerates to a standard convolution when r = 1); x is the input feature map; K is the convolution kernel size, and k runs from 1 to K in the summation; w[k] is the corresponding convolution kernel weight; and i, j index the corresponding value on the feature map. n indexes the convolution layers, r_n is the receptive field size of the n-th convolution layer, k_n is the convolution step size of the n-th convolution operation, and s_i is the step size of the i-th convolution layer. Dilated convolution is applied only to the feature map of the first ResNet stage; the other stages all use standard convolution. Because the first stage involves only a limited number of convolution operations, its feature map has richer spatial information but a small receptive field. In some verification experiments, loss propagation dominated by the feature map of this stage was tried, but the practical effect was poor, because the shallower layers of the network seriously lack semantic information. The algorithm therefore uses dilated convolution to increase semantic information while appropriately reducing the feature map size.
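For illustration, a one-dimensional Python sketch of dilated convolution and of a commonly used receptive-field recurrence is given below. It assumes zero-based tap indices, "valid" borders, and the standard recurrence in which each layer adds (k_eff − 1) times the product of the earlier strides, with k_eff = rate · (kernel − 1) + 1. These are textbook forms and may differ in detail from the patent's own equations; the function names are hypothetical.

```python
def dilated_conv1d(x, w, rate=1):
    """Valid 1-D dilated convolution: y[i] = sum_k x[i + rate * k] * w[k].

    With rate == 1 this reduces to a standard (correlation-style) convolution.
    """
    span = rate * (len(w) - 1)  # distance between the first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride, rate).

    Assumed recurrence: rf_n = rf_(n-1) + (k_eff - 1) * product of earlier strides,
    where k_eff = rate * (kernel - 1) + 1 is the effective kernel size.
    """
    rf, jump = 1, 1
    for kernel, stride, rate in layers:
        effective_kernel = rate * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]
    print(dilated_conv1d(signal, [1, 1, 1], rate=1))
    print(dilated_conv1d(signal, [1, 1, 1], rate=2))
    print(receptive_field([(3, 1, 2)]))
```

The example shows the design trade-off the text relies on: a 3-tap kernel with rate 2 covers 5 input positions with the same number of weights, enlarging the receptive field without extra parameters.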
Further, the metrics used to evaluate the segmentation quality of the model include mIoU, mAcc, and allAcc. mIoU (mean Intersection over Union) is the most classical evaluation metric: IoU is the ratio of the intersection to the union of the set of ground-truth values and the set of predicted values, and mIoU computes IoU for each prediction category and then averages, as defined by:
$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}$$
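mAcc and allAcc use the proportion of true-positive pixels among the total pixels of the correct class to evaluate the semantic segmentation quality, and are defined as follows:
$$\mathrm{mAcc} = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ji}}$$
$$\mathrm{allAcc} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}}$$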
An mIoU_back (mIoU with no background) metric is also used; it represents the mean intersection over union computed after removing the background class.
Here, k is the total number of prediction classes; p_ii is the number of pixels predicted as class i whose true class is i, i.e., True Positives (TP); p_ij is the number of pixels predicted as class i but actually of class j, i.e., False Positives (FP); and p_ji is the number of pixels predicted as class j but actually of class i, i.e., False Negatives (FN).
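The accuracy metrics can be computed from a confusion matrix, as in the following Python sketch. It follows the index convention given above (entry p[i][j] counts pixels predicted as class i whose true class is j) and uses the standard definitions of mIoU, mAcc, and allAcc, which are assumed to correspond to the patent's formulas; the function name is hypothetical.

```python
def segmentation_metrics(p):
    """Compute mIoU, mAcc and allAcc from a k x k confusion matrix.

    p[i][j] = number of pixels predicted as class i whose true class is j.
    Classes that never occur contribute 0 to the averages (a simplification).
    """
    k = len(p)
    ious, accs = [], []
    for i in range(k):
        true_positive = p[i][i]
        predicted_as_i = sum(p[i])                      # TP + FP
        actually_i = sum(p[j][i] for j in range(k))     # TP + FN
        union = predicted_as_i + actually_i - true_positive
        ious.append(true_positive / union if union else 0.0)
        accs.append(true_positive / actually_i if actually_i else 0.0)
    total = sum(sum(row) for row in p)
    correct = sum(p[i][i] for i in range(k))
    return {
        "mIoU": sum(ious) / k,
        "mAcc": sum(accs) / k,
        "allAcc": correct / total,
    }


if __name__ == "__main__":
    print(segmentation_metrics([[3, 1], [1, 5]]))
```

An mIoU_back style variant could be obtained by dropping the background class from the per-class IoU list before averaging; that step is not shown here because the text does not say how the background class is indexed.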
Because the focus here is on real-time semantic segmentation models, the segmentation inference speed of the model is also evaluated. FPS (Frames Per Second), the number of frames predicted per second, is used to evaluate the semantic segmentation speed and is defined as follows:
$$\mathrm{FPS} = \frac{N}{\sum_{i=1}^{N} t_i}$$
where N is the number of test pictures and t_i is the time taken to predict the segmentation of the i-th picture.
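The FPS measure described here, the number of test pictures divided by the total prediction time, can be sketched in Python as follows. The per-image times are assumed to be in seconds, and the function name is illustrative.

```python
def frames_per_second(times_seconds):
    """FPS = N / sum(t_i), where t_i is the prediction time of image i in seconds."""
    total = sum(times_seconds)
    if not times_seconds or total <= 0:
        raise ValueError("need at least one positive timing measurement")
    return len(times_seconds) / total


if __name__ == "__main__":
    # Four hypothetical measurements of 5 ms each.
    print(frames_per_second([0.005, 0.005, 0.005, 0.005]))
```

Averaging over many images, as the speed test in this description does, reduces the influence of individual slow or fast frames on the reported figure.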
Step 4: feature fusion is performed on the spatial features and semantic features obtained in Step 2 and Step 3, and the corresponding prediction map is finally output to realize the semantic segmentation task.
To verify the effectiveness of the model, the same training strategy is adopted throughout the experiments to ensure fairness. SPCCNet uses ResNet18 as the backbone network to extract features. The initial learning rate is set to base_lr = 1e-2 with power = 0.9, and the learning rate is changed according to the poly strategy base_lr × (1 − iter/total_iter)^power; weight_decay and momentum are set to 5e-4 and 0.9 respectively. The algorithm uses the OHEM cross-entropy loss function to mitigate the problem of unevenly distributed samples, defined as follows:
$$L = l_m + \alpha \sum_{i=1}^{A} l_i(X_i)$$
where L is the joint loss function, l_m is the main loss function of the output prediction map, l_i is the auxiliary loss function of the i-th stage, X_i is the feature map of the i-th stage of the model, α balances the ratio of the main and auxiliary loss functions, and A is the number of auxiliary branches, with A = α = 1. The main loss function l_m and the auxiliary loss functions l_i of the different stages are calculated as follows:
[Equation image: loss function expressed in terms of N, W_k, p_i, and p_k]
where N is the training batch size, W_k is the loss weight of the k-th class, p_i is the probability that the pixel belongs to class j, and p_k is the probability that the pixel belongs to class k.
The auxiliary prediction branch is an effective strategy for strengthening training. Adding auxiliary branches strengthens the network's learning of features at different scales and accelerates network convergence, and the auxiliary branches do not affect the computation or inference speed of the prediction stage. In the actual training stage, simple auxiliary prediction branches consisting of three convolution layers are added at different positions, and α adjusts the proportion of the different branches to guide the training of the network. As can be seen from the loss function, auxiliary weights W_i are used for the different prediction classes to balance the differences introduced by the samples. In addition, when the total loss is calculated, weights are preset for all categories and the loss is computed with a weighted-average strategy.
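A minimal Python sketch of the loss structure described here follows: a class-weighted cross-entropy averaged by the class weights, and a joint loss that adds α times the auxiliary losses to the main loss. This is an illustration under stated assumptions, not the patent's exact formula: the OHEM hard-example selection is not implemented, the inputs are assumed to be already-normalized class probabilities, and all names are hypothetical.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy with a weighted average over pixels.

    probs: list of per-pixel probability lists (each sums to 1).
    labels: list of true class indices, one per pixel.
    class_weights: preset weight for each class.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pixel_probs, label in zip(probs, labels):
        weight = class_weights[label]
        weighted_sum += -weight * math.log(pixel_probs[label])
        weight_total += weight
    return weighted_sum / weight_total


def joint_loss(main_loss, aux_losses, alpha=1.0):
    """Joint loss: main loss plus alpha times the sum of auxiliary losses."""
    return main_loss + alpha * sum(aux_losses)


if __name__ == "__main__":
    main = weighted_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1], [1.0, 1.0])
    print(joint_loss(main, [0.5], alpha=1.0))
```

Because the auxiliary terms only enter the training objective, dropping the auxiliary branches at prediction time leaves the inference cost unchanged, which is the property the text emphasizes.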
For data augmentation, the image is randomly scaled by a factor chosen from [0.75, 1, 1.25, 1.5, 1.75, 2.0], randomly flipped horizontally with 50% probability, randomly rotated by −10 to 10 degrees, and perturbed with random Gaussian noise, among other operations. In addition, because the resolution of Cityscapes pictures is very high, the picture is randomly cropped and finally normalized to a size of 768 × 1536 to prevent memory overflow.
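The random choices in this augmentation recipe can be sketched in Python as follows. The sketch only samples the parameters (scale factor, flip decision, rotation angle) and does not apply them to an image; Gaussian noise and cropping are left out because their settings are not given in the text. The function name and the dictionary layout are hypothetical.

```python
import random

# Scale factors listed in the text.
SCALES = [0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
OUTPUT_SIZE = (768, 1536)  # final height x width after cropping and resizing


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from a random.Random instance."""
    return {
        "scale": rng.choice(SCALES),          # random enlargement or reduction
        "flip": rng.random() < 0.5,           # horizontal flip with 50% probability
        "angle": rng.uniform(-10.0, 10.0),    # rotation in degrees
        "output_size": OUTPUT_SIZE,
    }


if __name__ == "__main__":
    generator = random.Random(0)  # fixed seed for reproducibility
    print(sample_augmentation(generator))
```

Passing an explicit random.Random instance keeps the sampling reproducible, which helps when the same training strategy must be reused across compared models.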
For training on the Cityscapes dataset, a total of 19 classes are trained. The number of training epochs is set to 80, and 1000 iterations are performed within each epoch with batch_size = 8 to ensure that all training samples are used in each epoch.
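The poly learning-rate strategy stated in the training setup, base_lr × (1 − iter/total_iter)^power with base_lr = 1e-2 and power = 0.9, can be sketched in Python as follows. The total of 80 × 1000 iterations is derived from the epoch and iteration counts given here; the function name is illustrative.

```python
EPOCHS = 80
ITERS_PER_EPOCH = 1000
TOTAL_ITER = EPOCHS * ITERS_PER_EPOCH  # 80,000 iterations overall


def poly_lr(iteration, total_iter=TOTAL_ITER, base_lr=1e-2, power=0.9):
    """Poly schedule: base_lr * (1 - iteration / total_iter) ** power."""
    return base_lr * (1.0 - iteration / total_iter) ** power


if __name__ == "__main__":
    for step in (0, 20000, 40000, 60000, 80000):
        print(step, poly_lr(step))
```

The schedule starts at base_lr and decays smoothly to zero at the final iteration, slightly faster than linearly early on because power is below 1.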
For the speed test of the model, 5000 pictures are segmented and the average is taken to ensure the accuracy of the measured speed.
In addition to the classical mIoU and allAcc semantic segmentation accuracy metrics, the experiments use an mIoU_back (mIoU with no background) metric, which represents the mean intersection over union computed after removing the background class. The segmentation inference speed of the model is evaluated with FPS (Frames Per Second), the number of frames predicted per second, defined as follows:
$$\mathrm{FPS} = \frac{N}{\sum_{i=1}^{N} t_i}$$
where N is the number of test pictures and t_i is the time taken to predict the segmentation of the i-th picture.
TABLE 1 Comprehensive performance comparison of semantic segmentation models
Model     Backbone   #Param   mIoU (%)   allAcc (%)   FPS
BiSeNet   ResNet18   101MB    73.01      95.42        150.6
Ours      ResNet18   91MB     72.34      95.22        195.7
Table 1 compares the SPCCNet segmentation model (Ours), with its improved semantic and spatial branches, against the BiSeNet network. With ResNet18 as the backbone network, the SPCCNet model has fewer parameters (91MB versus 101MB) and is more streamlined. With an input resolution of 512 × 1024, it far exceeds BiSeNet in semantic segmentation inference speed (195.7 versus 150.6 FPS); that is, about 45 more images at 512 × 1024 resolution are processed per second, a substantial efficiency gain.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A real-time semantic segmentation method based on a multi-scale structure, characterized in that: high-dimensional feature extraction is performed for the semantic information branch; a context semantic branch and a spatial branch are established; and the semantic features and the spatial features are input into a feature fusion module for feature fusion, which finally outputs the corresponding prediction map to accomplish the semantic segmentation task.
2. The real-time semantic segmentation method based on a multi-scale structure according to claim 1, characterized in that the method comprises the following steps:
step 1: extracting the high-dimensional feature map of the semantic branch using a residual network;
step 2: constructing the spatial branch: passing the feature map at 1/4 downsampled size, taken from the high-dimensional feature extraction, through a pooling layer, and combining it with the upsampled result of the feature map at 1/16 downsampled size to form the output feature map of the spatial branch, which serves as one of the inputs of the feature fusion module;
step 3: constructing the semantic branch: passing the feature maps corresponding to the 4 Bottleneck stages of ResNet through respective convolution layers, normalizing them in the channel dimension into 128-dimensional feature maps, then merging and compressing them to a channel dimension suitable for input to the feature fusion module, and feeding the result into the feature fusion module after it passes through a sampling module;
step 4: performing feature fusion on the spatial features and the semantic features obtained in step 2 and step 3, and finally outputting the corresponding prediction map to accomplish the semantic segmentation task.
3. The real-time semantic segmentation method based on a multi-scale structure according to claim 2, characterized in that: in the high-dimensional feature extraction of step 1, a shallow ResNet18 convolutional neural network is used as the backbone model; semantic features are extracted from the input image layer by layer through the predefined convolutional blocks of the network, and the image is finally mapped to a 512-dimensional feature map at 1/32 of the original image size, thereby extracting the high-dimensional features.
4. The real-time semantic segmentation method based on a multi-scale structure according to claim 2, characterized in that: in the spatial branch construction of step 2, a certain layer of the features extracted by ResNet in the semantic branch is combined with an upsampling operation to generate the spatial branch, which serves as a feature map supplementing spatial detail information.
5. The real-time semantic segmentation method based on a multi-scale structure according to claim 2, characterized in that: in the semantic branch construction of step 3, 4 different types of convolution layers, combined with upsampling, perform convolution operations on the feature maps of the 4 residual stages, so that each residual stage outputs a 128-channel feature map at 1/16 of the original image's spatial size.
6. The real-time semantic segmentation method based on a multi-scale structure according to claim 5, characterized in that: all the feature maps are merged using a Concat layer, fully aggregating deep, coarse, semantic information with shallow, detailed, spatial information, and the merged feature map is reduced to 128 dimensions by channel convolution for input into the feature fusion module.
7. The real-time semantic segmentation method based on a multi-scale structure according to claim 5, characterized in that: the 4 convolution layers of the parallel structure comprise standard convolutions and dilated convolutions, so as to address the large variation in receptive field among feature maps of different sizes; the dilated convolutions increase semantic information while appropriately reducing the size of the feature maps.
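The shape bookkeeping implied by claims 3 and 5 to 7 can be sketched as follows. This is only an illustration under stated assumptions: the stage channel counts and downsampling factors are the standard ResNet18 values, while the per-stage kernel, stride, padding, dilation, and upsampling settings in `BRANCH_CFG` are hypothetical choices made so that every stage reaches 1/16 resolution; they are not taken from the patent. The names `conv_out`, `STAGES`, `BRANCH_CFG`, and `semantic_branch_shapes` are likewise introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


INPUT_H, INPUT_W = 512, 1024

# ResNet18 residual stages: (output channels, downsampling factor w.r.t. the input).
STAGES = [(64, 4), (128, 8), (256, 16), (512, 32)]

# Hypothetical per-stage layer: (upsample factor, kernel, stride, padding, dilation).
BRANCH_CFG = [
    (1, 3, 4, 2, 2),  # 1/4 map: strided dilated convolution, down to 1/16
    (1, 3, 2, 2, 2),  # 1/8 map: strided dilated convolution, down to 1/16
    (1, 3, 1, 1, 1),  # 1/16 map: standard convolution, size preserved
    (2, 3, 1, 1, 1),  # 1/32 map: 2x upsampling, then standard convolution
]
OUT_CHANNELS = 128  # every stage is projected to 128 channels


def semantic_branch_shapes(h, w):
    """Return (channels, height, width) of each stage after its parallel layer."""
    shapes = []
    for (_, factor), (up, k, s, p, d) in zip(STAGES, BRANCH_CFG):
        fh, fw = h // factor * up, w // factor * up
        shapes.append((OUT_CHANNELS, conv_out(fh, k, s, p, d), conv_out(fw, k, s, p, d)))
    return shapes


shapes = semantic_branch_shapes(INPUT_H, INPUT_W)
concat_channels = sum(c for c, _, _ in shapes)   # channels after the Concat layer
reduce_weights = concat_channels * OUT_CHANNELS  # 1x1 convolution weights, bias excluded

print(shapes)
print(concat_channels, reduce_weights)
```

With these assumed settings, all four stages produce maps of identical size, which is the precondition for concatenating them along the channel axis before the channel convolution reduces the result to 128 dimensions.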

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110867844.6A CN113313721B (en) 2021-07-30 2021-07-30 Real-time semantic segmentation method based on multi-scale structure

Publications (2)

Publication Number Publication Date
CN113313721A (en) 2021-08-27
CN113313721B (en) 2021-11-19

Family

ID=77382422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110867844.6A Active CN113313721B (en) 2021-07-30 2021-07-30 Real-time semantic segmentation method based on multi-scale structure

Country Status (1)

Country Link
CN (1) CN113313721B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151497A1 (en) * 2018-11-12 2020-05-14 Sony Corporation Semantic segmentation with soft cross-entropy loss
CN111127470A (en) * 2019-12-24 2020-05-08 Jiangxi University of Science and Technology Image semantic segmentation method based on context and shallow space coding and decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
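MINGYUAN FAN et al.: "Rethinking BiSeNet For Real-time Semantic Segmentation", arXiv:2104.13188v1 *
REN Tianci (任天赐) et al.: "Semantic segmentation algorithm based on global bilateral networks", Computer Science (计算机科学) *
QIN Feiwei (秦飞巍) et al.: "Real-time semantic segmentation method for scenes in autonomous driving", Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报) *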

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115995002A (en) * 2023-03-24 2023-04-21 Nanjing University of Information Science and Technology Network construction method and urban scene real-time semantic segmentation method
CN115995002B (en) * 2023-03-24 2023-06-16 Nanjing University of Information Science and Technology Network construction method and urban scene real-time semantic segmentation method
CN118397259A (en) * 2024-03-04 2024-07-26 Aerospace Information Research Institute, Chinese Academy of Sciences Semantic segmentation method, device, equipment and storage medium for SAR image

Also Published As

Publication number Publication date
CN113313721B (en) 2021-11-19

Similar Documents

Publication Publication Date Title
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN113850825A (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN113221969A (en) Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN114119638A (en) Medical image segmentation method integrating multi-scale features and attention mechanism
CN111898439B (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN111062395B (en) Real-time video semantic segmentation method
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN113392711B (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN114332094A (en) Semantic segmentation method and device based on lightweight multi-scale information fusion network
CN114821342A (en) Remote sensing image road extraction method and system
CN113658200A (en) Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113628297A (en) COVID-19 deep learning diagnosis system based on attention mechanism and transfer learning
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN117058542A (en) Multi-scale high-precision light-weight target detection method based on large receptive field and attention mechanism
Hu et al. LDPNet: A lightweight densely connected pyramid network for real-time semantic segmentation
CN115937693A (en) Road identification method and system based on remote sensing image
CN115222750A (en) Remote sensing image segmentation method and system based on multi-scale fusion attention
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN116631190A (en) Intelligent traffic monitoring system and method thereof
CN115731226A (en) Method for segmenting focus in skin mirror image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant