CN112396607A - Streetscape image semantic segmentation method for deformable convolution fusion enhancement - Google Patents

Streetscape image semantic segmentation method for deformable convolution fusion enhancement

Info

Publication number
CN112396607A
CN112396607A (application CN202011291950.6A); granted as CN112396607B
Authority
CN
China
Prior art keywords
feature
image
semantic segmentation
fusion
module
Prior art date
Legal status
Granted
Application number
CN202011291950.6A
Other languages
Chinese (zh)
Other versions
CN112396607B (en)
Inventor
张珣
秦晓海
刘宪圣
张浩轩
江东
张迎春
付晶莹
Current Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS and Beijing Technology and Business University
Priority to CN202011291950.6A
Publication of CN112396607A
Application granted
Publication of CN112396607B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/10 Image analysis: segmentation; edge detection
    • G06F 18/241 Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2431 Pattern recognition: classification techniques relating to the number of classes; multiple classes
    • G06F 18/253 Pattern recognition: fusion techniques of extracted features
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/08 Neural networks: learning methods
    • G06T 5/30 Image enhancement or restoration using local operators: erosion or dilatation, e.g. thinning
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/20081 Indexing scheme for image analysis or image enhancement: training; learning
    • G06T 2207/20084 Indexing scheme for image analysis or image enhancement: artificial neural networks [ANN]
    • G06T 2207/20221 Indexing scheme for image analysis or image enhancement: image fusion; image merging
    • Y02T 10/40 Climate change mitigation technologies related to transportation: engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a streetscape image semantic segmentation method with deformable convolution fusion enhancement, comprising a training stage and a testing stage. A streetscape image semantic segmentation deep neural network model is constructed so that the network obtains more small-target feature information while segmenting large target objects in the streetscape image. This alleviates the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, improves the image segmentation effect, gives the model better overall robustness, and yields higher streetscape image processing accuracy.

Description

Streetscape image semantic segmentation method for deformable convolution fusion enhancement
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and particularly relates to a streetscape image semantic segmentation method based on deformable convolution fusion enhancement.
Background
Image semantic segmentation is an important branch of computer vision in the field of artificial intelligence and an important link in understanding and analyzing images in machine vision. Image semantic segmentation accurately classifies each pixel in an image into the category to which it belongs, so that the result is consistent with the visual content of the image; the image semantic segmentation task is therefore also called a pixel-level image classification task. At present, semantic segmentation is widely applied to scenarios such as automatic driving and unmanned aerial vehicle landing point judgment.
Convolutional neural networks have been successful in image classification, localization, and scene understanding. With the proliferation of tasks such as augmented reality and autonomous driving, many researchers have turned their attention to scene understanding, one of whose main steps is semantic segmentation, i.e., the classification of each pixel in a given image. Semantic segmentation is of great significance in mobile and robotics-related applications.
Unlike image classification, image semantic segmentation is more difficult, because the class of each pixel must be determined by combining detailed local information with global context information. A backbone network is therefore commonly used to extract global features, after which the feature resolution is reconstructed by combining shallow features from the backbone network to restore the original image size. The resolution of the feature maps thus first decreases and then increases; the former part is generally called the encoding network and the latter the decoding network. Classical semantic segmentation methods include the Fully Convolutional Network (FCN) and the DeepLab series of networks, which achieve good pixel accuracy, mean pixel accuracy and mean intersection over union on road scene segmentation databases. These conventional networks downsample, and the upsampling recovers the small-sized high-dimensional feature maps so that pixel-wise predictions can be made and the classification of each point obtained. Although the FCN performs upsampling, the lost information cannot be recovered completely without loss; on this basis, the DeepLab series adds dilated (atrous) convolution to enlarge the receptive field, which alleviates but does not fully control the information loss. These methods therefore affect the accuracy of image semantic segmentation due to information loss, and the segmentation effect is especially poor for small target objects.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a streetscape image semantic segmentation method with deformable convolution fusion enhancement. A streetscape image semantic segmentation deep neural network model is constructed so that the network obtains more small-target feature information while segmenting large targets in the streetscape image, thereby alleviating the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, giving the model better overall robustness, higher streetscape image processing accuracy and an improved image segmentation effect.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a deformable convolution fusion enhanced streetscape image semantic segmentation method is characterized by comprising a training stage and a testing stage, and comprises the following steps:
1) Constructing an image training set: the training set comprises original images and the corresponding semantic label images, and this group of image data is input into the constructed network to take part in training.
1_1) Select N original street view images and the corresponding semantic segmentation label gray-scale images to form the training set, and denote the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}. The original street view images are RGB color images and the corresponding label images are gray-scale images; N is a positive integer with N ≥ 500; n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Jn(i, j)} and H denotes its height; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}. Meanwhile, in order to evaluate the designed model properly, the true segmentation label image corresponding to each original street view image and its semantic label image in the training set is used as the training target. Then the semantic segmentation label gray-scale image corresponding to each original street view image in the training set is processed into a one-hot encoded image by the one-hot encoding technique. In a specific implementation, the street view image objects are divided into 19 classes, and the set formed by the true semantic segmentation label images corresponding to the original street view images is denoted as the training target set.
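For illustration, the one-hot encoding step described above can be sketched as follows with NumPy; the 19-class setting follows the text, while the function and variable names are assumptions rather than the patent's own implementation:

```python
import numpy as np

NUM_CLASSES = 19  # street view object classes, as stated above

def one_hot_encode(label_gray: np.ndarray, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Convert an (H, W) gray-scale label image whose pixel values are class
    indices 0..num_classes-1 into an (H, W, num_classes) one-hot encoded image."""
    h, w = label_gray.shape
    one_hot = np.zeros((h, w, num_classes), dtype=np.float32)
    valid = label_gray < num_classes                  # ignore pixels outside the 19 classes
    rows, cols = np.nonzero(valid)
    one_hot[rows, cols, label_gray[rows, cols]] = 1.0
    return one_hot

# example: a 512x1024 label image filled with random class indices
dummy_label = np.random.randint(0, NUM_CLASSES, size=(512, 1024))
encoded = one_hot_encode(dummy_label)                 # shape (512, 1024, 19)
```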
2) Constructing and training the streetscape image semantic segmentation deep neural network model: the streetscape image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer comprises two parts of a multi-resolution fusion enhancement network that introduces deformable convolution: the first stage is the first part, and the second and third stages are the second part. In the first stage, a deformable convolution is connected in series with an Xception Module to form a fusion sub-network A, and sub-network A is connected in series and repeated three times, so that more deep semantic feature information can be obtained. The Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+. The second stage is a two-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series and is used for feature extraction and feature fusion. The third stage is a three-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series, with the same function as in the second stage. Concretely, the first stage consists of three repetitions of a deformable convolution module connected in series with an Xception Module, where all convolution kernels are 3×3, and the Xception Module in this invention consists of 3 convolution layers of 3×3 depthwise separable convolutions with stride 1 and padding 1. The method comprises the following steps:
2_1) the input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the components to a hidden layer;
for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image with width W and height H, and the output end of the input layer outputs R, G, B three-channel components of the original input image to the hidden layer;
2_2) the first stage of the hidden layer comprises a fusion enhancement Module constructed by connecting a deformable convolution Module with an Xception Module in series, three fusion enhancement modules are repeatedly stacked, and a plurality of feature maps are generated in sequence through the three fusion enhancement modules;
the hidden layer first stage is composed of three fusion enhancement modules, and each fusion enhancement Module is mainly composed of a deformable convolution Module and a lightweight Xception Module connected in series. The position information of the spatial sampling is further adjusted in the module by adopting a deformable convolution unit, and the displacement can be obtained by learning in a target task without an additional supervision signal. The offset added in the deformable convolution unit is a part of a deep convolution neural network model structure segmented by street view image semantics, in an input feature map, an original part obtains a part of feature regions through a sliding window, and after the deformable convolution is introduced, the original convolution network is divided into two paths to share the feature map. One of the two paths uses a parallel standard convolution to learn offset, and meanwhile, the gradient back propagation can also normally learn end to end, thereby ensuring the successful integration of the deformable convolution. After learning of the offset is added, the size and the position of the deformable convolution kernel can be dynamically adjusted according to the image content which needs to be identified currently, and the visual effect is that the positions of sampling points of the convolution kernels at different positions can be changed in a self-adaptive mode according to the image content, so that the method is suitable for geometric deformation such as the shape, the size and the like of different objects in the image content. The concrete expression is as follows: the displacement required by the deformable convolution is obtained through the output of a parallel standard convolution, and then the displacement is acted on a convolution kernel to achieve the effect of the deformable convolution. When the picture pixels are integrated, the pixels need to be subjected to offset operation, the generation of the offset can generate a floating point type, the offset needs to be converted into an integer type, the inverse propagation cannot be carried out if the offset is directly rounded, and at the moment, a bilinear difference value mode is adopted to obtain the corresponding pixels.
The convolution involves two steps: 1) sampling on the input feature map x using a regular grid R; 2) a weighted sum of the sampled values with weights w. The grid R defines the size and dilation of the receptive field, as in equation (1):
R={(-1,-1),(-1,0),...,(0,1),(1,1)} (1)
which defines a 3×3 convolution kernel with dilation 1.
The definition of convolution is: for each pixel position p0 in the output, the ordinary convolution is computed as in equation (2):

y(p0) = Σ_{pk ∈ R} w(pk) · x(p0 + pk)    (2)

where y(p0) is the feature map value corresponding to position p0; p0 is each position on the output feature map y; x is the input feature map; R represents the receptive-field grid, here 3×3 as an example; and pk enumerates the positions in R. In the deformable convolution, the regular grid R is augmented with offsets {Δpk | k = 1, ..., K}, where K = |R|. Equation (2) then becomes equation (3):

y(p1) = Σ_{pk ∈ R} w(pk) · x(p1 + pk + Δpk)    (3)

where p1 is each position on the output feature map y after the deformable convolution is incorporated, y(p1) is the deformed feature map value corresponding to p1, and Δpk are the offsets in the x and y directions (2K offset channels in total).
Denote the original image pixel values by V. The original convolution process is split into two paths: the first path learns the offsets in the x and y directions and outputs H × W × 2K values, where K = |R| is the number of pixels in the grid and 2K covers the offsets in both the x and y directions; the image pixel values at this point are denoted U. Once the offsets exist, the pixel-value indices of the image in U are added to V, and for each convolution window the window is no longer a regular sliding window but a translated one, the computation otherwise being consistent with ordinary convolution. The sampling positions thereby become irregular; since the offset Δpk is usually fractional, the corresponding values are computed by bilinear interpolation, as shown in equation (4):

x(p) = Σ_q G(q, p) · x(q)    (4)

where p denotes an arbitrary (fractional) position on the feature map (for equation (3), p = p1 + pk + Δpk), q enumerates all integral spatial positions in the feature map x, and G(·, ·) is the bilinear interpolation kernel.
After all pixel positions are obtained, a new feature map M is obtained, and M is fed into the Xception Module as input data.
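As an illustration of the deformable convolution unit just described (a parallel standard convolution learning the 2K offset channels, with fractional sampling positions resolved by bilinear interpolation), a minimal PyTorch sketch could look as follows; it relies on torchvision.ops.DeformConv2d, and the class and variable names are assumptions rather than the patent's own implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvUnit(nn.Module):
    """3x3 deformable convolution: a parallel standard convolution predicts the
    offsets (2K channels for K = 9 sampling points), which shift the sampling
    grid of the main convolution; fractional positions are handled internally
    by bilinear interpolation."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 * 3 * 3 = 18 offset channels: an x- and a y-shift for each of the K = 9 taps
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)        # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)                  # (B, 18, H, W), learned end to end
        return self.deform_conv(x, offsets)            # same spatial size as the input

# example: an RGB street view image of size 512x1024
feat = DeformableConvUnit(3, 64)(torch.randn(1, 3, 512, 1024))   # -> (1, 64, 512, 1024)
```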
The Xception Module serves as the basic residual structure of DeepLabV3+. Its residual learning unit extracts features through 3×3 depthwise separable convolutions, which perform the convolution channel by channel and then point by point; compared with ordinary convolution, the number of parameters and the computation cost are lower, which is the main reason for introducing the Xception Module. When the input and output features have different dimensions (numbers of feature-map channels), the dimension of the input feature is first adjusted by a 1×1 convolution and then added to (fused with) the output of the residual learning unit to obtain the final feature map. When the input and output features have the same dimension, the input feature is added to (fused with) the output feature map of the residual learning unit to obtain the finally extracted features. The Xception Module combines the idea of depthwise separable convolution with the Bottleneck residual structure of the basic residual module (a 1×1 convolution, a 3×3 convolution and another 1×1 convolution, where the 1×1 convolutions adjust the feature dimension and the 3×3 convolution extracts features): the depthwise separable convolution splits the standard convolution into a channel-wise convolution and a spatial convolution to reduce the parameters of model training, and the residual structure eliminates the gradient explosion problem caused by deepening the network hierarchy.
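A lightweight Xception-style residual block along the lines described here can be sketched as follows; this is an illustrative reconstruction assuming three 3×3 depthwise separable convolutions with stride 1 and padding 1 and a 1×1 convolution on the shortcut when the channel counts differ, since the exact layer configuration and normalization are not spelled out by the patent:

```python
import torch
import torch.nn as nn

class SeparableConv3x3(nn.Module):
    """Depthwise 3x3 followed by pointwise 1x1: channel-by-channel then
    point-by-point convolution, cheaper than a standard 3x3 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=1, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionModule(nn.Module):
    """Residual learning unit built from three separable convolutions; the
    shortcut is adjusted with a 1x1 convolution when dimensions differ."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv3x3(in_ch, out_ch),
            SeparableConv3x3(out_ch, out_ch),
            SeparableConv3x3(out_ch, out_ch),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)   # add (fuse) input and residual output
```

Splitting each 3×3 convolution into a depthwise and a pointwise part is what keeps the parameter count and computation cost low, as argued above.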
For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R, G and B channel components of the original input image output by the input layer, and its output end outputs the generated feature maps, whose set is denoted R1. The input end of the 2nd fusion enhancement Module receives R1, the output of the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R2. The input end of the 3rd fusion enhancement Module receives R2, the output of the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R3. Each feature map in R3 has width W, height H and C channels, so R3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the deformable convolution in series with the Xception Module fusion enhancement Module is repeated 3 times, down-sampling operations with strides 1 and 2 are performed respectively, yielding two new feature map sets denoted R4 and R5, where each feature map in R4 is (H, W, C) and each feature map in R5 is (H/2, W/2, 2C).
2_3) The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout this part of the network, and information is continuously exchanged among the multi-resolution features, so that the network has good semantic expression capability while maintaining high spatial resolution. The specifics are as follows:
after the first stage, the second stage generates two parallel networks S1And S2,S1Is composed of 3 lightweight Xception modules connected in series. The input feature layer and the output feature layer of each Xception Module have the same width and height, S1Input terminal receiving R4All characteristic maps of1The output end of (2) outputs the generated feature map, and the set of feature map is denoted as R6Wherein R is6Each feature map in (a) is (H, W, C); s2Is composed of 3 lightweight Xscene modules in series, the input feature layer and the output feature layer of each Xscene Module have the same width and height, S2Input terminal receiving R5All characteristic maps of2The output end outputs the generated feature map, and the feature map set is recorded as R7Wherein R is7Each characteristic map in (H/2, W/2, 2C); two parallel networks S passing through the second stage1And S2Respectively carrying out down-sampling operations with the step length of 1 and 2 to obtain five new feature diagram sets which are respectively marked as R8、R9、R10R11And R12. Wherein R is8Each feature map in (H, W, C), R9Each characteristic diagram in (H/2, W/2,2C), R10Each characteristic diagram in (H/2, W/2,2C), R11Each characteristic diagram in (H/4, W/4,4C), R12Each feature map in (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S3, S4 and S5. S3 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. R7 is partially up-sampled to obtain a new feature map set denoted R13, where each feature map in R13 is (H, W, C). Meanwhile the information fusion layer fuses the feature information of R8 and R13, and the set of feature maps generated by the fusion is denoted R14, where each feature map in R14 is (H, W, C). The input end of S3 receives all feature maps in R14, and its output end outputs the generated feature maps, whose set is denoted R15, where each feature map in R15 is (H, W, C). S4 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R9 and R10, and the fused set is denoted R16, where each feature map in R16 is (H/2, W/2, 2C). The input end of S4 receives all feature maps in R16, and its output end outputs the generated feature maps, whose set is denoted R17, where each feature map in R17 is (H/2, W/2, 2C). S5 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R11 and R12, and the fused set is denoted R18, where each feature map in R18 is (H/4, W/4, 4C). The input end of S5 receives all feature maps in R18, and its output end outputs the generated feature maps, whose set is denoted R19, where each feature map in R19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R17 and R19 generated by subnets S4 and S5 are up-sampled to the same size and scale as R15 generated by subnet S3; the results are denoted R20 and R21 respectively. R15, R20 and R21 are then input into the feature fusion layer for feature information fusion, generating a new feature map set R22, where each feature map in R22 is (H, W, C).
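The fusion at the end of the third stage (up-sampling R17 and R19 to the scale of R15 and fusing the three sets into R22) can be illustrated with the following sketch; the 1×1 channel-alignment convolutions and the element-wise addition used as the fusion operation are assumptions, since the patent does not spell out the exact fusion operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThirdStageFusion(nn.Module):
    """Up-sample R17 (2C, H/2, W/2) and R19 (4C, H/4, W/4) to the scale of
    R15 (C, H, W), align their channels with 1x1 convolutions, and fuse the
    three maps by element-wise addition into R22 of shape (C, H, W)."""
    def __init__(self, c: int):
        super().__init__()
        self.reduce17 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.reduce19 = nn.Conv2d(4 * c, c, kernel_size=1)

    def forward(self, r15, r17, r19):
        h, w = r15.shape[-2:]
        r20 = F.interpolate(self.reduce17(r17), size=(h, w), mode="bilinear", align_corners=False)
        r21 = F.interpolate(self.reduce19(r19), size=(h, w), mode="bilinear", align_corners=False)
        return r15 + r20 + r21                 # feature-information fusion -> R22

fusion = ThirdStageFusion(c=32)
r22 = fusion(torch.randn(1, 32, 128, 256),     # R15: (C, H, W)
             torch.randn(1, 64, 64, 128),      # R17: (2C, H/2, W/2)
             torch.randn(1, 128, 32, 64))      # R19: (4C, H/4, W/4) -> r22: (1, 32, 128, 256)
```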
The output layer is composed of 1 convolutional layer. Its input end receives the feature map set R22, and its output end outputs the semantic segmentation prediction map corresponding to the original input image; each semantic segmentation prediction map has width W, height H and C channels.
2_4) The original street view images {Jn(i, j)} in the training set and the corresponding semantic label images (semantic segmentation label gray-scale images) are used as original input images and fed into the constructed street view image semantic segmentation deep neural network model for training, obtaining a semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {Jn(i, j)} is denoted {Pn(i, j)}.
2_5) The value of the loss function between the set {Pn(i, j)} of semantic segmentation prediction maps corresponding to the original street view images in the training set and the set of one-hot encoded images obtained from the corresponding true semantic segmentation images is computed and recorded as Lossn. In a specific implementation, Lossn is obtained with the categorical cross-entropy.
2_6) Steps 2_4) and 2_5) are repeated M times to obtain the deep neural network classification training model, giving M × N loss function values. The smallest of these M × N loss function values is then found, and the weight vector and bias term corresponding to that smallest loss value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted Wbest and bbest respectively. This completes the training of the streetscape image semantic segmentation deep neural network classification model.
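Steps 2_4) to 2_6) can be summarized in a compressed training-loop sketch; PyTorch's CrossEntropyLoss takes class-index labels directly, which is equivalent to the categorical cross-entropy on the one-hot encoding, and the model, data loader and optimizer settings below are placeholders rather than values from the patent:

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs_m: int):
    """Repeat the forward/backward pass M times over the training set and
    retain the parameters (Wbest, bbest) of the lowest observed loss value."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                   # categorical cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_loss, best_state = float("inf"), None

    for epoch in range(num_epochs_m):
        for image, label in train_loader:               # label: (B, H, W) class indices
            image, label = image.to(device), label.to(device)
            logits = model(image)                        # (B, 19, H, W) prediction maps
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                  # track the minimum loss value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)                    # Wbest and bbest
    return model
```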
3) Testing the model: the test set is input into the trained model for testing.
3_1) Let {J′(i′, j′)} represent a road scene image to be semantically segmented, i.e., the test set, where 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, W′ denotes the width of {J′(i′, j′)}, H′ denotes its height, and J′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {J′(i′, j′)};
3_2) The R channel, G channel and B channel components of {J′(i′, j′)} are input into the trained streetscape image semantic segmentation deep neural network classification model, and a prediction is made using Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {J′(i′, j′)}, denoted {P′(i′, j′)}, where P′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {P′(i′, j′)}.
Through the steps, the image semantic segmentation enhanced by the deformable convolution fusion is realized.
Compared with the prior art, the invention has the beneficial effects that:
1) A lightweight Xception Module is introduced to replace the Bottleneck module in conventional network models. The Xception Module borrows the design idea of the Bottleneck, continuously deepening the network model through the residual learning unit to extract rich semantic features, and replaces the standard convolutions in the Bottleneck with depthwise separable convolutions, which reduces the model parameters and the computation cost while maintaining accuracy. At the same time, the multi-scale fusion of the network works better: after feature extraction and fusion by these modules, the interaction between high and low resolutions yields better output results.
2) The deep neural network constructed by the method adopts a high-resolution fused parallel network to reduce the loss of feature information throughout the network; by keeping the high resolution unchanged and fusing low-resolution feature map information throughout the process, effective depth information is retained to the greatest extent.
3) In the deep neural network constructed by the method, deformable convolution is integrated into the first stage of the hidden layer, so that the network model has better deformation modeling capability while maintaining high-resolution features during feature extraction; this alleviates the problems of small-scale target loss and discontinuous segmentation in semantic segmentation and gives the model better overall robustness.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
FIG. 2 is a block diagram of the structure of a street view image semantic segmentation neural network model constructed by the method of the present invention.
FIG. 3 is a schematic diagram of a frame of a street view image semantic segmentation neural network model according to the method of the present invention.
FIG. 4 shows a street view image to be semantically segmented, the corresponding real semantic segmentation image, and the predicted semantic segmentation image obtained according to an embodiment of the present invention;
wherein (a) is the selected street view image to be semantically segmented; (b) is the real semantic segmentation image corresponding to the street view image shown in (a); and (c) is the predicted semantic segmentation image obtained by applying the method to the street view image shown in (a).
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a streetscape image semantic segmentation method with deformable convolution fusion enhancement, which constructs a streetscape image semantic segmentation deep neural network model so that the network obtains more small-target feature information while segmenting large target objects in the streetscape image; this alleviates the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, gives the model better overall robustness, yields higher streetscape image processing accuracy and improves the image segmentation effect.
The general implementation block diagram of the streetscape image semantic segmentation method based on deformable convolution fusion enhancement provided by the invention is shown in fig. 1, and the method comprises a training stage and a testing stage.
FIG. 2 is a block diagram of the structure of a street view image semantic segmentation neural network model constructed by the method of the present invention. FIG. 3 is a schematic diagram of a frame of a street view image semantic segmentation neural network model according to the method of the present invention. The method for realizing the street view image semantic segmentation enhanced by the deformable convolution fusion mainly comprises the following steps:
1) First, the original image is input into the first deformable convolution layer in the first stage of the network for feature extraction (a high-resolution feature map);
2) the output initial features are input into the first Xception Module to obtain a deeper feature map;
3) steps 1) and 2) are repeated 3 times, i.e., each deformable convolution module is immediately followed by an Xception Module, extracting deep-level features repeatedly while enlarging the receptive field;
4) down-sampling operations with strides 1 and 2 are performed respectively; one part keeps the high resolution, and the other part, a lower-resolution map in parallel, is input into the 3 Xception Modules repeated in each branch of the second stage;
5) after feature information fusion in the feature fusion layer, the features are input again into the 3 Xception Modules repeated in each branch of the third stage, down-sampling operations with strides 1 and 2 are performed, then up-sampling and feature fusion are carried out, and a high-resolution feature map is output;
6) finally, after one convolution, the number of channels of the output features is adjusted to the number of categories to be segmented, and activation through the classifier function yields the predicted segmented image.
In specific implementation, the streetscape image semantic segmentation neural network model training stage process of the method comprises the following specific steps:
1. Constructing an image training set: select N original street view images and the corresponding semantic segmentation label gray-scale images to form the training set, and denote the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}. The original street view images are RGB color images and the corresponding label images are gray-scale images; N is a positive integer with N ≥ 500, e.g. 1000; n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Jn(i, j)} and H denotes its height, e.g. W = 1024 and H = 512; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}. Meanwhile, in order to evaluate the designed model properly, the true segmentation label images corresponding to the original street view images and the corresponding semantic segmentation label gray-scale images in the training set are used as the training targets.
Here the original street view images are taken directly from the Cityscapes dataset, i.e. the 2975 training images of the public Cityscapes dataset.
2. Constructing the deep neural network: the deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer consists of two parts: the high-resolution input part of the network, formed by three repeated deformable convolution modules each connected in series with an Xception Module, and a two-stage multi-branch parallel fused Xception Module network.
2_1 for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image, and the output end of the input layer outputs the R-channel component, the G-channel component and the B-channel component of the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be W, and the height is required to be H;
2_2 In the first part of the hidden layer, a fusion module is constructed by connecting the deformable convolution module in series with the Xception Module; this fusion module is repeatedly stacked three times, and a number of feature maps are generated in sequence by the three fusion enhancement modules;
the hidden layer first stage is composed of three fusion enhancement modules, and each fusion enhancement Module is mainly composed of a deformable convolution Module and a lightweight Xception Module connected in series. After all pixel positions are obtained through the first deformable convolution, a new picture M is obtained, and the new picture M is used as input data and is input into the Xception Module.
The first stage is the first part. For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R, G and B channel components of the original input image output by the input layer, and its output end outputs the generated feature maps, whose set is denoted R1. The input end of the 2nd fusion enhancement Module receives R1, the output of the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R2. The input end of the 3rd fusion enhancement Module receives R2, the output of the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R3. Each feature map in R3 has width W, height H and C channels, so R3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the deformable convolution in series with the Xception Module fusion enhancement Module is repeated 3 times, down-sampling operations with strides 1 and 2 are performed respectively, yielding two new feature map sets denoted R4 and R5, where each feature map in R4 is (H, W, C) and each feature map in R5 is (H/2, W/2, 2C).
2_3 The second and third stages form the second part of the hidden layer. High-resolution features are kept throughout this part of the network, and information is continuously exchanged among the multi-resolution features, so that the network has good semantic expression capability while maintaining high spatial resolution. The specifics are as follows:
After the first stage, the second stage generates two parallel networks S1 and S2. S1 consists of 3 lightweight Xception Modules connected in series, where each Xception Module consists of 3 convolution layers of 3×3 depthwise separable convolutions with stride 1 and padding 1. The input and output feature layers of each Xception Module have the same width and height. The input end of S1 receives all feature maps in R4, and its output end outputs the generated feature maps, whose set is denoted R6, where each feature map in R6 is (H, W, C). S2 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. The input end of S2 receives all feature maps in R5, and its output end outputs the generated feature maps, whose set is denoted R7, where each feature map in R7 is (H/2, W/2, 2C). The two parallel networks S1 and S2 of the second stage then perform down-sampling operations with strides 1 and 2 respectively, yielding five new feature map sets denoted R8, R9, R10, R11 and R12, where each feature map in R8 is (H, W, C), each in R9 is (H/2, W/2, 2C), each in R10 is (H/2, W/2, 2C), each in R11 is (H/4, W/4, 4C), and each in R12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S3, S4 and S5. S3 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. R7 is partially up-sampled to obtain a new feature map set denoted R13, where each feature map in R13 is (H, W, C). Meanwhile the information fusion layer fuses the feature information of R8 and R13, and the set of feature maps generated by the fusion is denoted R14, where each feature map in R14 is (H, W, C). The input end of S3 receives all feature maps in R14, and its output end outputs the generated feature maps, whose set is denoted R15, where each feature map in R15 is (H, W, C). S4 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R9 and R10, and the fused set is denoted R16, where each feature map in R16 is (H/2, W/2, 2C). The input end of S4 receives all feature maps in R16, and its output end outputs the generated feature maps, whose set is denoted R17, where each feature map in R17 is (H/2, W/2, 2C). S5 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R11 and R12, and the fused set is denoted R18, where each feature map in R18 is (H/4, W/4, 4C). The input end of S5 receives all feature maps in R18, and its output end outputs the generated feature maps, whose set is denoted R19, where each feature map in R19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R17 and R19 generated by subnet S4 and subnet S5 need to be up-sampled to the same size and scale as R15 generated by subnet S3; the results are denoted R20 and R21 respectively. R15, R20 and R21 are then input into the feature fusion layer for feature information fusion, generating a new feature map set R22, where each feature map in R22 is (H, W, C).
For the output layer, which is composed of 1 convolutional layer, the input end of the output layer receives the feature map set R22The output end of the output layer outputs a semantic segmentation prediction graph corresponding to the original input image; wherein, the width of each semantic segmentation prediction graph is W, and the height of each semantic segmentation prediction graph is H.
2_4 The original street view images and the corresponding semantic segmentation label gray-scale images in the training set are used as original input images and fed into the deep neural network for training, obtaining a semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {Jn(i, j)} is denoted {Pn(i, j)}.
2_5 The value of the loss function between the set {Pn(i, j)} of semantic segmentation prediction maps corresponding to the original street view images in the training set and the set of one-hot encoded images obtained from the corresponding true semantic segmentation images is computed and recorded as Lossn; Lossn is obtained using the categorical cross-entropy.
2_6 Steps 2_4 and 2_5 are repeated M times to obtain the deep neural network classification training model, giving M × N loss function values. The smallest of these M × N loss function values is then found, and the weight vector and bias term corresponding to that smallest loss value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted Wbest and bbest respectively. In this example, M is 484.
3. Testing the model: the test set is input into the trained model for testing. The specific steps of the test stage are as follows:
3_1 Let {J′(i′, j′)} represent a road scene image to be semantically segmented, where 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, W′ denotes the width of {J′(i′, j′)}, H′ denotes its height, and J′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {J′(i′, j′)};
3_2 The R channel, G channel and B channel components of {J′(i′, j′)} are input into the trained deep neural network classification model, and a prediction is made using Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {J′(i′, j′)}, denoted {P′(i′, j′)}, where P′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {P′(i′, j′)}.
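Step 3_2 can be illustrated with a short inference sketch: the RGB test image is passed through the trained model carrying the best weights, and the per-pixel argmax over the 19 class scores gives the predicted semantic segmentation image; the file path and preprocessing here are placeholders rather than details from the patent:

```python
import numpy as np
import torch
from PIL import Image

@torch.no_grad()
def predict(model, image_path: str) -> np.ndarray:
    """Return an (H', W') array of predicted class indices for one test image."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).to(device)   # (1, 3, H', W')
    logits = model(x)                                  # (1, 19, H', W') class scores per pixel
    return logits.argmax(dim=1).squeeze(0).cpu().numpy()                 # predicted label map

# pred = predict(trained_model, "cityscapes/test/sample.png")
```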
The feasibility and effectiveness of the method of the invention is further verified below.
A deep neural network architecture is built using the Python-based deep learning library PyTorch 1.2. The Cityscapes test set is adopted to analyse the segmentation effect of the street view images predicted by the method. The segmentation performance of the predicted semantic segmentation images is evaluated with 3 objective parameters commonly used to evaluate semantic segmentation methods, namely Mean Intersection over Union (MIoU), Pixel Accuracy (PA) and Mean Pixel Accuracy (MPA), whose definitions are given below.
Definition 1: MIoU (Mean Intersection over Union) is the standard metric for semantic segmentation. It computes the ratio of the intersection and the union of the predicted and ground-truth sets. With k + 1 classes and pij denoting the number of pixels of class i predicted as class j, the formula is as follows:

MIoU = (1 / (k + 1)) · Σ_{i=0}^{k} pii / (Σ_{j=0}^{k} pij + Σ_{j=0}^{k} pji − pii)

Definition 2: Pixel Accuracy (PA) represents the proportion of correctly labelled pixels among all pixels, as shown in the following equation:

PA = Σ_{i=0}^{k} pii / Σ_{i=0}^{k} Σ_{j=0}^{k} pij

Definition 3: Mean Pixel Accuracy (MPA) is an improvement of PA; it computes the proportion of correctly classified pixels within each class and then averages over all classes, as follows:

MPA = (1 / (k + 1)) · Σ_{i=0}^{k} pii / Σ_{j=0}^{k} pij
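The three evaluation indices can be computed from a single confusion matrix accumulated over the test set, as in the following sketch; the 19-class setting follows the text, while the ignore value of 255 is a convention of the Cityscapes tooling rather than something stated in the patent:

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, k: int = 19, ignore: int = 255) -> np.ndarray:
    """Accumulate a (k, k) matrix: rows are ground-truth classes, columns are predictions."""
    mask = gt != ignore
    idx = k * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=k * k).reshape(k, k)

def metrics(cm: np.ndarray) -> dict:
    tp = np.diag(cm).astype(float)
    pa = tp.sum() / cm.sum()                                   # pixel accuracy
    per_class_acc = tp / np.maximum(cm.sum(axis=1), 1)
    mpa = per_class_acc.mean()                                 # mean pixel accuracy
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return {"PA": pa, "MPA": mpa, "MIoU": iou.mean()}          # mean intersection over union

# illustrative use (predict and load_label are assumed helpers):
# cm = sum(confusion_matrix(predict(model, p), load_label(p)) for p in test_images)
# print(metrics(cm))
```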
the method is utilized to predict each street view image in the Cityscapes test set to obtain a predicted semantic segmentation image corresponding to each street view image, and the higher the average cross union reflecting the semantic segmentation effect of the method is than the value of MIoU, the higher the pixel accuracy PA and the higher the average pixel accuracy MPA, the higher the effectiveness and the higher the prediction accuracy are, wherein the average cross union is shown in Table 1.
TABLE 1 mIoU values of the method of the invention on the Cityscapes dataset
As can be seen from the data listed in Table 1, the street view images segmented by the method of the present invention show a good segmentation effect, which indicates that it is feasible and effective to obtain the predicted semantic segmentation image corresponding to a street view image with the method of the present invention. The mean intersection over union MIoU, pixel accuracy PA and mean pixel accuracy MPA of the method are shown in Table 2, and the results show that the segmentation effect of the method ranks at the front of existing segmentation models.
TABLE 2 computational Performance on the Cityscapes dataset
In FIG. 4, (a) shows the selected street view image to be semantically segmented; (b) shows the real semantic segmentation image corresponding to the street view image shown in (a); and (c) shows the predicted semantic segmentation image obtained by applying the method of the present invention to the street view image shown in (a). Comparing (b) and (c) in FIG. 4 shows that the predicted semantic segmentation image obtained by the method of the present invention has higher segmentation precision and is close to the real semantic segmentation image.
It is noted that the disclosed examples are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. A deformable convolution fusion enhanced streetscape image semantic segmentation method comprises a training stage and a testing stage, and specifically comprises the following steps:
1) constructing an image training set, wherein the image training set comprises original street view images and the corresponding semantic label images;
selecting N original street view images and the corresponding semantic segmentation label gray-scale images, i.e. semantic label images, to form the image training set, N being a positive integer; recording the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}, where n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, W denoting the width of {Jn(i, j)} and H denoting its height; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}; recording the true segmentation label image corresponding to the semantic label image as the training target, then processing it into a one-hot encoded image, and denoting the set formed by these one-hot encoded images as the training target set;
2) Constructing and training a streetscape image semantic segmentation deep neural network model:
the streetscape image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is a multi-resolution fusion enhancement network introducing deformable convolution and comprises a first stage, a second stage and a third stage;
the first stage connects a deformable convolution in series with an Xception Module to form a fusion sub-network A, and sub-network A is repeated three times in series to obtain deeper semantic feature information; the Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+;
the second stage is a two-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series and is used for feature extraction and feature fusion; the third stage is a three-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series;
2_1) the input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the components to a hidden layer;
2_2) the first stage of the hidden layer comprises fusion enhancement Modules, each constructed by connecting a deformable convolution Module in series with an Xception Module; the three fusion enhancement Modules generate a series of feature maps in sequence; the offsets required by the deformable convolution are obtained from the output of a parallel standard convolution and are then applied to the sampling positions of the convolution kernel to realize the deformable convolution;
in the deformable convolution, the regular sampling grid R is augmented with offsets {Δp_k | k = 1, ..., K}, where K = |R|, as expressed by formula (3):

y(p_1) = Σ_{p_k ∈ R} w(p_k) · x(p_1 + p_k + Δp_k)    (3)

wherein p_1 is each position on the output feature map y after the deformable convolution, y(p_1) is the value of the deformed feature map at position p_1, w(p_k) is the convolution weight at the k-th sampling position p_k of the grid R, x is the input feature map, and Δp_k is the learned offset at the k-th position, with components in both the x and y directions (2K offset values in total);
since the offsets Δp_k are generally fractional, bilinear interpolation is then used to compute the feature values at the offset sampling positions from the surrounding integer positions; once the values at all sampling positions have been obtained, a new feature map M is produced, and M is input into the Xception Module as its input data;
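A minimal PyTorch sketch of this offset-then-sample step is given below; it is an illustrative reconstruction rather than the patented implementation, and it assumes torchvision's deform_conv2d operator, which performs the bilinear sampling of formula (3) internally:

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformableConvBlock(nn.Module):
        """Deformable 3x3 convolution: a parallel standard conv predicts the
        2*K offsets (K = 9 sampling points), which shift the kernel's sampling
        grid before the main convolution is applied."""
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            k = kernel_size * kernel_size
            # parallel standard convolution that outputs the offsets (delta p_k in x and y)
            self.offset_conv = nn.Conv2d(in_ch, 2 * k, kernel_size, padding=padding)
            nn.init.zeros_(self.offset_conv.weight)   # start from the regular grid
            nn.init.zeros_(self.offset_conv.bias)
            # weights of the deformable convolution itself
            self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
            nn.init.kaiming_uniform_(self.weight, a=1)
            self.padding = padding

        def forward(self, x):
            offset = self.offset_conv(x)              # (N, 2K, H, W)
            # deform_conv2d samples x at p_1 + p_k + delta p_k with bilinear interpolation
            return deform_conv2d(x, offset, self.weight, padding=self.padding)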
the residual learning unit of the Xception Module performs channel-by-channel (depthwise) and point-by-point (pointwise) convolution and feature extraction through depthwise-separable convolutions, yielding a feature map;
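A compact sketch of such a depthwise-separable residual unit follows; the exact block layout is an assumption, and the 1×1 projection for mismatched channel dimensions follows claim 4:

    import torch.nn as nn

    class SeparableConv(nn.Module):
        """Depthwise (channel-by-channel) 3x3 conv followed by pointwise 1x1 conv."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.bn(self.pointwise(self.depthwise(x))))

    class XceptionResidualUnit(nn.Module):
        """Three stacked separable convolutions with a residual connection;
        a 1x1 convolution adjusts the input dimension when it differs from the output."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                SeparableConv(in_ch, out_ch),
                SeparableConv(out_ch, out_ch),
                SeparableConv(out_ch, out_ch),
            )
            self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

        def forward(self, x):
            return self.body(x) + self.skip(x)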
for the first stage, the input end of the 1st fusion enhancement Module (a deformable convolution connected in series with an Xception Module) receives the channel components of the original input image output by the input layer, and its output end outputs the generated feature map set R1; the input end of the 2nd fusion enhancement Module receives R1 and its output end outputs the generated feature map set R2; the input end of the 3rd fusion enhancement Module receives R2 and its output end outputs the generated feature map set R3, whose feature maps have size (H, W, C), where C is the number of channels;
after the first stage has extracted features by repeating the deformable-convolution-in-series-with-Xception-Module fusion enhancement Module three times, down-sampling operations are performed to obtain two new feature map sets, denoted R4 and R5 respectively, wherein each feature map in R4 has size (H, W, C) and each feature map in R5 has size (H/2, W/2, 2C);
2_3) the second stage and the third stage of the hidden layer exchange information among the multi-resolution features, so that the network maintains a high spatial resolution while retaining good semantic expression capability;
the second stage generates two parallel networks S1 and S2; S1 is composed of 3 lightweight Xception Modules connected in series; the input end of S1 receives R4 and its output end outputs the generated feature map set R6, wherein each feature map in R6 has size (H, W, C); S2 is composed of 3 lightweight Xception Modules connected in series; the input end of S2 receives R5 and its output end outputs the generated feature map set R7, wherein each feature map in R7 has size (H/2, W/2, 2C); the two parallel networks S1 and S2 then perform down-sampling operations respectively to obtain five new feature map sets, denoted R8, R9, R10, R11 and R12, wherein each feature map in R8 has size (H, W, C), each feature map in R9 has size (H/2, W/2, 2C), each feature map in R10 has size (H/2, W/2, 2C), each feature map in R11 has size (H/4, W/4, 4C), and each feature map in R12 has size (H/4, W/4, 4C);
the third stage generates three parallel networks S3, S4 and S5; S3 is composed of 3 lightweight Xception Modules connected in series; R7 is partially up-sampled to obtain a new feature map set denoted R13, wherein each feature map in R13 has size (H, W, C); R8 and R13 are fused in the feature information layer, and the feature map set generated by the fusion is denoted R14, wherein each feature map in R14 has size (H, W, C); the input end of S3 receives R14 and its output end outputs the generated feature map set R15, wherein each feature map in R15 has size (H, W, C); S4 is composed of 3 lightweight Xception Modules connected in series; R9 and R10 are fused in the feature information layer, and the fusion generates the feature map set R16, wherein each feature map in R16 has size (H/2, W/2, 2C); the input end of S4 receives R16 and its output end outputs the generated feature map set R17, wherein each feature map in R17 has size (H/2, W/2, 2C); S5 is composed of 3 lightweight Xception Modules connected in series; R11 and R12 are fused in the feature information layer, and the generated feature map set is denoted R18, wherein each feature map in R18 has size (H/4, W/4, 4C); the input end of S5 receives R18 and its output end outputs the generated feature map set R19, wherein each feature map in R19 has size (H/4, W/4, 4C); at the end of the third stage, the feature maps R17 and R19 generated by S4 and S5 are up-sampled to the same size as R15, and the up-sampled sets are denoted R20 and R21 respectively; R15, R20 and R21 are then input into a feature fusion layer for feature information fusion, generating a new feature map set R22, wherein each feature map in R22 has size (H, W, C);
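A simplified sketch of this cross-resolution exchange is given below; it is illustrative only, and the fusion operator (element-wise addition), the 1×1 channel-reduction convolutions, and the example channel width C = 32 are assumptions made to match the sizes stated above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fuse(*feature_maps):
        """Feature-information-layer fusion: resize every map to the first map's
        spatial size and sum them (element-wise addition is assumed here)."""
        target = feature_maps[0].shape[-2:]
        out = feature_maps[0]
        for f in feature_maps[1:]:
            if f.shape[-2:] != target:
                f = F.interpolate(f, size=target, mode='bilinear', align_corners=False)
            out = out + f
        return out

    # Third-stage fusion, following the sizes in the claim:
    # R15: (H, W, C), R17: (H/2, W/2, 2C), R19: (H/4, W/4, 4C)
    N, C, H, W = 1, 32, 128, 256
    r15 = torch.randn(N, C, H, W)
    r17 = torch.randn(N, 2 * C, H // 2, W // 2)
    r19 = torch.randn(N, 4 * C, H // 4, W // 4)

    # 1x1 convolutions bring the up-sampled branches back to C channels before fusion
    to_c_from_2c = nn.Conv2d(2 * C, C, 1)
    to_c_from_4c = nn.Conv2d(4 * C, C, 1)
    r20 = F.interpolate(to_c_from_2c(r17), size=(H, W), mode='bilinear', align_corners=False)
    r21 = F.interpolate(to_c_from_4c(r19), size=(H, W), mode='bilinear', align_corners=False)
    r22 = fuse(r15, r20, r21)          # (N, C, H, W)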
the output layer is composed of 1 convolution layer; the input end of the output layer receives the feature map set R22, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image, wherein each semantic segmentation prediction map has width W, height H and C channels;
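For illustration only, the output layer can be a single convolution that maps the fused feature maps to per-class score maps; the 1×1 kernel size, the example channel width C = 32 carried over from the sketch above, and the use of the 19 Cityscapes classes of claim 2 are assumptions:

    import torch.nn as nn

    NUM_CLASSES = 19                      # street-scene classes, per claim 2 (assumed here)
    C = 32                                # example channel width, matching the sketch above
    output_layer = nn.Conv2d(C, NUM_CLASSES, kernel_size=1)
    # logits = output_layer(r22)          # (N, NUM_CLASSES, H, W): one score map per class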
2_4) inputting the image training set into the constructed street view image semantic segmentation deep neural network model for training to obtain the semantic segmentation prediction map corresponding to each original street view image, and denoting the set formed by the semantic segmentation prediction maps as {P_n(i,j)};
2_5) calculating the loss function value Loss between {P_n(i,j)} and the corresponding set of one-hot coded images {Y_n^one(i,j)};
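A minimal sketch of this loss computation, assuming the per-pixel categorical cross entropy of claim 5 and PyTorch's built-in criterion:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()      # categorical cross entropy over the class dimension

    # logits: (N, num_classes, H, W) network predictions; labels: (N, H, W) class indices
    def segmentation_loss(logits, labels):
        """Per-pixel cross entropy between prediction maps and label maps.
        (CrossEntropyLoss takes integer class indices, which is equivalent to
        the one-hot formulation described in the claim.)"""
        return criterion(logits, labels)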
2_6) repeatedly executing step 2_4) and step 2_5) M times to obtain the deep neural network classification training model, yielding M × N loss function values in total; finding the loss function value with the minimum value among the M × N loss function values; taking the weight vector and the bias term corresponding to the minimum loss function value as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted W_best and b_best respectively; the training of the street view image semantic segmentation deep neural network classification model is then finished;
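An illustrative training loop for this step is sketched below; the optimizer, learning rate, and checkpointing scheme are assumptions, and only the idea of keeping the weights with the smallest loss value comes from the claim:

    import copy
    import torch

    def train(model, loader, criterion, epochs_M):
        """Repeat the forward/loss steps M times over the N training images and
        keep the parameters (W_best, b_best) that produced the smallest loss."""
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        best_loss, best_state = float('inf'), None
        for epoch in range(epochs_M):
            for images, labels in loader:            # N images per pass -> M*N loss values
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
                if loss.item() < best_loss:          # track the minimum loss value
                    best_loss = loss.item()
                    best_state = copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)            # restore W_best / b_best
        return model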
3) testing the model: inputting the test set into the trained model for testing;
3_1) let {I(i',j')} denote a road scene image to be semantically segmented, i.e. the test set, wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I(i',j')}, H' denotes its height, and I(i',j') denotes the pixel value of the pixel point at coordinate position (i',j');
3_2) inputting the channel components of {I(i',j')} into the trained street view image semantic segmentation deep neural network classification model and making a prediction using W_best and b_best, thereby obtaining the predicted semantic segmentation image corresponding to {I(i',j')}, denoted {P(i',j')}, wherein P(i',j') denotes the pixel value of the pixel point at coordinate position (i',j'); the deformable-convolution-enhanced image semantic segmentation is thus realized.
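A short inference sketch for this test step, assuming a PyTorch model such as the sketches above and a per-pixel argmax over the class scores:

    import torch

    @torch.no_grad()
    def predict(model, image):
        """image: (3, H', W') RGB tensor -> per-pixel class map of shape (H', W')."""
        model.eval()                                  # use the trained W_best / b_best
        logits = model(image.unsqueeze(0))            # (1, num_classes, H', W')
        return logits.argmax(dim=1).squeeze(0)        # predicted semantic segmentation image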
2. The method for semantic segmentation of street view image with deformable convolution fusion enhancement as claimed in claim 1, wherein in step 1), the object classes in the street view image are classified into 19 classes.
3. The method for semantic segmentation of street view images enhanced by deformable convolution according to claim 1, wherein the bilinear interpolation in step 2_2) is calculated as expressed by formula (4):

x(p) = Σ_q G(q, p) · x(q)    (4)

wherein p represents a (possibly fractional) position on the feature map, q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
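A small sketch of this interpolation kernel follows; it is a direct transcription of formula (4), and the separable form of G used here is the standard bilinear kernel, stated as an assumption:

    import math
    import torch

    def bilinear_sample(x, p_y, p_x):
        """x: (H, W) feature map; (p_y, p_x): fractional position p.
        Computes x(p) = sum_q G(q, p) * x(q) with the separable kernel
        G(q, p) = g(q_y, p_y) * g(q_x, p_x), g(a, b) = max(0, 1 - |a - b|);
        only the four integer neighbours of p contribute."""
        H, W = x.shape
        y0, x0 = math.floor(p_y), math.floor(p_x)
        value = 0.0
        for qy in (y0, y0 + 1):
            for qx in (x0, x0 + 1):
                if 0 <= qy < H and 0 <= qx < W:
                    g = max(0.0, 1 - abs(qy - p_y)) * max(0.0, 1 - abs(qx - p_x))
                    value = value + g * x[qy, qx].item()
        return value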
4. The streetscape image semantic segmentation method based on deformable convolution fusion enhancement as claimed in claim 1, wherein the residual learning unit of the Xception Module extracts features through 3 depthwise-separable convolutions with 3×3 kernels; when the input feature and the output feature have different dimensions, the dimension of the input feature is adjusted through a 1×1 convolution and then added to the output feature of the residual learning unit to obtain the feature map; when the input feature and the output feature have the same dimension, the input feature is added directly to the output feature map of the residual learning unit to obtain the feature map.
5. The method as claimed in claim 1, wherein in step 2_5) the loss function value Loss between {P_n(i,j)} and {Y_n^one(i,j)} is obtained by using categorical cross entropy.
6. The deformable convolution fusion enhanced streetscape image semantic segmentation method as claimed in claim 1, wherein the deep neural network model is constructed by using the Python-based deep learning library PyTorch 1.2.
7. The method for street view image semantic segmentation enhanced by deformable convolution fusion as claimed in claim 1, wherein the Cityscapes test set is specifically adopted, and the mean intersection-over-union, pixel accuracy and mean pixel accuracy are adopted as indexes to verify the street view image segmentation effect of the deformable convolution fusion enhanced street view image semantic segmentation.
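A brief sketch of these three evaluation indexes computed from a class confusion matrix is given below; it is an illustrative implementation, and the ignore-label handling normally used on Cityscapes is omitted for simplicity:

    import numpy as np

    def segmentation_metrics(pred, label, num_classes=19):
        """pred, label: integer class maps of equal shape.
        Returns (MIoU, PA, MPA) from the class confusion matrix."""
        mask = (label >= 0) & (label < num_classes)
        cm = np.bincount(num_classes * label[mask] + pred[mask],
                         minlength=num_classes ** 2).reshape(num_classes, num_classes)
        tp = np.diag(cm)
        pa = tp.sum() / cm.sum()                                   # pixel accuracy
        with np.errstate(divide='ignore', invalid='ignore'):
            per_class_acc = tp / cm.sum(axis=1)                    # per-class pixel accuracy
            iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)      # IoU per class
        mpa = np.nanmean(per_class_acc)                            # mean pixel accuracy
        miou = np.nanmean(iou)                                     # mean intersection-over-union
        return miou, pa, mpa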
CN202011291950.6A 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method Active CN112396607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Publications (2)

Publication Number Publication Date
CN112396607A true CN112396607A (en) 2021-02-23
CN112396607B CN112396607B (en) 2023-06-16

Family

ID=74606378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011291950.6A Active CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN112396607B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
CN110795976A (en) * 2018-08-03 2020-02-14 华为技术有限公司 Method, device and equipment for training object detection model
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution
CN111401436A (en) * 2020-03-13 2020-07-10 北京工商大学 Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG Jianhui; CHENG Siyu; LIU Yanju; YU Yang: "Semantic segmentation of UAV ground-object scenes based on deep convolutional networks", Journal of Shenyang Ligong University, no. 06 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313105A (en) * 2021-04-12 2021-08-27 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113313105B (en) * 2021-04-12 2022-07-01 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113205520A (en) * 2021-04-22 2021-08-03 华中科技大学 Method and system for semantic segmentation of image
CN113205520B (en) * 2021-04-22 2022-08-05 华中科技大学 Method and system for semantic segmentation of image
CN113420770A (en) * 2021-06-21 2021-09-21 梅卡曼德(北京)机器人科技有限公司 Image data processing method, image data processing device, electronic equipment and storage medium
CN113326799A (en) * 2021-06-22 2021-08-31 长光卫星技术有限公司 Remote sensing image road extraction method based on EfficientNet network and direction learning
CN113657388A (en) * 2021-07-09 2021-11-16 北京科技大学 Image semantic segmentation method fusing image super-resolution reconstruction
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113554733A (en) * 2021-07-28 2021-10-26 北京大学 Language-based decoupling condition injection gray level image colorization method
CN113807356B (en) * 2021-07-29 2023-07-25 北京工商大学 End-to-end low-visibility image semantic segmentation method
CN113807356A (en) * 2021-07-29 2021-12-17 北京工商大学 End-to-end low visibility image semantic segmentation method
CN113608223A (en) * 2021-08-13 2021-11-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113608223B (en) * 2021-08-13 2024-01-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113762263A (en) * 2021-08-17 2021-12-07 慧影医疗科技(北京)有限公司 Semantic segmentation method and system for small-scale similar structure
CN115294488B (en) * 2022-10-10 2023-01-24 江西财经大学 AR rapid object matching display method
CN115294488A (en) * 2022-10-10 2022-11-04 江西财经大学 AR rapid object matching display method
CN115393725A (en) * 2022-10-26 2022-11-25 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115393725B (en) * 2022-10-26 2023-03-07 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115620001A (en) * 2022-12-15 2023-01-17 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm
CN115620001B (en) * 2022-12-15 2023-04-07 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm

Also Published As

Publication number Publication date
CN112396607B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111339903B (en) Multi-person human body posture estimation method
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN110728192B (en) High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN111401436B (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112258526A (en) CT (computed tomography) kidney region cascade segmentation method based on dual attention mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN111768415A (en) Image instance segmentation method without quantization pooling
CN112651423A (en) Intelligent vision system
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN113076947A (en) RGB-T image significance detection system with cross-guide fusion
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN114092824A (en) Remote sensing image road segmentation method combining intensive attention and parallel up-sampling
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant