CN112396607B - Deformable convolution fusion enhanced street view image semantic segmentation method - Google Patents

Deformable convolution fusion enhanced street view image semantic segmentation method

Info

Publication number
CN112396607B
CN112396607B (application CN202011291950.6A)
Authority
CN
China
Prior art keywords
feature
image
semantic segmentation
street view
view image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011291950.6A
Other languages
Chinese (zh)
Other versions
CN112396607A (en)
Inventor
张珣
秦晓海
刘宪圣
张浩轩
江东
张迎春
付晶莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS, Beijing Technology and Business University filed Critical Institute of Geographic Sciences and Natural Resources of CAS
Priority to CN202011291950.6A priority Critical patent/CN112396607B/en
Publication of CN112396607A publication Critical patent/CN112396607A/en
Application granted granted Critical
Publication of CN112396607B publication Critical patent/CN112396607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration by the use of local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a deformable convolution fusion enhanced street view image semantic segmentation method comprising a training stage and a testing stage. A street view image semantic segmentation deep neural network model is constructed so that the network obtains more feature information about small targets while segmenting the large target objects of a street view image. This alleviates the problems of small-scale target loss and discontinuous segmentation during street view image semantic segmentation, improves the image segmentation effect, gives the model better overall robustness, and yields higher street view image processing accuracy.

Description

Deformable convolution fusion enhanced street view image semantic segmentation method
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and in particular relates to a deformable convolution fusion enhanced street view image semantic segmentation method.
Background
Image semantic segmentation is an important branch of computer vision in the field of artificial intelligence, and an important link in image understanding and analysis in machine vision. Image semantic segmentation is the task of accurately classifying each pixel in an image into its category so that the result is consistent with the visual content of the image itself; for this reason the task of image semantic segmentation is also known as pixel-level image classification. At present, semantic segmentation has been widely applied in scenes such as automatic driving and unmanned aerial vehicle landing point judgment.
Convolutional neural networks have been successful in image classification, localization and scene understanding. With the proliferation of tasks such as augmented reality and automatic driving of vehicles, many researchers have turned their attention to scene understanding, where one of the main steps is semantic segmentation, i.e. classifying each pixel in a given image. Semantic segmentation is of significance in mobile and robot-related applications.
Unlike image classification, image semantic segmentation is more difficult because it requires not only global context information but also fine local information to determine the class of each pixel. A backbone is therefore often used to extract the more global features, and feature resolution reconstruction is then performed in combination with shallow features from the backbone to restore the original image size. The resolution of the feature maps thus first decreases and then increases; the former part is generally referred to as the encoding network and the latter as the decoding network. Classical semantic segmentation methods include fully convolutional networks (Fully Convolutional Network, FCN) and the DeepLab series of networks, which perform well in terms of pixel accuracy, mean pixel accuracy and intersection-over-union ratio on road scene segmentation databases. Conventional networks downsample, and upsampling here means recovering a small-sized high-dimensional feature map for pixel prediction so as to obtain classification information for each point. Although FCN performs upsampling, it cannot recover all the lost information without loss; the DeepLab series adds atrous (dilated) convolution on this basis to expand the receptive field, which mitigates the information loss, but the problem is still not well controlled. Therefore, the loss of information in these methods affects the accuracy of image semantic segmentation, and the segmentation effect is especially poor for the recognition of small target objects.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a deformable convolution fusion enhanced street view image semantic segmentation method, which constructs a street view image semantic segmentation deep neural network model so that the network model can obtain more small-target feature information when segmenting large target objects in a street view image. This alleviates the problems of small-scale target loss and discontinuous segmentation in street view image semantic segmentation, gives the model better overall robustness and higher street view image processing accuracy, and improves the image segmentation effect.
The technical scheme adopted for solving the technical problems is as follows:
a deformable convolution fusion enhanced street view image semantic segmentation method is characterized by comprising two processes of a training stage and a testing stage, and comprises the following steps:
1) Constructing an image training set: image data comprising original images and the corresponding semantic label images are input into the constructed network to participate in training.
1_1) Select N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps to form the training set. The n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}. The original street view image is an RGB color image and the corresponding label image is a gray-scale image; N is a positive integer with N ≥ 500; n is a positive integer with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes its height; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}. Meanwhile, in order to evaluate the designed model well, the real segmentation label image corresponding to each original street view image and its semantic label image in the training set is taken as the training target and denoted {G_n(i,j)}. Then, using the one-hot encoding technique, the semantic segmentation label gray-scale map corresponding to each original street view image in the training set is processed into a one-hot encoded image. In the specific implementation, the street view image objects are divided into 19 categories, and the set formed by the one-hot processed real semantic segmentation label images {G_n(i,j)} corresponding to the original street view images is denoted T.
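As an illustration of the one-hot encoding step above, the following sketch (written in PyTorch, the library named later in the embodiment; the helper name and the assumption that the label gray-scale map stores a class index from 0 to 18 per pixel are ours, not the patent's) converts a label map into a 19-channel one-hot tensor.

```python
import torch
import torch.nn.functional as F

NUM_CLASSES = 19  # the 19 street view object categories used in this method

def one_hot_encode(label_map: torch.Tensor) -> torch.Tensor:
    """Convert an (H, W) map of class indices into a (NUM_CLASSES, H, W) one-hot tensor."""
    one_hot = F.one_hot(label_map.long(), num_classes=NUM_CLASSES)  # (H, W, NUM_CLASSES)
    return one_hot.permute(2, 0, 1).float()                         # channel-first layout

# Minimal usage example with a dummy 4x4 label map
dummy_label = torch.randint(0, NUM_CLASSES, (4, 4))
print(one_hot_encode(dummy_label).shape)  # torch.Size([19, 4, 4])
```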
2) Constructing and training the street view image semantic segmentation deep neural network model: the street view image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer comprises two parts and is essentially a multi-resolution fusion enhancement network that introduces deformable convolution; the first stage is the first part, and the second and third stages form the second part. In the first stage, a deformable convolution is connected in series with an Xception Module to form a fusion sub-network A, and sub-network A is repeated three times in series so that more deep semantic feature information can be obtained. The Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+. The second stage is a dual-branch parallel network in which each branch is a sub-network composed of three Xception Modules in series, used for feature extraction and feature fusion. The third stage is a three-branch parallel network in which each branch is a sub-network composed of three Xception Modules in series; its function is the same as that of the second stage. Specifically, the first stage consists of three repeated deformable convolution modules each connected in series with an Xception Module, where the convolution kernel size is 3×3; the Xception Module in the invention is composed of 3 depthwise separable 3×3 convolution layers with stride 1 and padding 1. The method comprises the following steps:
2_1) an input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the R, G, B three-channel components to a hidden layer;
for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image with the width W and the height H, and the output end of the input layer outputs R, G, B three-channel components of the original input image to the hidden layer;
2_2) The first stage of the hidden layer comprises a fusion enhancement Module constructed from a deformable convolution Module connected in series with an Xception Module; three such fusion enhancement modules are repeatedly stacked, and a plurality of feature maps are generated in turn through the three fusion enhancement modules;
The first stage of the hidden layer is composed of three fusion enhancement modules, and each fusion enhancement Module mainly consists of a deformable convolution Module connected in series with a lightweight Xception Module. Using a deformable convolution unit, the spatially sampled position information in the module undergoes a further displacement adjustment; this displacement can be learned in the target task and requires no additional supervisory signal. The offsets added in the deformable convolution unit are part of the structure of the street view image semantic segmentation deep convolutional neural network model. In the input feature map, the original operation obtains a partial feature region through a sliding window; after the deformable convolution is introduced, the original convolution network is divided into two paths that share the feature map. The upper path learns the offsets using a parallel standard convolution, while gradient back propagation still proceeds normally so that end-to-end learning is possible, which guarantees the successful integration of the deformable convolution. After the offsets are learned, the size and position of the deformable convolution kernel can be dynamically adjusted according to the image content currently to be identified; the visual effect is that the sampling point positions of the convolution kernels at different locations change adaptively with the image content, thereby adapting to geometric deformations such as the shape and size of different objects in the image. Concretely: the displacement required by the deformable convolution is obtained through the output of a parallel standard convolution and is then applied to the convolution kernel to achieve the deformable convolution effect. Since image pixels are integral, an offset operation must be applied to the pixels; the generated offsets are floating-point values that have to be converted to an integer type, but directly rounding the offsets would prevent back propagation, so the corresponding pixels are obtained by bilinear interpolation.
Convolution comprises two steps: 1) sampling on the input feature map x using a regular grid R; 2) summing the sampled values weighted by w. The grid R defines the size and dilation of the receptive field, as in formula (1):

$$R=\{(-1,-1),(-1,0),\ldots,(0,1),(1,1)\} \qquad (1)$$

which defines a 3 × 3 convolution kernel with dilation 1.
The definition of convolution is: for each pixel position p_0 in the output, the general convolution is calculated as in equation (2):

$$y(p_0)=\sum_{p_k\in R} w(p_k)\cdot x(p_0+p_k) \qquad (2)$$

where y(p_0) is the feature map value corresponding to position p_0; p_0 is each position on the output feature map y; x is the input feature map; and R represents the grid of the receptive field, exemplified here by 3 × 3. For each position p_0 on the output feature map y, p_k enumerates the positions in R. In the deformable convolution, the regular grid R is augmented with offsets {Δp_k | k = 1, ..., K}, where K = |R|. Equation (2) then becomes equation (3):

$$y(p_1)=\sum_{p_k\in R} w(p_k)\cdot x(p_1+p_k+\Delta p_k) \qquad (3)$$

where p_1 is each position on the output feature map y after the deformable convolution is incorporated, and y(p_1) is the deformed feature map value corresponding to position p_1; Δp_k are the offsets, giving 2K offset values in the x and y directions.

The original image pixel values are denoted V. The original convolution process is divided into two paths: the upper path learns the offsets in the x and y directions, with an output of size H × W × 2K, where K = |R| is the number of sampling points in the grid and 2K accounts for offsets in both the x and y directions; the image pixel values at this point are denoted U. With these offsets, the pixel-value indices of the image in U are added to those in V; for each convolution window, the window after translation is no longer the original regular sliding window, while the calculation process remains consistent with ordinary convolution. The sampled positions thus become irregular; since the offset Δp_k is usually fractional, the corresponding value is calculated by bilinear interpolation as in equation (4):

$$x(p)=\sum_{q} G(q,p)\cdot x(q) \qquad (4)$$

where p denotes an arbitrary position on the feature map (p = p_1 + p_k + Δp_k for equation (3)), q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
After all pixel positions are obtained, a new image M is obtained, and M is input as input data to the Xception Module.
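A minimal sketch of the two-path deformable convolution described above, assuming torchvision's deform_conv2d operator is available; the class and parameter names are illustrative, not the patent's. A parallel standard convolution (the upper path) predicts the 2K offsets, which are then applied to the sampling grid of the main 3×3 convolution, with bilinear interpolation at fractional positions handled inside the operator.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """Deformable 3x3 convolution: a parallel standard convolution learns the offsets."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Upper path: standard conv predicting 2K offsets (K = kernel_size**2 sampling points)
        self.offset_conv = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                     kernel_size=kernel_size, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)  # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        # Main path: weights of the deformable convolution itself
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_uniform_(self.weight, a=1)
        self.padding = padding

    def forward(self, x):
        offset = self.offset_conv(x)  # (N, 2K, H, W): offsets in the x and y directions
        # Samples x at p0 + pk + delta_pk, with bilinear interpolation for fractional offsets
        return deform_conv2d(x, offset, self.weight, self.bias, padding=self.padding)

# Usage: DeformableConv2d(3, 32)(torch.randn(1, 3, 64, 64)).shape -> (1, 32, 64, 64)
```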
The Xception Module is taken as the basic residual structure of DeepLabV3+; its residual learning unit extracts features with three 3 × 3 depthwise separable convolutions, performing the convolution calculation channel by channel and then point by point. Compared with conventional convolution, the parameter count and computational cost are lower, which is the main reason for introducing the Xception Module. When the input feature and the output feature differ in dimension (number of channels of the feature map), the dimension of the input feature is first adjusted by a 1 × 1 convolution and then added to and fused with the output feature of the residual learning unit, giving the final feature map. When the input feature and the output feature have the same dimension, the final extracted feature is obtained by directly adding (fusing) the input feature and the output feature map of the residual learning unit. This combines the idea of depthwise separable convolution with the basic residual Bottleneck structure (a 1×1 convolution, a 3×3 convolution and another 1×1 convolution, where the 1×1 convolutions adjust the feature dimension and the 3×3 convolution extracts features): the depthwise separable convolution splits a standard convolution into a channel-wise convolution and a spatial convolution to reduce the parameters of model training, and the residual structure is used to eliminate the gradient explosion problem caused by deepening the network hierarchy.
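The following is a minimal sketch of such a lightweight Xception-style residual unit under the description above: three 3×3 depthwise separable convolutions (channel-by-channel, then point-by-point), with a 1×1 convolution on the shortcut only when the input and output channel counts differ. The class names and the use of batch normalization are our assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise (channel-by-channel) 3x3 conv followed by a pointwise 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionModule(nn.Module):
    """Residual unit built from 3 depthwise separable 3x3 convolutions (stride 1, padding 1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv2d(in_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
        )
        # A 1x1 convolution adjusts the shortcut only when the feature dimensions differ
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # residual fusion of input and output features

# Usage: XceptionModule(32, 64)(torch.randn(1, 32, 128, 256)).shape -> (1, 64, 128, 256)
```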
For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R channel component, G channel component and B channel component of the original input image output by the output end of the input layer, and its output end outputs the generated feature maps; the set formed by the output feature maps is denoted R_1. The input end of the 2nd fusion enhancement Module receives R_1 output by the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_2. The input end of the 3rd fusion enhancement Module receives R_2 output by the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_3. Each feature map in R_3 has width W, height H and channel number C, so R_3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the 3 repeated fusion enhancement modules of deformable convolution in series with an Xception Module, downsampling operations with stride 1 and stride 2 are performed respectively, obtaining two new sets of feature maps denoted R_4 and R_5, where each map in R_4 is (H, W, C) and each map in R_5 is (H/2, W/2, 2C).
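Reusing the two illustrative modules sketched above, the first stage can be read as follows: three repeated fusion enhancement modules (deformable convolution followed by an Xception Module), after which a stride-1 branch and a stride-2 branch produce the R_4 (H, W, C) and R_5 (H/2, W/2, 2C) feature sets. Sketching the downsampling as strided convolutions is an assumption; the patent only specifies the strides.

```python
import torch.nn as nn

class FusionEnhanceModule(nn.Module):
    """Fusion enhancement module: a deformable convolution in series with an Xception Module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deform = DeformableConv2d(in_ch, out_ch)   # from the sketch above
        self.xception = XceptionModule(out_ch, out_ch)  # from the sketch above

    def forward(self, x):
        return self.xception(self.deform(x))

class StageOne(nn.Module):
    """Three stacked fusion enhancement modules, then dual-resolution branching."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.blocks = nn.Sequential(
            FusionEnhanceModule(in_ch, width),
            FusionEnhanceModule(width, width),
            FusionEnhanceModule(width, width),
        )
        self.keep_res = nn.Conv2d(width, width, 3, stride=1, padding=1)      # -> R4: (H, W, C)
        self.down_res = nn.Conv2d(width, 2 * width, 3, stride=2, padding=1)  # -> R5: (H/2, W/2, 2C)

    def forward(self, x):
        r3 = self.blocks(x)                        # the R3 feature set
        return self.keep_res(r3), self.down_res(r3)
```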
2_3) The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout the network of this second part, and information is continuously exchanged among the multi-resolution features, so that high spatial resolution is maintained while good semantic expression capability is obtained. The specific steps are as follows:
After the first stage, the second stage generates two parallel networks S_1 and S_2. S_1 consists of 3 lightweight Xception Modules in series; the width and height of the input and output feature layers of each Xception Module are consistent. The input end of S_1 receives all feature maps of R_4, and the output end of S_1 outputs the generated feature maps, whose set is denoted R_6, where each map in R_6 is (H, W, C). S_2 is likewise formed by 3 lightweight Xception Modules in series, with the width and height of the input and output feature layers of each Xception Module kept consistent; the input end of S_2 receives all feature maps of R_5, and its output end outputs the generated feature maps, whose set is denoted R_7, where each map in R_7 is (H/2, W/2, 2C). The two parallel networks S_1 and S_2 of the second stage then perform downsampling operations with stride 1 and stride 2 respectively, yielding five new sets of feature maps denoted R_8, R_9, R_10, R_11 and R_12, where each map in R_8 is (H, W, C), each map in R_9 is (H/2, W/2, 2C), each map in R_10 is (H/2, W/2, 2C), each map in R_11 is (H/4, W/4, 4C), and each map in R_12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S_3, S_4 and S_5. S_3 consists of 3 lightweight Xception Modules in series, and the width and height of the input and output feature layers of each Xception Module are consistent. R_7 is partially upsampled to obtain a new set of feature maps denoted R_13, where each map in R_13 is (H, W, C). At the same time, the information fusion layer fuses R_8 and R_13 at the feature information level, and the set of feature maps generated after fusion is denoted R_14, where each map in R_14 is (H, W, C). The input end of S_3 receives all feature maps of R_14, and the output end of S_3 outputs the generated feature maps, whose set is denoted R_15, where each map in R_15 is (H, W, C). S_4 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses the feature information of R_9 and R_10, and the set of fused feature maps is denoted R_16, where each map in R_16 is (H/2, W/2, 2C). The input end of S_4 receives all feature maps of R_16, and the output end of S_4 outputs the generated feature maps, whose set is denoted R_17, where each map in R_17 is (H/2, W/2, 2C). S_5 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses R_11 and R_12 at the feature information level, and the set of fused feature maps is denoted R_18, where each map in R_18 is (H/4, W/4, 4C). The input end of S_5 receives all feature maps of R_18, and the output end of S_5 outputs the generated feature maps, whose set is denoted R_19, where each map in R_19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R_17 and R_19 generated by sub-networks S_4 and S_5 are upsampled to produce feature maps of the same size scale as R_15 generated by the S_3 sub-network; the resulting sets of feature maps are denoted R_20 and R_21 respectively. R_15, R_20 and R_21 are then input into the feature fusion layer for feature information fusion, and the set formed by the newly generated feature maps is denoted R_22, where each map in R_22 is (H, W, C).
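The recurring pattern in the second and third stages — upsample the lower-resolution branch and fuse it with the higher-resolution branch at the feature information level — can be sketched as below. Showing the fusion as element-wise addition after a 1×1 channel adjustment is an assumption; the patent states only that the feature information layers are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFuse(nn.Module):
    """Fuse a low-resolution feature set (e.g. R7) into a high-resolution one (e.g. R8)."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.align = nn.Conv2d(low_ch, high_ch, kernel_size=1)  # match the channel count

    def forward(self, high, low):
        low_up = F.interpolate(low, size=high.shape[2:], mode='bilinear', align_corners=False)
        return high + self.align(low_up)  # feature-information-level fusion

# e.g. R14 = fuse(R8, R7), which is then fed to the S3 sub-network
high = torch.randn(1, 32, 128, 256)  # stands in for R8: (H, W, C)
low = torch.randn(1, 64, 64, 128)    # stands in for R7: (H/2, W/2, 2C)
print(UpsampleFuse(64, 32)(high, low).shape)  # torch.Size([1, 32, 128, 256])
```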
For the output layer, which consists of 1 convolution layer, the input end of the output layer receives all the feature maps in the set R_22 output by the hidden layer, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image; the width of each semantic segmentation prediction map is W, the height is H, and the number of channels is C.
2_4) The original street view images {J_n(i,j)} in the training set and the corresponding semantic label images (semantic segmentation label gray-scale maps) are used as original input images and are input into the constructed street view image semantic segmentation deep neural network model for training, obtaining the semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {J_n(i,j)} is denoted P_n.
2_5) The loss function value between the set P_n of semantic segmentation prediction maps corresponding to each original street view image in the training set and the one-hot encoded image set T_n obtained from the corresponding real semantic segmentation image is calculated; the loss function value between P_n and T_n is denoted Loss_n. In the specific implementation, the categorical cross entropy is adopted to obtain the loss function value Loss_n between P_n and T_n.
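As a sketch of step 2_5), assuming the network outputs per-pixel scores over the 19 classes and the ground truth is available as class indices (equivalently, the argmax of the one-hot labels), the categorical cross entropy can be computed with PyTorch's built-in loss; the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # categorical cross entropy over the 19 classes

# prediction: (batch, 19, H, W) raw class scores from the output layer
prediction = torch.randn(2, 19, 128, 256, requires_grad=True)
# target: (batch, H, W) class indices 0..18 per pixel
target = torch.randint(0, 19, (2, 128, 256))

loss = criterion(prediction, target)
loss.backward()     # gradients used to update the weight vectors and bias terms
print(loss.item())  # one of the M x N loss function values tracked during training
```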
2_6) Steps 2_4) and 2_5) are repeatedly executed M times to obtain the deep neural network classification training model, producing M × N loss function values; the smallest loss function value is then found among the M × N values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, correspondingly denoted W_best and b_best; the training of the street view image semantic segmentation deep neural network classification model is thus completed.
3) Test the model: the test set is input into the trained model for testing.
3_1) Let {I(i′,j′)} represent a road scene image to be semantically segmented, i.e. an image of the test set; here 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, where W′ represents the width of {I(i′,j′)}, H′ represents the height of {I(i′,j′)}, and I(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I(i′,j′)};
3_2) The R channel component, G channel component and B channel component of {I(i′,j′)} are input into the trained street view image semantic segmentation deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {I(i′,j′)}, denoted {I_pred(i′,j′)}, where I_pred(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I_pred(i′,j′)}.
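A sketch of the prediction in step 3_2): the trained model (holding W_best and b_best) is run on the test image, and the class with the highest score at each pixel becomes the predicted semantic segmentation image. The function and variable names are illustrative.

```python
import torch

@torch.no_grad()
def predict(model, image_rgb):
    """image_rgb: (3, H', W') float tensor holding the R, G and B channel components."""
    model.eval()
    scores = model(image_rgb.unsqueeze(0))   # (1, 19, H', W') per-pixel class scores
    return scores.argmax(dim=1).squeeze(0)   # (H', W') predicted class index per pixel

# Usage (illustrative): seg = predict(trained_model, test_image); seg[i, j] is the class at (i', j')
```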
Through the steps, the deformable convolution fusion enhanced image semantic segmentation is realized.
Compared with the prior art, the invention has the beneficial effects that:
1) A lightweight Xception Module is introduced to replace the Bottleneck module in conventional network models. The Xception Module borrows the design idea of the Bottleneck: the network model is continuously deepened through the residual learning unit to extract rich semantic features, while the standard convolutions in the Bottleneck are replaced by depthwise separable convolutions, so the parameters of the model can be reduced and the computational cost lowered while accuracy is maintained. Meanwhile, the multi-scale fusion of the network works better: after feature extraction and fusion by these modules, the interaction of high and low resolutions produces better outputs.
2) The deep neural network constructed by the method adopts a high-resolution fusion parallel network to reduce the feature information lost by the feature maps across the whole network; by keeping high-resolution feature map information unchanged throughout the process and fusing it with low-resolution feature map information, effective deep information is retained to a large extent.
3) The deep neural network constructed by the method disclosed by the invention has the advantages that the deformable convolution is integrated in the first stage of the hidden layer, so that the network model has better deformation modeling capability while maintaining high-resolution characteristics in the characteristic extraction process, the problems of small-scale target loss and discontinuous segmentation during semantic segmentation are solved, and the overall robustness of the model is better.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Fig. 2 is a block diagram of a composition structure of a street view image semantic segmentation neural network model constructed by the method of the invention.
Fig. 3 is a schematic diagram of a framework of a street view image semantic segmentation neural network model according to the method of the present invention.
FIG. 4 shows a street view image to be semantically segmented, a corresponding real semantic segmentation image, and a predicted semantic segmentation image obtained by prediction, which are adopted in the embodiment of the present invention;
wherein, (a) is a street view image to be semantically segmented; (b) The real semantic segmentation image corresponding to the street view image to be semantically segmented is shown in the step (a); (c) The method is used for predicting the street view image to be semantically segmented shown in the step (a) to obtain the predicted semantic segmentation image.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention provides a deformable convolution fusion enhanced street view image semantic segmentation method, which constructs a street view image semantic segmentation deep neural network model so that the network model can obtain more small-target feature information when segmenting large target objects in a street view image, thereby alleviating the problems of small-scale target loss and discontinuous segmentation during street view image semantic segmentation, giving the model better overall robustness and higher street view image processing accuracy, and improving the image segmentation effect.
The general implementation block diagram of the deformable convolution fusion enhanced street view image semantic segmentation method provided by the invention is shown in fig. 1, and comprises two processes of a training stage and a testing stage.
Fig. 2 is a block diagram of a composition structure of a street view image semantic segmentation neural network model constructed by the method of the invention. Fig. 3 is a schematic diagram of a framework of a street view image semantic segmentation neural network model according to the method of the present invention. The method for realizing the deformable convolution fusion enhanced street view image semantic segmentation mainly comprises the following steps:
1) Firstly, inputting an original image into a first deformable convolution layer in a first stage of a network, and extracting features (high-resolution feature images);
2) Inputting the output initial features into the first Xception Module to obtain a deeper feature map;
3) Repeating 1) and 2) 3 times, namely immediately following each deformable convolution Module with an Xception Module, extracting deep features multiple times while enlarging the receptive field;
4) Respectively performing downsampling operations with strides of 1 and 2, where one branch keeps the high resolution and the other runs in parallel at a lower resolution, and inputting the results into the 3 repeated Xception Modules of the different branches in the second stage;
5) After feature information fusion through the feature fusion layer, respectively inputting the results into the 3 repeated Xception Modules of the different branches in the third stage, performing downsampling operations with strides of 1 and 2, then performing upsampling and feature fusion, and outputting a high-resolution feature map;
6) Finally, after one convolution, adjusting the number of channels of the output features to the number of classes to be segmented; the predicted segmentation image is obtained after applying the classifier activation function.
In specific implementation, the specific steps of the training phase process of the street view image semantic segmentation neural network model of the method are as follows:
1, Constructing an image training set: select N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps to form the training set. The n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}. The original street view image is an RGB color image and the corresponding label image is a gray-scale image; N is a positive integer with N ≥ 500, for example 1000; n is a positive integer with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes its height, for example W = 1024 and H = 512; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}. Meanwhile, in order to evaluate the designed model well, the real segmentation label image corresponding to each original street view image and its semantic segmentation label gray-scale map in the training set is taken as the training target and denoted {G_n(i,j)}.
Here, the original street view images are 2975 images selected directly from the training split of the Cityscapes public urban street scene dataset.
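A sketch of loading these training images, assuming torchvision's Cityscapes dataset wrapper and a local copy of the dataset under ./cityscapes; the path and transform are illustrative.

```python
from torchvision import datasets, transforms

# 'fine' annotations, 'train' split: the 2975 finely annotated training images
train_set = datasets.Cityscapes(
    root='./cityscapes',
    split='train',
    mode='fine',
    target_type='semantic',           # gray-scale semantic label maps
    transform=transforms.ToTensor(),  # RGB image -> (3, H, W) float tensor
)

image, label = train_set[0]
print(len(train_set), image.shape)    # 2975 images in the training split
```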
2, Constructing the deep neural network: the deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is composed of two parts: three repeated deformable convolution modules connected in series with Xception Modules form the high-resolution entry of the network, followed by a two-stage multi-branch parallel fused Xception Module network.
2_1 For the input layer, the input end of the input layer receives the R, G, B three-channel components of an original input image, and the output end of the input layer outputs the R channel component, G channel component and B channel component of the original input image to the hidden layer; the width of the original input image received at the input end of the input layer is W and the height is H;
2_2, In the first part of the hidden layer, a fusion Module is constructed from a deformable convolution Module connected in series with an Xception Module; three such fusion modules are repeatedly stacked, and a plurality of feature maps are generated in turn through the three fusion enhancement modules;
The first stage of the hidden layer is composed of three fusion enhancement modules, and each fusion enhancement Module mainly consists of a deformable convolution Module connected in series with a lightweight Xception Module. After all pixel positions are obtained through the first deformable convolution, a new picture M is obtained and input into the Xception Module as input data.
The first stage is the first part. For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R channel component, G channel component and B channel component of the original input image output by the output end of the input layer, and its output end outputs the generated feature maps; the set formed by the output feature maps is denoted R_1. The input end of the 2nd fusion enhancement Module receives R_1 output by the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_2. The input end of the 3rd fusion enhancement Module receives R_2 output by the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_3. Each feature map in R_3 has width W, height H and channel number C, so R_3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the 3 repeated fusion enhancement modules of deformable convolution in series with an Xception Module, downsampling operations with stride 1 and stride 2 are performed respectively, obtaining two new sets of feature maps denoted R_4 and R_5, where each map in R_4 is (H, W, C) and each map in R_5 is (H/2, W/2, 2C).
2_3 The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout the network of this second part, and information is continuously exchanged among the multi-resolution features, so that high spatial resolution is maintained while good semantic expression capability is obtained. The specific steps are as follows:
After the first stage, the second stage generates two parallel networks S_1 and S_2. S_1 consists of 3 lightweight Xception Modules in series; in the invention, each Xception Module consists of 3 depthwise separable 3 × 3 convolution layers with stride 1 and padding 1. The width and height of the input and output feature layers of each Xception Module are consistent. The input end of S_1 receives all feature maps of R_4, and the output end of S_1 outputs the generated feature maps, whose set is denoted R_6, where each map in R_6 is (H, W, C). S_2 is likewise formed by 3 lightweight Xception Modules in series, with the width and height of the input and output feature layers of each Xception Module kept consistent; the input end of S_2 receives all feature maps of R_5, and its output end outputs the generated feature maps, whose set is denoted R_7, where each map in R_7 is (H/2, W/2, 2C). The two parallel networks S_1 and S_2 of the second stage then perform downsampling operations with stride 1 and stride 2 respectively, yielding five new sets of feature maps denoted R_8, R_9, R_10, R_11 and R_12, where each map in R_8 is (H, W, C), each map in R_9 is (H/2, W/2, 2C), each map in R_10 is (H/2, W/2, 2C), each map in R_11 is (H/4, W/4, 4C), and each map in R_12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S_3, S_4 and S_5. S_3 consists of 3 lightweight Xception Modules in series, and the width and height of the input and output feature layers of each Xception Module are consistent. At this point R_7 is partially upsampled to obtain a new set of feature maps denoted R_13, where each map in R_13 is (H, W, C). At the same time, the information fusion layer fuses R_8 and R_13 at the feature information level, and the set of feature maps generated after fusion is denoted R_14, where each map in R_14 is (H, W, C). The input end of S_3 receives all feature maps of R_14, and the output end of S_3 outputs the generated feature maps, whose set is denoted R_15, where each map in R_15 is (H, W, C). S_4 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses the feature information of R_9 and R_10, and the set of fused feature maps is denoted R_16, where each map in R_16 is (H/2, W/2, 2C). The input end of S_4 receives all feature maps of R_16, and the output end of S_4 outputs the generated feature maps, whose set is denoted R_17, where each map in R_17 is (H/2, W/2, 2C). S_5 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses R_11 and R_12 at the feature information level, and the set of fused feature maps is denoted R_18, where each map in R_18 is (H/4, W/4, 4C). The input end of S_5 receives all feature maps of R_18, and the output end of S_5 outputs the generated feature maps, whose set is denoted R_19, where each map in R_19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R_17 and R_19 generated by sub-networks S_4 and S_5 need to be upsampled to produce feature maps of the same size scale as R_15 generated by the S_3 sub-network; the resulting sets of feature maps are denoted R_20 and R_21 respectively. R_15, R_20 and R_21 are then input into the feature fusion layer for feature information fusion, and the set formed by the newly generated feature maps is denoted R_22, where each map in R_22 is (H, W, C).
For the output layer, which consists of 1 convolution layer, the input end of the output layer receives all the feature maps in the set R_22 output by the hidden layer, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image; the width of each semantic segmentation prediction map is W and the height is H.
2_4, The original street view images in the training set and the corresponding semantic segmentation label gray-scale maps are used as original input images and input into the deep neural network for training, obtaining the semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {J_n(i,j)} is denoted P_n.
2_5 The loss function value between the set P_n of semantic segmentation prediction maps corresponding to each original street view image in the training set and the one-hot encoded image set T_n obtained from the corresponding real semantic segmentation image is calculated; the loss function value between P_n and T_n is denoted Loss_n, and is obtained using the categorical cross entropy (categorical crossentropy).
2_6 Steps 2_4 and 2_5 are repeatedly executed M times to obtain the deep neural network classification training model, producing M × N loss function values; the smallest loss function value is then found among the M × N values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, correspondingly denoted W_best and b_best. In this example, M = 484.
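A sketch of the training loop in steps 2_4 to 2_6, assuming model is the constructed network, criterion the cross-entropy loss shown earlier, and train_loader a DataLoader built over the Cityscapes training set above: the loss is tracked across the M passes over the N training images, and the parameters giving the smallest loss are kept as W_best and b_best. The optimizer choice and learning rate are assumptions.

```python
import copy
import torch

def train(model, train_loader, criterion, epochs_M, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs_M):                 # repeat steps 2_4 and 2_5 M times
        for image, target in train_loader:        # the N training images per pass
            optimizer.zero_grad()
            loss = criterion(model(image), target)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:           # smallest of the M x N loss values so far
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())  # holds W_best and b_best
    model.load_state_dict(best_state)
    return model
```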
3 Test the model: the test set is input into the trained model for testing. The specific steps of the test stage process are as follows:
3_1 Let {I(i′,j′)} represent a road scene image to be semantically segmented; here 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, where W′ represents the width of {I(i′,j′)}, H′ represents the height of {I(i′,j′)}, and I(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I(i′,j′)};
3_2 The R channel component, G channel component and B channel component of {I(i′,j′)} are input into the trained deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {I(i′,j′)}, denoted {I_pred(i′,j′)}, where I_pred(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I_pred(i′,j′)}.
The feasibility and effectiveness of the method of the invention are further verified as follows.
The architecture of the deep neural network was built using the Python-based deep learning library PyTorch 1.2. The Cityscapes test set is used to analyse the segmentation effect of the street view images predicted by the method of the invention. The segmentation performance of the predicted semantic segmentation images is evaluated using 3 objective metrics commonly used to assess semantic segmentation methods, namely the mean intersection-over-union (Mean Intersection over Union, MIoU), the pixel accuracy (Pixel Accuracy, PA) and the mean pixel accuracy (Mean Pixel Accuracy, MPA), whose definitions are given below.
Definition 1: mlou (homozygote ratio, mean Intersection over Union) is a standard measure of semantic segmentation. Which calculates the ratio of the intersection and union of the two sets. The formula is as follows:
Figure BDA0002784048100000161
definition 2: the Pixel Accuracy (Pixel Accuracy) represents the proportion of the marked correct pixels to the total pixels as shown in the following formula:
Figure BDA0002784048100000162
definition 3: the average pixel accuracy (Mean Pixel Accuracy) is a boost of the PA, calculates the ratio of the number of correctly classified pixels in each class, and then averages the PA for all classes as follows:
Figure BDA0002784048100000163
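A sketch of computing the three metrics from a confusion matrix whose entry [i, j] counts pixels of true class i predicted as class j; this follows the standard definitions above and is not code from the patent.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (k+1, k+1) confusion matrix; conf[i, j] = pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    miou = np.nanmean(iou)                   # mean intersection-over-union (MIoU)
    pa = tp.sum() / conf.sum()               # pixel accuracy (PA)
    mpa = np.nanmean(tp / conf.sum(axis=1))  # mean pixel accuracy (MPA)
    return miou, pa, mpa

# conf can be accumulated per image with:
# np.bincount(19 * true.flatten() + pred.flatten(), minlength=19 * 19).reshape(19, 19)
```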
The method of the present invention is used to predict each street view image in the Cityscapes test set, obtaining a predicted semantic segmentation image corresponding to each street view image. The mean intersection-over-union MIoU, the pixel accuracy PA and the mean pixel accuracy MPA reflect the validity and prediction accuracy of the semantic segmentation effect of the method; the mean intersection-over-union MIoU values are shown in Table 1.
TABLE 1 mIoU values on the Cityscapes dataset for the method of the present invention
[Table 1 is provided as an image in the original publication.]
From the data listed in Table 1, the segmentation effect on street view images obtained by the method of the present invention is good, which indicates that the method is feasible and effective for obtaining the predicted semantic segmentation image corresponding to a street view image. The specific performance in terms of mean intersection-over-union MIoU, pixel accuracy PA and mean pixel accuracy MPA is shown in Table 2; the results show that the segmentation effect of the method ranks among the best of the existing segmentation models.
TABLE 2 Algorithm performance on the Cityscapes dataset
[Table 2 is provided as an image in the original publication.]
In fig. 4, (a) a selected street view image to be semantically segmented is given; (b) Giving out a real semantic segmentation image corresponding to the street view image to be semantically segmented shown in the step (a); (c) The invention provides a predicted semantic segmentation image obtained by predicting the street view image to be semantically segmented shown in the step (a) by using the method. Comparing (b) and (c) in fig. 4, it can be seen that the segmentation accuracy of the predicted semantic segmentation image obtained by the method of the present invention is higher, and is close to the real semantic segmentation image.
It should be noted that the examples are disclosed for the purpose of aiding in the further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (7)

1. A deformable convolution fusion enhanced street view image semantic segmentation method comprises a training stage and a testing stage, and specifically comprises the following steps:
1) Constructing an image training set which comprises an original street view image and a corresponding semantic tag image;
selecting N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps, namely semantic label images, to form the image training set; N is a positive integer; the n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}; n is a positive integer, with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes the height of {J_n(i,j)}; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}; the real segmentation label image corresponding to the semantic label image is denoted {G_n(i,j)}; {G_n(i,j)} is then processed into a one-hot encoded image, and the one-hot encoded images form the set T;
2) Constructing and training a street view image semantic segmentation deep neural network model:
the street view image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is a multi-resolution fusion enhancement network introducing deformable convolution and comprises a first stage, a second stage and a third stage;
the first stage forms a fusion sub-network A by connecting a deformable convolution in series with an Xception Module, and the sub-network A is repeated three times in series so as to obtain more deep semantic feature information; the Xception Module is a basic residual module of the semantic segmentation network DeepLabV3+;
the second stage is a double-branch parallel network, each branch is a subnet, and each branch consists of three Xreception modules which are connected in series and is used for feature extraction and feature fusion; the third stage is three branch parallel networks, each branch is a subnet, and each branch consists of three Xreception modules connected in series;
2_1) the input layer of the street view image semantic segmentation deep neural network model receives the R, G and B channel components of an original input image and outputs them to the hidden layer;
2_2) the first stage of the hidden layer comprises fusion enhancement modules, each constructed by connecting a deformable convolution module in series with an Xception Module, and a plurality of feature maps are generated sequentially through the three fusion enhancement modules; the displacement required by the deformable convolution is obtained from the output of a parallel standard convolution and is then applied to the convolution kernel, thereby realizing the deformable convolution;
in the deformable convolution, the regular grid R is augmented by offsets {Δp_k | k = 1, ..., K}, as represented by formula (3):
y(p_1) = Σ_{k=1}^{K} w(p_k) · x(p_1 + p_k + Δp_k)    (3)
where p_1 traverses each position on the output feature map y of the deformable convolution; y(p_1) is the value of the deformed feature map at position p_1; p_k is the k-th sampling position of the regular grid R and w(p_k) is the corresponding convolution kernel weight; x is the input feature map; and Δp_k is the learned offset, giving 2K offset values covering both the x and y directions;
since the offset Δp_k is generally fractional, bilinear interpolation is then used to compute the values at the offset positions from the neighbouring integer positions; after the values at all pixel positions are obtained, a new image M is obtained, and M is fed as input data to the Xception Module;
the residual learning unit of the Xception Module performs the convolution calculation and feature extraction with depthwise separable convolutions, operating channel by channel and then point by point, thereby obtaining a feature map;
for the first stage, the input end of the 1st deformable-convolution-in-series-with-Xception fusion enhancement Module is connected with the channel components of the original input image output by the output end of the input layer, and its output end outputs the generated feature map set R_1; the input end of the 2nd fusion enhancement Module receives R_1, and its output end outputs the generated feature map set R_2; the input end of the 3rd fusion enhancement Module receives R_2, and its output end outputs the generated feature map set R_3, whose feature maps have size (H, W, C), where C is the number of channels;
after the fusion enhancement module formed by the deformable convolution in series with the Xception Module has been repeated 3 times to extract features in the first stage, downsampling operations are performed respectively to obtain two new feature map sets, denoted R_4 and R_5; each feature map in R_4 has size (H, W, C), and each feature map in R_5 has size (H/2, W/2, 2C);
2_3) the second stage and the third stage of the hidden layer exchange information among the multi-resolution features, so that good semantic expression capability is obtained while the high spatial resolution is maintained;
the second stage generates two parallel subnets S_1 and S_2; S_1 consists of 3 lightweight Xception Modules connected in series; the input end of S_1 receives R_4, and the output end of S_1 outputs the generated feature map set R_6, where each feature map in R_6 has size (H, W, C); S_2 consists of 3 lightweight Xception Modules connected in series; the input end of S_2 receives R_5, and the output end of S_2 outputs the generated feature map set R_7, where each feature map in R_7 has size (H/2, W/2, 2C); downsampling operations are then performed on the two parallel subnets S_1 and S_2 respectively to obtain five new feature map sets, denoted R_8, R_9, R_10, R_11 and R_12; each feature map in R_8 has size (H, W, C), each in R_9 has size (H/2, W/2, 2C), each in R_10 has size (H/2, W/2, 2C), each in R_11 has size (H/4, W/4, 4C), and each in R_12 has size (H/4, W/4, 4C);
the third stage generates three parallel subnets S_3, S_4 and S_5, where S_3 consists of 3 lightweight Xception Modules connected in series; R_7 is upsampled to obtain a new feature map set denoted R_13, where each feature map in R_13 has size (H, W, C); the feature information of R_8 and R_13 is fused layer by layer, and the feature map set generated after fusion is denoted R_14, where each feature map in R_14 has size (H, W, C); the input end of S_3 receives R_14, and the output end of S_3 outputs the generated feature map set R_15, where each feature map in R_15 has size (H, W, C); S_4 consists of 3 lightweight Xception Modules connected in series; meanwhile, the feature information of R_9 and R_10 is fused, and the fused feature map set is denoted R_16, where each feature map in R_16 has size (H/2, W/2, 2C); the input end of S_4 receives R_16, and the output end of S_4 outputs the generated feature map set R_17, where each feature map in R_17 has size (H/2, W/2, 2C); S_5 consists of 3 lightweight Xception Modules connected in series; the feature information of R_11 and R_12 is fused layer by layer, and the generated feature map set is denoted R_18, where each feature map in R_18 has size (H/4, W/4, 4C); the input end of S_5 receives R_18, and the output end of S_5 outputs the generated feature map set R_19, where each feature map in R_19 has size (H/4, W/4, 4C); at the end of the third stage, upsampling operations are performed on the feature map sets R_17 and R_19 generated by S_4 and S_5 to produce feature maps of the same size as those in R_15, and the resulting feature map sets are denoted R_20 and R_21 respectively; R_15, R_20 and R_21 are then input into a feature fusion layer for feature information fusion to generate a new feature map set R_22, where each feature map in R_22 has size (H, W, C);
the output layer consists of 1 convolution layer; the input end of the output layer receives the feature map set R_22, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the input original image; each semantic segmentation prediction map has width W, height H and C channels;
2_4) inputting the image training set into the constructed street view image semantic segmentation deep neural network model for training to obtain the semantic segmentation prediction map corresponding to each original street view image, the set formed by these semantic segmentation prediction maps being denoted {J_n^pred(i,j)};
2_5) calculating the loss function value Loss_n between {J_n^pred(i,j)} and the corresponding set of one-hot encoded images {J_n^onehot(i,j)};
2_6) repeatedly executing step 2_4) and step 2_5) M times to obtain the deep neural network classification training model, obtaining M × N loss function values in total; finding the loss function value with the smallest value among the M × N loss function values; the weight vector and bias term corresponding to the loss function value with the smallest value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted W_best and b_best respectively; the training of the street view image semantic segmentation deep neural network classification model is thereby completed;
3) Testing the model: inputting the test set into the trained model for testing;
3_1) let {S(i',j')} denote a road scene image to be semantically segmented, i.e. the test set, where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width of {S(i',j')}, H' denotes the height of {S(i',j')}, and S(i',j') denotes the pixel value of the pixel point at coordinate position (i',j') in {S(i',j')};
3_2) the channel components of {S(i',j')} are input into the trained street view image semantic segmentation deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {S(i',j')}, denoted {S^pred(i',j')}, where S^pred(i',j') denotes the pixel value of the pixel point at coordinate position (i',j') in {S^pred(i',j')}; the deformable convolution fusion enhanced image semantic segmentation is thereby realized.
2. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein in step 1), the object categories in the street view image are divided into 19 categories.
3. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the bilinear interpolation calculation of step 2_2) is represented by formula (4):
x(p) = Σ_q G(q, p) · x(q)    (4)
where p denotes a (generally fractional) position on the feature map, q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
4. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the residual learning unit of the Xception Module extracts features by 3 depthwise separable convolutions of size 3×3; when the input feature and the output feature differ in dimension, the dimension of the input feature is adjusted by a 1×1 convolution and then added to the output feature of the residual learning unit, thereby obtaining a feature map; when the input feature and the output feature have the same dimension, the input feature is added directly to the output feature map of the residual learning unit, thereby obtaining a feature map.
5. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein step 2_5) specifically adopts the categorical cross entropy to obtain the loss function value Loss_n between {J_n^pred(i,j)} and {J_n^onehot(i,j)}.
6. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the deep neural network model is built specifically using the Python-based deep learning library PyTorch 1.2.
7. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the Cityscapes test set is specifically adopted, and the mean intersection over union (MIoU), pixel accuracy (PA) and mean pixel accuracy (MPA) are adopted as indexes to verify the street view image segmentation effect of the deformable convolution fusion enhanced street view image semantic segmentation method.
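To relate the structure recited in claim 1 (step 2_2) and claims 3 and 4 to working code, the sketch below outlines one possible PyTorch implementation of a fusion enhancement module: a deformable convolution whose offsets come from a parallel standard convolution (torchvision's DeformConv2d is used here as a stand-in for formulas (3) and (4)), followed by an Xception-style residual unit built from 3×3 depthwise separable convolutions. The class names, channel counts and layer widths are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SeparableConv2d(nn.Module):
    """3x3 depthwise separable convolution: channel-by-channel, then point-by-point."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionResidualUnit(nn.Module):
    """Residual learning unit with 3 depthwise separable 3x3 convolutions (cf. claim 4).
    A 1x1 convolution adjusts the input dimension when it differs from the output."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv2d(in_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

class FusionEnhancementModule(nn.Module):
    """Deformable convolution in series with an Xception residual unit (cf. claim 1, step 2_2).
    A parallel standard convolution predicts the 2*K offsets required by the deformable
    convolution, where K = 3*3 sampling positions of the regular grid."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        offset_channels = 2 * kernel_size * kernel_size      # x and y offset per sampling position
        self.offset_conv = nn.Conv2d(in_ch, offset_channels, kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2)
        self.xception = XceptionResidualUnit(in_ch, out_ch)

    def forward(self, x):
        offset = self.offset_conv(x)        # displacement obtained from the parallel standard conv
        m = self.deform_conv(x, offset)     # deformed feature map ("new image M") fed to Xception
        return self.xception(m)

# Illustrative first stage: three cascaded fusion enhancement modules, the first fed with
# the R, G, B channel components (channel width 64 is an assumed value).
stage1 = nn.Sequential(
    FusionEnhancementModule(3, 64),
    FusionEnhancementModule(64, 64),
    FusionEnhancementModule(64, 64),
)
```

DeformConv2d internally performs the bilinear sampling of formula (4), so the fractional offsets predicted by the parallel convolution are resolved against the integral grid positions without extra code.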
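The multi-resolution exchange of step 2_3) can likewise be sketched as parallel branches whose lower-resolution outputs are upsampled, projected back to the full-resolution width and fused before the single output convolution. Fusion by element-wise addition and the 1×1 channel projections below are assumptions made for illustration; the patent only states that R_15, R_20 and R_21 are fused into R_22 and passed through one convolution layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuse three parallel branches (full, 1/2 and 1/4 resolution) and predict per-class maps,
    mirroring the fusion of R_15, R_20 and R_21 into R_22 followed by the output convolution."""
    def __init__(self, c=64, num_classes=19):
        super().__init__()
        # project the lower-resolution branches back to the width C of the full-resolution branch
        self.reduce_half = nn.Conv2d(2 * c, c, 1, bias=False)
        self.reduce_quarter = nn.Conv2d(4 * c, c, 1, bias=False)
        self.classifier = nn.Conv2d(c, num_classes, 1)   # the single output convolution layer

    def forward(self, r15, r17, r19):
        h, w = r15.shape[2:]
        r20 = F.interpolate(self.reduce_half(r17), size=(h, w), mode='bilinear', align_corners=False)
        r21 = F.interpolate(self.reduce_quarter(r19), size=(h, w), mode='bilinear', align_corners=False)
        r22 = r15 + r20 + r21                             # feature information fusion (assumed additive)
        return self.classifier(r22)
```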
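Finally, the training objective of step 2_5) and claim 5, the categorical cross entropy between the prediction maps and the one-hot encoded label images, corresponds to the standard per-pixel cross-entropy loss. A minimal sketch assuming a model that outputs one channel per class follows; function names and the optimiser interface are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 19   # street view object categories, cf. claim 2

def one_hot_labels(label, num_classes=NUM_CLASSES):
    """Turn an integer label map (N, H, W) into one-hot images (N, C, H, W), as in step 1)."""
    return F.one_hot(label.long(), num_classes).permute(0, 3, 1, 2).float()

criterion = nn.CrossEntropyLoss()   # categorical cross entropy over the class dimension

def training_step(model, optimizer, image, label):
    """One optimisation step; image is (N, 3, H, W), label is (N, H, W) class indices.
    CrossEntropyLoss on class indices equals the cross entropy against the one-hot labels."""
    optimizer.zero_grad()
    logits = model(image)                    # (N, NUM_CLASSES, H, W) prediction maps
    loss = criterion(logits, label.long())
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the weights that give the smallest of the M × N loss values, as in step 2_6), would then yield W_best and b_best.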
CN202011291950.6A 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method Active CN112396607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method


Publications (2)

Publication Number Publication Date
CN112396607A (en) 2021-02-23
CN112396607B (en) 2023-06-16

Family

ID=74606378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011291950.6A Active CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN112396607B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313105B (en) * 2021-04-12 2022-07-01 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113205520B (en) * 2021-04-22 2022-08-05 华中科技大学 Method and system for semantic segmentation of image
CN113420770A (en) * 2021-06-21 2021-09-21 梅卡曼德(北京)机器人科技有限公司 Image data processing method, image data processing device, electronic equipment and storage medium
CN113326799A (en) * 2021-06-22 2021-08-31 长光卫星技术有限公司 Remote sensing image road extraction method based on EfficientNet network and direction learning
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113554733B (en) * 2021-07-28 2022-02-01 北京大学 Language-based decoupling condition injection gray level image colorization method
CN113807356B (en) * 2021-07-29 2023-07-25 北京工商大学 End-to-end low-visibility image semantic segmentation method
CN113608223B (en) * 2021-08-13 2024-01-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113762263A (en) * 2021-08-17 2021-12-07 慧影医疗科技(北京)有限公司 Semantic segmentation method and system for small-scale similar structure
CN115294488B (en) * 2022-10-10 2023-01-24 江西财经大学 AR rapid object matching display method
CN115393725B (en) * 2022-10-26 2023-03-07 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115620001B (en) * 2022-12-15 2023-04-07 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018035805A1 (en) * 2016-08-25 2018-03-01 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795976A (en) * 2018-08-03 2020-02-14 华为技术有限公司 Method, device and equipment for training object detection model
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution
CN111401436A (en) * 2020-03-13 2020-07-10 北京工商大学 Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semantic segmentation of UAV ground-object scenes based on deep convolutional networks; 宋建辉; 程思宇; 刘砚菊; 于洋; 沈阳理工大学学报 (No. 06); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant