CN111563909B - Semantic segmentation method for complex street view image - Google Patents

Semantic segmentation method for complex street view image

Info

Publication number
CN111563909B
CN111563909B (application CN202010389518.4A)
Authority
CN
China
Prior art keywords
image
branch
segmentation
features
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010389518.4A
Other languages
Chinese (zh)
Other versions
CN111563909A (en)
Inventor
张丹
刘京
余义德
孙杰
王红萍
时光
张志伟
裴立冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unite 91550 Of Pla
Original Assignee
Unite 91550 Of Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unite 91550 Of Pla filed Critical Unite 91550 Of Pla
Priority to CN202010389518.4A priority Critical patent/CN111563909B/en
Publication of CN111563909A publication Critical patent/CN111563909A/en
Application granted granted Critical
Publication of CN111563909B publication Critical patent/CN111563909B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A semantic segmentation method for complex street view images. Building on the successful application of deep learning in computer vision, including image semantic segmentation, the invention provides a new method that uses a global convolutional neural network to semantically segment complex street view images, aiming to effectively resolve the under-segmentation and over-segmentation problems encountered when segmenting complex street view images and to markedly improve segmentation accuracy and speed. The method consists of four stages: information input, an encoder, a decoder and information output. The encoding module mainly comprises a DCNN part and an R-ASPP part, and the decoding module mainly comprises an AT-Decoder part. The DCNN effectively extracts low-level features containing position information, the R-ASPP extracts high-level semantic features containing geometric and texture information to the greatest extent, and the AT-Decoder effectively fuses the low-level detail features with the high-level semantic features.

Description

Semantic segmentation method for complex street view image
Technical Field
The invention belongs to the technical field of computer vision and digital image processing, and particularly relates to a method for semantically segmenting a complex street view image.
Background
Image semantic segmentation is a fundamental technology for image understanding. It is currently applied in fields such as true three-dimensional display, autonomous driving and assisted medical treatment, and is one of the research hotspots in computer vision. Its main task is to classify every pixel in an image and determine the category of each point, thereby partitioning regions and labelling the object category of every pixel. Deep learning theory has been widely applied in this field; in particular, the convolutional neural network (Convolutional Neural Network, CNN) has been successfully used by many researchers to build various semantic segmentation neural network models.
Patent application CN109101975A discloses a method for image semantic segmentation using a fully convolutional neural network. Its main characteristics are as follows: the front-end network downsamples the feature maps output by each block to a uniform size with a detail-preserving pooling layer, concatenates the four output feature maps, further corrects them with a feature correction module, and passes them to the back-end network; the back-end network is mainly responsible for upsampling, followed by variable-weight global pooling, and finally computes the cross entropy against the semantic annotation images of the training set for error back-propagation.
Patent application CN110263833A discloses an image semantic segmentation method based on an encoding-decoding structure. The method addresses the loss of feature-map resolution and spatial information caused by multiple layers of max pooling and downsampling in the network. It effectively fuses the deep information and the shallow spatial information acquired in the network, refines the fused feature map with multi-kernel convolution blocks, and finally obtains the segmentation result through data-dependent gradual upsampling.
Patent application CN110782462A discloses a semantic segmentation method based on dual-stream feature fusion. To obtain accurate image semantic segmentation results, the convolutional neural network in the training stage comprises an input layer, a hidden layer and an output layer, with an RGB image processing module, a depth image processing module, a fusion module and a first deconvolution layer built into the hidden layer. When the original image is input into the network for training, a corresponding semantic segmentation prediction map is obtained; a loss function quantifies the loss between the set of semantic segmentation prediction maps of the original images and the set of one-hot encoded images obtained from the corresponding real semantic segmentation images, and the optimal weight vectors and bias terms required by the convolutional neural network classification model are obtained on that basis.
Patent application CN110245665A discloses an image semantic segmentation method based on an attention mechanism. To extract image features effectively, the patent selects a deep convolutional neural network as the backbone of the semantic segmentation network and constructs an improved attention calculation module connected in series with the backbone. A position attention module learns the dependencies between data features through training, and a channel attention module is designed to model the interdependence between channels. The patent significantly improves segmentation results by modelling rich contextual dependencies on local features.
Patent application CN110210485A discloses an image semantic segmentation method based on attention-guided feature fusion; the network adopts an encoding-decoding structure. The encoder uses a modified ResNet-101 to generate a series of features ranging from high-resolution, low-semantic to low-resolution, high-semantic; the decoder adopts a pyramid structure module based on three layers of convolution operations, extracts high-level semantics under strong consistency constraints, and then performs layer-by-layer weighted fusion with the low-level stage features to obtain a preliminary segmentation heat map. In addition, the patent adds auxiliary supervision to each fusion output of the decoding stage, which is superimposed on the main supervision loss after the heat map is upsampled to strengthen the hierarchical training of the model, finally producing the semantic segmentation image.
In summary, image semantic segmentation methods built on deep learning theory can be effectively applied to segmentation tasks, but because segmentation scenes are complex and diverse, the currently proposed methods cannot achieve high accuracy across all scenes. Under the influence of the surrounding environment, the pixel values within a single object class in an image may differ too much while the pixel values of different classes may differ too little, so existing methods readily suffer from missing geometric features and indistinct texture features when segmenting complex street view images. It is therefore necessary to further explore the application of deep learning theory in image semantic segmentation, effectively broaden the application field of existing segmentation methods, and improve their segmentation accuracy.
Disclosure of Invention
The invention aims to solve the problem that image segmentation is degraded by missing geometric features and indistinct texture features when existing segmentation methods segment complex street view images, and provides a semantic segmentation method for complex street view images.
Building on the successful application of deep learning in computer vision, including image semantic segmentation, the invention provides a new method that uses a global convolutional neural network to semantically segment complex street view images, aiming to effectively resolve the under-segmentation and over-segmentation problems encountered when segmenting complex street view images and to markedly improve segmentation accuracy and speed.
The technical solution of the invention
The complex street view image semantic segmentation method disclosed by the invention mainly comprises four stages: information input, an encoder, a decoder and information output. The network as a whole is an encoding-decoding framework, in which the encoding module mainly comprises a DCNN part and an R-ASPP part, and the decoding module mainly comprises an AT-Decoder part. The method mainly comprises the following steps:
step 1, obtaining an image to be processed;
step 2, extracting low-level features containing position information by adopting DCNN;
step 3, extracting the image's high-level semantic geometric and texture information to the greatest extent by adopting the R-ASPP method;
step 4, adopting the different network structures of the AT-Decoder module to process and fuse the features of each part, so that the low-level detail features and the high-level semantic features are effectively fused;
and step 5, finally, applying a bilinear interpolation upsampling operation (4× upsampling to restore the original input image size) to obtain the final segmentation result image.
Further, in step 3, considering that multi-scale image features can contain more feature information, the DCNN output features obtained in step 2 are fed into the five branches of the spatial pyramid pooling module. Each branch other than the global pooling branch applies a 3×3 ordinary convolution to further learn the important content information in the feature map; in addition, the original input of the R-ASPP is carried forward via a long skip connection and fused with the image features obtained by the 3×3 ordinary convolution, and the outputs of the five branches are concatenated along the channel dimension as the output of the R-ASPP part.
Further, in step 4, the three different network structures of the AT-Decoder module, namely the DF branch, the DC branch and the DD branch, process the features of each part respectively, and the results obtained by each branch are finally fused in sequence in the channel dimension: first, the output of the DF branch is multiplied by the output of the DC branch to obtain B2, then B2 is added to the output of the DC branch to obtain a feature map B3, and finally a 3×3 convolution further learns features from the fused feature map to obtain a feature map B4.
The DF branch processes the low-level features containing detail information learned by the DCNN: a simple spatial attention model is designed, a 3×3 convolution operation further learns the main features, and a softmax function classifies the features. The DC branch concatenates the output feature map of the DCNN and the output feature map of the R-ASPP in the channel dimension, and a 3×3 convolution directly extracts features containing accurate position information and complete geometric and texture information. The DD branch processes the high-level semantic features output by the encoder module: a channel-based attention module is designed to attend to the relationships between feature-map channels; the attention module consists of two sub-branches, a maximum pooling sub-branch and an average pooling sub-branch; a fully connected layer is then added to fuse the feature maps across channels; finally, the result feature maps obtained by the maximum pooling and average pooling sub-branches are fused to obtain channel features containing more image information.
Further, a bilinear interpolation upsampling operation restores the image to the original resolution while retaining more image feature content, giving the final segmentation result image B5.
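To make the four-stage flow above concrete, the following is a minimal PyTorch-style sketch of how the encoder (DCNN plus R-ASPP), the AT-Decoder and the final 4× bilinear upsampling could be wired together. The class names, channel widths and the convention that the backbone returns both a low-level and a high-level feature map are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StreetSceneSegNet(nn.Module):
    """Illustrative wiring of the four stages: input -> encoder -> decoder -> output."""

    def __init__(self, backbone: nn.Module, raspp: nn.Module, at_decoder: nn.Module,
                 num_classes: int, decoder_channels: int = 256):
        super().__init__()
        self.backbone = backbone          # DCNN: low-level features with position information
        self.raspp = raspp                # R-ASPP: high-level semantic (geometric/texture) features
        self.at_decoder = at_decoder      # AT-Decoder: fuses low-level and high-level features
        self.classifier = nn.Conv2d(decoder_channels, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        # assumption: the backbone returns both a low-level map and its final feature map
        low_level, encoder_out = self.backbone(x)          # step 2
        semantic = self.raspp(encoder_out)                 # step 3
        fused = self.at_decoder(low_level, semantic)       # step 4
        logits = self.classifier(fused)
        # step 5: bilinear upsampling (4x) back to the original input resolution
        return F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
```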
The invention has the advantages and beneficial effects that:
the invention can realize a higher-precision segmentation result in a complex street view segmentation task, and has excellent overall segmentation performance especially in images with different illumination intensities and category diversity phenomena. Has wide application in the fields of true three-dimensional display, unmanned driving, auxiliary medical treatment and the like.
Drawings
Fig. 1 is a schematic diagram of the overall network.
Fig. 2 is a schematic diagram of the gridding effect caused by hole (dilated) convolution.
Fig. 3 is the R-ASPP module structure.
Fig. 4 is the AT-Decoder module structure: (a) the overall structure of the AT-Decoder and (b) the structure of the channel attention module (Channel Attention) based on high-level semantic features.
Fig. 5 shows two examples of scene images used in the experiments: (a) a daytime scene image and (b) a dusk scene image.
FIG. 6 is a comparison of visual segmentation results on the CamVid test set; group (a) illustrates over-segmentation and group (b) illustrates under-segmentation. Both groups are shown in left-to-right order: the first column is the original image, the second column the ground truth, the third column the segmentation result of the ENet method, the fourth column that of the SegNet method, the fifth column that of the DeepLabv3+ method, and the sixth column the segmentation result of the invention.
FIG. 7 is a comparison of visual segmentation results on the Cityscapes test set; group (a) illustrates over-segmentation and group (b) illustrates under-segmentation. Both groups are shown in left-to-right order: the first column is the original image, the second column the ground truth, the third column the segmentation result of the PSPNet method, the fourth column that of the DeepLabv3+ method, and the fifth column the segmentation result of the invention.
Fig. 8 is a flow chart of an embodiment of the invention.
Detailed Description
The detailed features and advantages of the present invention will be readily apparent to those skilled in the art from the following detailed description, taken in conjunction with the accompanying drawings, wherein fig. 1 illustrates an overall network structure of the present invention and a process flow of the present invention is shown in fig. 8.
The invention provides a method for semantically segmenting a complex street view image, which comprises the following specific operation steps:
Step 101, obtaining a color image to be processed.
The experimental data of the invention are all complex street scene images, covering different times of day and weather conditions, captured by a camera mounted on an automobile dashboard.
Step 201, acquiring low-level image features by using a DCNN deep convolutional neural network.
The invention adopts a 65-layer Xception network as the deep convolutional neural network to obtain low-level image features containing more detail information.
Step 301, the R-ASPP module obtains sufficient high-level semantic geometry and texture information for the image.
According to the design philosophy of image semantic segmentation networks, the output image must keep the same size as the input image. Pooling operations in the network enlarge the receptive field, but at the same time reduce image resolution. Researchers further explored hole (atrous) convolution to avoid this drawback, but it introduces a new problem, the gridding effect, shown in fig. 2, which is a major difficulty for pixel-level semantic segmentation. To address the gridding effect, the Google team proposed the ASPP module in the DeepLabv3+ method. ASPP maps features through five branches: hole convolutions at four different sampling rates plus one global average pooling branch. The hole convolutions with different sampling rates effectively capture multi-scale information, while the global average pooling acquires global information. In general, the encoder side of an image semantic segmentation network obtains high-level semantic information such as image geometry and texture, providing effective assistance for the subsequent accurate segmentation. Therefore, the image to be segmented is input at the front of the network, and the DCNN effectively extracts low-level image features containing position information. In addition, to maximise the high-level semantic geometric and texture information, the invention proposes the R-ASPP method built on the ASPP module: besides the global average pooling branch, a residual block consisting of an ordinary 3×3 convolution and a long skip connection is set up on each of the four hole convolution branches, as shown in fig. 3. The basic calculation process of the R-ASPP module is as follows. The input is the output D_fm of the DCNN module; fm_11, fm_21, fm_31, fm_41 and fm_51 are the feature maps obtained by the first convolution of the five branches. The branches other than the global pooling branch then apply a 3×3 ordinary convolution to further learn important content information in the feature maps, giving fm_12, fm_22, fm_32 and fm_42. To effectively fuse the original R-ASPP input carried to this point with the deeper features just obtained, a long skip connection is used, giving the branch output feature maps fm_1, fm_2, fm_3, fm_4 and fm_5. Letting n denote a branch, with n ranging over {1, …, 4}, the output of each branch other than global average pooling can be expressed as:
$fm_n = fm_{n2} + D_{fm}$
Finally, the outputs of the five branches are concatenated along the channel dimension; denoting the output of the R-ASPP part by F_a, F_a satisfies:
$F_a = fm_1 + fm_2 + fm_3 + fm_4 + fm_5$
In this way the image features acquired by each branch are fused, providing the decoder with input containing rich high-level semantic information.
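The R-ASPP computation described above can be sketched in PyTorch as follows. The dilation rates (1, 6, 12, 18), channel widths and normalization layers are assumptions chosen for a runnable example; the patent text only fixes the structure of four atrous branches, each followed by a 3×3 refinement convolution and a long skip back to the block input, plus a global average pooling branch, with the five outputs concatenated along the channel dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RASPP(nn.Module):
    def __init__(self, channels: int, rates=(1, 6, 12, 18)):
        super().__init__()
        # four hole-convolution branches (first convolution of each branch)
        self.atrous = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for r in rates])
        # the extra 3x3 "ordinary" convolution applied after each atrous branch
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in rates])
        # fifth branch: global average pooling
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.ReLU(inplace=True))

    def forward(self, d_fm):
        outs = []
        for atrous, refine in zip(self.atrous, self.refine):
            fm_n1 = atrous(d_fm)           # fm_n1: hole convolution
            fm_n2 = refine(fm_n1)          # fm_n2: further 3x3 convolution
            outs.append(fm_n2 + d_fm)      # long skip connection: fm_n = fm_n2 + D_fm
        g = self.global_pool(d_fm)
        g = F.interpolate(g, size=d_fm.shape[-2:], mode='bilinear', align_corners=False)
        outs.append(g)                     # fm_5: global pooling branch
        return torch.cat(outs, dim=1)      # F_a: channel-wise concatenation of the five branches
```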
In step 401, the AT-Decoder module uses different network structures to process and fuse the features of the various parts.
The AT-Decoder module mainly comprises three parts, the DF, DC and DD branches, whose inputs are, respectively, the output DF_a of the DCNN part of the network, the output DD_fm of the R-ASPP module, and the channel-dimension concatenation DC_1 of the two. The specific network structure of the AT-Decoder is shown in fig. 4: fig. 4 (a) is the overall structure of the AT-Decoder, and the structure of the channel attention module (Channel Attention) based on high-level semantic features is shown in fig. 4 (b). As can be seen, a different network structure is adopted in each branch to process the corresponding features before fusion is performed. Finally, the network restores the image to the original resolution by bilinear interpolation, giving the final segmentation result.
Step 401-1, DF branch processing.
As shown in fig. 4 (a), the input of the first branch DF in the AT-Decoder is the low-level feature DF_a containing detail information learned by the DCNN. In convolutional neural networks, the low-level features acquired by the convolutional layers contain the edge and detail information of the image, but also carry a lot of background information that can interfere with the performance of the segmentation network. Therefore, to retain more useful image information, a simple spatial attention model is designed in this branch: a 3×3 convolution operation further learns the main features, and a softmax function classifies them, highlighting important detail position features. The output of this branch is DF_2.
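A minimal sketch of one way the DF branch's simple spatial attention could be realized is given below; applying the softmax over the spatial positions of a single-channel score map and using it to re-weight the convolved low-level features is an assumption, since the patent does not specify tensor shapes.

```python
import torch
import torch.nn as nn


class DFBranch(nn.Module):
    """Simple spatial attention over the low-level feature map DF_a."""

    def __init__(self, low_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(low_channels, low_channels, 3, padding=1)  # learn main features
        self.score = nn.Conv2d(low_channels, 1, 1)                        # single-channel attention logits

    def forward(self, df_a):
        feat = self.conv(df_a)
        b, _, h, w = feat.shape
        attn = torch.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # softmax over positions
        attn = attn.view(b, 1, h, w)
        return feat * attn  # DF_2: detail features with important positions highlighted
```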
Step 401-2, DC branch processing.
As shown in fig. 4 (a), the input of the second branch DC in the AT-Decoder is the concatenation, in the channel dimension, of the output feature map DF_a of the DCNN and the output feature map DD_fm of the R-ASPP; this branch considers how to effectively fuse low-level detail information with high-level semantic information. In order not to lose the image features already learned in the network, a 3×3 convolution is applied directly on top of this channel-dimension concatenation to extract features containing accurate position information and complete geometric and texture information, providing effective help for obtaining an accurate segmentation result. The output of this branch is DC_2.
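A sketch of the DC branch, under the assumption that the R-ASPP output is first resized to the spatial size of the low-level map (a step required for concatenation but not spelled out in the text); channel counts are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DCBranch(nn.Module):
    """Concatenate low-level and R-ASPP features and fuse them with one 3x3 convolution."""

    def __init__(self, low_channels: int, high_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(low_channels + high_channels, out_channels, 3, padding=1)

    def forward(self, df_a, dd_fm):
        # resize the coarser R-ASPP output to the low-level map's spatial size before concatenation
        dd_fm = F.interpolate(dd_fm, size=df_a.shape[-2:], mode='bilinear', align_corners=False)
        dc_1 = torch.cat([df_a, dd_fm], dim=1)  # splice in the channel dimension (DC_1)
        return self.fuse(dc_1)                   # DC_2
```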
Step 401-3, DD branch processing.
As shown in fig. 4 (a), the input of the third branch DD in the AT-Decoder is the output of the encoder module, i.e. the high-level semantic features DD_fm. To preserve more semantic information, a channel-based attention module is first designed in this branch, aiming at the relationships between feature-map channels; its structure is shown in fig. 4 (b). The module consists of two sub-branches, a max-pooling sub-branch and an average-pooling sub-branch. Average pooling describes the features globally and feeds back every pixel in the feature map, whereas during gradient back-propagation max pooling feeds gradient back only to the pixel with the largest response, so the two complement each other. Denoting the average-pooled feature by DD_fm1 and the max-pooled feature by DD_fm3, they can be expressed as:
$DD_{fm1} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} DD_{fm}(i, j)$

$DD_{fm3} = \max_{1 \le i \le H,\ 1 \le j \le W} DD_{fm}(i, j)$
where H × W is the size of the input feature map, (i, j) denotes the pixel in the i-th row and j-th column, i ranges over {1, …, H}, and j ranges over {1, …, W}. Fully connected layers are then added after the max-pooling and average-pooling layers respectively to fuse the feature maps across channels, giving DD_fm2 and DD_fm4. Finally, the two sub-branches are fused to obtain channel features containing more image information, i.e. the output DD_1 of the channel attention module:
$DD_1 = w_1 DD_{fm1} + w_2 DD_{fm3}$
where $w_1$ and $w_2$ denote the fully connected layer operations applied to the pooled features.
After DD_1, which contains richer channel feature information, is obtained in the DD branch, a softmax function classifies this information to obtain the most important feature information DD_2, which helps to obtain the final accurate segmentation result.
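The channel attention of the DD branch can be sketched as follows; returning DD_2 as a per-channel weight tensor that later re-scales other feature maps is an interpretation, since the patent does not state the exact shape of DD_2.

```python
import torch
import torch.nn as nn


class DDBranch(nn.Module):
    """Channel attention over the high-level semantic features DD_fm."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc_avg = nn.Linear(channels, channels)  # fully connected layer after average pooling
        self.fc_max = nn.Linear(channels, channels)  # fully connected layer after max pooling

    def forward(self, dd_fm):
        b, c, _, _ = dd_fm.shape
        dd_fm1 = dd_fm.mean(dim=(2, 3))   # global average pooling -> (b, c)
        dd_fm3 = dd_fm.amax(dim=(2, 3))   # global max pooling     -> (b, c)
        dd_fm2 = self.fc_avg(dd_fm1)      # DD_fm2
        dd_fm4 = self.fc_max(dd_fm3)      # DD_fm4
        dd_1 = dd_fm2 + dd_fm4            # fuse the two sub-branches -> DD_1
        dd_2 = torch.softmax(dd_1, dim=1) # per-channel weights       -> DD_2
        return dd_2.view(b, c, 1, 1)      # broadcastable over the spatial dimensions
```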
Step 401-4, feature fusion.
After feature enhancement in all three branches of the AT-Decoder, the next step is to fuse this information so that the fused features contain both sufficient low-level detail information and sufficiently rich high-level semantic information. In this module, the three branches are therefore fused in a stepwise manner, as shown in the rightmost column of fig. 4 (a). First, the output DF_2 of the DF branch is multiplied by the output DC_2 of the DC branch to obtain a feature map B_1 that contains both low-level and high-level information:
$B_1 = DF_2 \times DC_2$
B_1 is then multiplied by the output DD_2 of the DD branch to further fuse the high-level and low-level information, giving the feature map B_2:
$B_2 = B_1 \times DD_2$
Thereafter, B_2 is added to the output of the DC branch to obtain the feature map B_3, fully fusing the feature maps of the three branches into the fused feature map B_3:
$B_3 = B_2 + B_1$
Finally, a 3 × 3 convolution further learns features from the fused feature map, giving the feature map B_4, which at this point contains a large amount of image feature information.
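A sketch of the stepwise fusion following the printed equations (B1 = DF2 × DC2, B2 = B1 × DD2, B3 = B2 + B1, then a 3×3 convolution). Note that the prose above mentions adding the DC-branch output at the third step while the printed formula uses B1; the formula is followed here. Equal spatial sizes and channel counts for DF2 and DC2, and a broadcastable DD2, are assumptions.

```python
import torch
import torch.nn as nn


class StepwiseFusion(nn.Module):
    """Fuses the DF, DC and DD branch outputs step by step (B1, B2, B3, B4)."""

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)  # final 3x3 feature-learning step

    def forward(self, df_2, dc_2, dd_2):
        b_1 = df_2 * dc_2        # B1: low-level detail gated by the fused DC features
        b_2 = b_1 * dd_2         # B2: modulated by the channel attention weights DD_2
        b_3 = b_2 + b_1          # B3: additive fusion, as in the printed formula
        return self.refine(b_3)  # B4
```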
Step 501, bilinear interpolation upsampling processing.
After the low-level detail feature information and the high-level semantic information in the network have been sufficiently fused, the acquired image features are upsampled to the same size as the original input image. Typically, in convolutional neural networks, upsampling is performed by bilinear interpolation, deconvolution or unpooling. Bilinear interpolation, which performs linear interpolation in two directions of a function of two variables, is widely used in the image field. Deconvolution, also known as transposed convolution, can be understood as a special way of convolving: the image is first enlarged by zero padding in proportion, the convolution kernel is rotated, and then the same operation as a forward convolution is performed. Unpooling generally refers to the inverse operation of max pooling: the position of the maximum value in each pooled region is recorded during max pooling, and unpooling places the maximum back at that position while filling the remaining positions with zeros to enlarge the feature map; this still loses information when recovering the main content of the image features.
After analysing the three upsampling methods, bilinear interpolation is adopted in the network. Because the fused low-level/high-level output B_4 is 1/4 the size of the original input image, a bilinear interpolation upsampling operation is needed; it restores the image to the original resolution while retaining more image feature content, giving the final segmentation result image B_5.
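For illustration, the 4× bilinear upsampling of step 501 corresponds to a single call in PyTorch; the tensor shape below is only an example.

```python
import torch
import torch.nn.functional as F

b4 = torch.randn(1, 32, 128, 256)  # fused feature map B4 at 1/4 resolution (example shape)
b5 = F.interpolate(b4, scale_factor=4, mode='bilinear', align_corners=False)
print(b5.shape)  # torch.Size([1, 32, 512, 1024]) -- restored to the original resolution
```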
The invention is trained on public street view data, mainly the CamVid dataset and the Cityscapes dataset. The CamVid dataset, annotated by Cambridge University, contains about 600 images covering 11 categories such as roads, buildings, cars and pedestrians; 367 images are used for training and a further 233 for testing. To ensure the validity of the evaluation, images taken in both daytime and dusk scenes were deliberately chosen, as shown in fig. 5. The Cityscapes dataset, released by Mercedes-Benz in 2015, is one of the most authoritative and professional image semantic segmentation benchmarks in the computer vision field. Cityscapes focuses on understanding urban road environments in real scenes; its tasks are harder and better suited to evaluating the performance of vision algorithms on complex street view semantic understanding. It covers street scenes from 50 cities with different scenes, backgrounds and seasons, providing 5000 finely annotated images, 20000 coarsely annotated images and 30 annotated object classes. Cityscapes offers two evaluation settings, fine and coarse: the former provides the 5000 finely annotated images, while the latter adds the 20000 coarsely annotated images; the fine setting is used in this patent.
The most important evaluation metric in semantic segmentation is the mean intersection over union (Mean Intersection over Union, MIoU), which evaluates the network model by computing the intersection over union between the ground truth (GT) and the predicted segmentation result. IoU is computed per class, and the IoU values of all classes are averaged to obtain MIoU:
$MIoU = \frac{1}{k} \sum_{m=1}^{k} \frac{p_{mm}}{\sum_{q=1}^{k} p_{mq} + \sum_{q=1}^{k} p_{qm} - p_{mm}}$
where k is the number of categories, m denotes the true class, q denotes the predicted class, and p_mq is the number of pixels of class m predicted as class q.
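A small NumPy sketch of how MIoU can be computed from integer label maps via a confusion matrix, matching the formula above:

```python
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred and gt are integer label maps of the same shape."""
    mask = gt < num_classes                               # ignore invalid / void labels
    conf = np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    diag = np.diag(conf).astype(float)                    # p_mm: correctly classified pixels
    union = conf.sum(axis=1) + conf.sum(axis=0) - diag    # sum_q p_mq + sum_q p_qm - p_mm
    iou = diag / np.maximum(union, 1)                     # per-class IoU, guarding against /0
    return float(iou.mean())                              # MIoU (classes absent from gt count as 0)
```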
To verify the effectiveness of the proposed complex street view image semantic segmentation method, it is compared with other methods through actual computation and analysis on the CamVid dataset and the Cityscapes dataset, and the analysis results are contrasted.
Camvid dataset contrast analysis
The results of the comparative analysis on the CamVid dataset are shown in Table 1 below. As can be seen from Table 1, the accuracy of the invention is improved by 1.2% compared with DeepLabv3+, the best of the other methods.
TABLE 1
[Table 1: segmentation accuracy comparison of different methods on the CamVid dataset (table provided as an image in the original document).]
Cityscapes dataset contrast analysis
Table 2 below shows the analysis results of the method on the Cityscapes dataset. From the data in Table 2, the accuracy of the method of the present invention is improved by 1.3% compared with the current mainstream method DeepLabv3+.
TABLE 2
[Table 2: segmentation accuracy comparison of different methods on the Cityscapes dataset (table provided as an image in the original document).]
Camvid dataset visual contrast analysis
As shown in fig. 6, two sets of segmentation results on CamVid complex street view images are presented. The first column is the original image from the CamVid test data, the second column the Ground Truth, the third column the result of the ENet method, the fourth column the result of SegNet, the fifth column the result of DeepLabv3+, and the last column the segmentation result of the invention. For ease of viewing, red, yellow and white boxes are added to the figures to indicate positions where the segmentation of the invention is significantly improved. Group (a) illustrates over-segmentation: the object inside each colored box belongs to a single class, but because its pixel values differ too much from those of surrounding objects it is segmented into two or more classes; for example, the pedestrian in the red box and the building in the yellow box of the first image are both over-segmented and mis-segmented. Group (b) illustrates under-segmentation: because the pixel values of the object inside each colored box differ too little from those of surrounding objects, it is merged into the same class as them; for example, the building in the red box and the vehicle in the yellow box of the first image of group (b) are both under-segmented and mis-segmented. The visual results in fig. 6 show that the invention clearly improves the under-segmentation and over-segmentation problems encountered when segmenting complex street view images.
Visual contrast analysis of cityscapes dataset
As shown in fig. 7, the first column is the original image from the Cityscapes test data, the second column the Ground Truth, the third column the result of the PSPNet method, the fourth column the result of DeepLabv3+, and the fifth column the segmentation result of the invention, with red and white boxes marking the positions of significant improvement. The images of group (a) in fig. 7 show over-segmentation results: the objects in the white boxes, such as the trees in front of the building and the rubbish bin beside the car, are affected by the surrounding environment, and because their pixel values differ greatly from those of surrounding objects they are over-segmented. The images of group (b) show under-segmentation results, such as the street lamp in the white box of the first image and the signboard in the white box of the second image; in the overall street view, the street lamp is a distant object close to the sky and is very easily ignored during segmentation. Similarly, because the pixel values of the roadside signboard and its pole differ only slightly, under-segmentation and mis-segmentation are easily caused. The visual results in fig. 7 demonstrate that the invention has a significant effect in alleviating the under-segmentation and over-segmentation problems.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any new method that uses a global convolutional neural network to semantically segment complex street view images falls within the protection scope of the technical concept set forth in the invention, and any substitution or modification of the technical scheme and inventive concept made by a person skilled in the art within the technical scope disclosed by the invention, or any obvious combination with the prior art, likewise falls within the protection scope of the invention.

Claims (3)

1. A complex street view image semantic segmentation method comprises the following steps:
step 1, obtaining an image to be processed;
step 2, extracting low-level features containing position information by using DCNN;
step 3, acquiring enough high-level semantic geometry and texture information of the image by adopting an R-ASPP method; the specific operation comprises the following steps:
considering that multi-scale image features can contain more feature information, converting the DCNN output image features obtained in step 2 into five branches through a spatial pyramid pooling module; applying a 3×3 ordinary convolution in each branch other than the global pooling branch to further learn important content information in the feature map, carrying the original input of the R-ASPP forward via a long skip connection and fusing it with the image features acquired by the 3×3 ordinary convolution, and concatenating the outputs of the five branches along the channel dimension as the output of the R-ASPP part;
step 4, adopting the different network structures of the AT-Decoder module to process and fuse the features of each part; the specific operation comprises the following steps:
adopting the three different network structures of the AT-Decoder module, namely the DF branch, the DC branch and the DD branch, to process the features of each part respectively, and finally fusing the results obtained by each branch in sequence in the channel dimension; firstly, multiplying the output of the DF branch by the output of the DC branch to obtain B2, then adding B2 and the output of the DC branch to obtain a feature map B3, and finally further learning features of the fused feature map with a 3×3 convolution to obtain a feature map B4; wherein,
the DF branch is adopted to process the low-level features containing detail information learned by the DCNN: a simple spatial attention model is designed, a 3×3 convolution operation further learns the main features, and a softmax function classifies the features;
the DC branch is adopted to concatenate the output feature map of the DCNN and the output feature map of the R-ASPP in the channel dimension, and a 3×3 convolution directly extracts features containing accurate position information and complete geometric and texture information;
the DD branch is adopted to process the high-level semantic features output by the encoder module: a channel-based attention module is designed to attend to the relationships between feature-map channels, the attention module consisting of two sub-branches, namely a maximum pooling sub-branch and an average pooling sub-branch; a fully connected layer is then added to fuse the feature maps across channels; finally, the result feature maps obtained by the maximum pooling and average pooling sub-branches are fused to obtain channel features containing more image information;
and step 5, obtaining a final segmentation result image by adopting bilinear interpolation up-sampling operation.
2. The complex street view image semantic segmentation method according to claim 1, wherein:
the average pooling sub-branch describes the features globally and feeds back every pixel in the feature map, while during gradient back-propagation only the pixel with the largest response in the feature map receives gradient feedback from the maximum pooling sub-branch, so that the two sub-branches complement each other.
3. The complex street view image semantic segmentation method according to claim 1 or 2, wherein the operation of step 5 comprises:
and (3) adopting bilinear interpolation up-sampling operation, and restoring the image to the original image resolution while maintaining more image characteristic contents to obtain a final segmentation result image B5.
CN202010389518.4A 2020-05-10 2020-05-10 Semantic segmentation method for complex street view image Active CN111563909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010389518.4A CN111563909B (en) 2020-05-10 2020-05-10 Semantic segmentation method for complex street view image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010389518.4A CN111563909B (en) 2020-05-10 2020-05-10 Semantic segmentation method for complex street view image

Publications (2)

Publication Number Publication Date
CN111563909A CN111563909A (en) 2020-08-21
CN111563909B true CN111563909B (en) 2023-05-05

Family

ID=72072007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010389518.4A Active CN111563909B (en) 2020-05-10 2020-05-10 Semantic segmentation method for complex street view image

Country Status (1)

Country Link
CN (1) CN111563909B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163111B (en) * 2020-09-28 2022-04-01 杭州电子科技大学 Rotation-invariant semantic information mining method
CN112233129B (en) * 2020-10-20 2023-06-27 湘潭大学 Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN112396607B (en) * 2020-11-18 2023-06-16 北京工商大学 Deformable convolution fusion enhanced street view image semantic segmentation method
CN112508977A (en) * 2020-12-29 2021-03-16 天津科技大学 Deep learning-based semantic segmentation method for automatic driving scene
CN112967293A (en) * 2021-03-04 2021-06-15 首都师范大学 Image semantic segmentation method and device and storage medium
CN113011429B (en) * 2021-03-19 2023-07-25 厦门大学 Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN113011336B (en) * 2021-03-19 2022-05-27 厦门大学 Real-time street view image semantic segmentation method based on deep multi-branch aggregation
CN113392711B (en) * 2021-05-19 2023-01-06 中国科学院声学研究所南海研究站 Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN114299331B (en) * 2021-12-20 2023-06-27 中国地质大学(武汉) Urban bicycle lane type detection method and system based on street view picture
CN115035299B (en) * 2022-06-20 2023-06-13 河南大学 Improved city street image segmentation method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175613B (en) * 2019-06-03 2021-08-10 常熟理工学院 Streetscape image semantic segmentation method based on multi-scale features and codec model
CN110782462B (en) * 2019-10-30 2022-08-09 浙江科技学院 Semantic segmentation method based on double-flow feature fusion

Also Published As

Publication number Publication date
CN111563909A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563909B (en) Semantic segmentation method for complex street view image
Zhang et al. CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
Zhang et al. LAANet: lightweight attention-guided asymmetric network for real-time semantic segmentation
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN113688836A (en) Real-time road image semantic segmentation method and system based on deep learning
CN111652081A (en) Video semantic segmentation method based on optical flow feature fusion
CN116229461A (en) Indoor scene image real-time semantic segmentation method based on multi-scale refinement
CN114120272A (en) Multi-supervision intelligent lane line semantic segmentation method fusing edge detection
Singha et al. A real-time semantic segmentation model using iteratively shared features in multiple sub-encoders
CN116051977A (en) Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm
CN113422952A (en) Video prediction method based on space-time propagation hierarchical coder-decoder
CN114445442B (en) Multispectral image semantic segmentation method based on asymmetric cross fusion
Liu et al. Road segmentation with image-LiDAR data fusion in deep neural network
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN116258756B (en) Self-supervision monocular depth estimation method and system
CN116563553A (en) Unmanned aerial vehicle image segmentation method and system based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN115527096A (en) Small target detection method based on improved YOLOv5
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
Yang et al. Semantic segmentation for autonomous driving
CN116229410A (en) Lightweight neural network road scene detection method integrating multidimensional information pooling
CN112488115B (en) Semantic segmentation method based on two-stream architecture
CN112634289B (en) Rapid feasible domain segmentation method based on asymmetric void convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant