CN109461157B - Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field - Google Patents

Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Info

Publication number
CN109461157B
Authority
CN
China
Prior art keywords: layer, image, convolution, conditional random, gaussian
Prior art date
Legal status: Active
Application number
CN201811218436.2A
Other languages
Chinese (zh)
Other versions
CN109461157A (en)
Inventor
Zhou Pengcheng (周鹏程)
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201811218436.2A
Publication of CN109461157A
Application granted
Publication of CN109461157B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention discloses an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field, comprising the following steps: 1) constructing an image pyramid; 2) keeping the resolution of the feature maps unchanged by using hole convolution; 3) fusing and tuning multilevel features layer by layer; 4) upsampling by bilinear interpolation; 5) defining a loss function; 6) refining the output with a Gaussian conditional random field. The method constructs an image pyramid to realize a fully convolutional architecture in which multilevel features are fused layer by layer, replacing the popular parallel pooling module with a top-down tuning framework: features at different scales are obtained and fused layer by layer, fusion between adjacent pyramid layers is given priority, and context information is captured to the maximum extent. A Gaussian conditional random field further refines the front-end output, capturing more spatial detail so that object boundaries in the segmentation map are more accurate; the output of the overall architecture finally achieves the optimal semantic segmentation result.

Description

Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
Technical Field
The invention relates to an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field.
Background
Early image segmentation merely divided the content of an image into several coarse regions. As research progressed, such coarse segmentation could no longer meet the requirements of many applications, and semantic segmentation was proposed. The semantics of an image refer to semantic information such as the category of the objects or entities contained in the image or an image region; image segmentation at the semantic level is called semantic segmentation. Image semantic segmentation separates foreground from background in a single frame and identifies the category of each foreground target, which amounts to assigning a semantic label to every pixel. Semantic segmentation is thus a major upgrade of image segmentation in both precision and fineness.
The purpose of semantic segmentation is to classify each pixel of an image into one of a set of predetermined category labels, so that regions are not only segmented but also labeled with their content. In reality, more and more vision applications need to infer knowledge or semantics from imagery, i.e., a concrete-to-abstract process. These applications include autonomous driving, human-computer interaction, computational photography, image search engines, and augmented reality; the real need behind them is an accurate and efficient segmentation technique.
The most advanced image semantic segmentation models currently combine deep learning with probabilistic graphical models. The front end of such a model is based on a deep convolutional neural network originally designed for image classification, with the final fully connected layers replaced by convolutional layers; the result, called a fully convolutional network, can semantically segment images of arbitrary size. At the back end, a fully connected conditional random field refines the coarse semantic features extracted by the convolutional network, enhancing the model's ability to capture spatial detail. Macroscopically, the fully convolutional network is more of a technique than a complete solution, and continues to advance as the performance of the underlying backbone network improves. Deep learning plus a probabilistic graphical model is a trend: deep learning performs feature extraction, while the probabilistic graphical model explains the essential relationships between things from a mathematical perspective.
Specifically, image semantic segmentation methods based on fully convolutional networks use an existing convolutional neural network as one of their modules to generate hierarchical features, converting well-known classification models into fully convolutional ones: all fully connected layers are replaced by convolutional layers, and spatial maps rather than classification scores are output. These maps are upsampled by fractionally strided convolution (deconvolution) to produce dense pixel-level labels. The work of Long et al. is considered a milestone because it showed that convolutional neural networks can be trained end to end on the semantic segmentation problem and can efficiently learn to produce pixel-level label predictions from inputs of arbitrary size. Vijay et al. propose SegNet, which introduces more skip connections and reuses max-pooling indices rather than copying encoder features as in a fully convolutional network, making SegNet more memory-efficient. Fisher et al. propose hole (dilated) convolution as the convolutional layer for pixel-level prediction, which grows the receptive field exponentially without reducing the spatial dimension. Zhao et al. argue that the global scene provides category distribution information for image semantic segmentation and propose the pyramid pooling module, which gathers this information with large-kernel pooling layers.
the method based on the deep convolutional nerve seems to be inherently difficult to obtain between classification performance and positioning precision, the strong invariance of the convolutional network to image space transformation enables the convolutional network to accurately predict the existence and rough position information of a target, the segmentation result is often not harmonious enough, and some small regions may have incorrect labels which are inconsistent with the labels of surrounding pixels, so that the boundary of an object cannot be accurately drawn. Also, the cross-entropy loss function is not the most ideal loss function for semantic segmentation, because the final loss value of the image is just the superposition of the loss values of each pixel, and the cross-entropy loss cannot guarantee the continuity of the pixels. A general method for optimizing the output of a segmentation architecture and enhancing the capture of fine-grained information thereof is to introduce a conditional random field as a back-end processing module thereof, and to use a fully-connected conditional random field model as an independent post-processing step in the flow thereof, so as to optimize the segmentation result. While a fully connected model is generally inefficient, the model may be relatively efficient because it may be approximated by probabilistic reasoning. Conditional random fields exploit the similarity between pixels in the original image.
Modern architectures for image semantic segmentation mainly suffer from two problems:
1. Context information is lost during front-end feature extraction: in current semantic segmentation based on fully convolutional networks, the output image should have the same resolution as the input. However, successive pooling and striding operations shrink the resulting feature maps, which become smaller and smaller, and the feature maps must then be upsampled by deconvolution to restore the original size and spatial resolution. This process inevitably loses information that cannot be recovered, and the frequent deconvolution operations also cost extra memory and time. Semantic segmentation therefore needs to integrate information at multiple scales and to balance local and global information. On the one hand, fine-grained or local information is critical for correct pixel-level labeling; on the other hand, integrating the global context of the image is important for resolving local ambiguities. Standard convolutional architectures handle this balance poorly: pooling layers give the network some spatial invariance at constant computational efficiency but lose global context, while a convolutional network without pooling is also limited, since a neuron's receptive field can only grow linearly with the number of layers.
2. The back-end model is difficult to converge: a conditional random field is introduced at the back end to refine the output of the segmentation framework and enhance its ability to capture detail. However, both the commonly used Gibbs conditional random field and the Markov random field suffer from complex formulations and difficult inference, which often makes model training converge slowly, with difficulty, or not at all.
Disclosure of Invention
The invention aims to provide an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field.
The technical scheme of the invention is as follows: an image semantic segmentation method based on multi-stage feature fusion and Gaussian conditional random fields comprises the following steps:
1) constructing an image pyramid
Construct a four-layer image pyramid for a single-frame image, with the layers numbered from bottom to top and the i-th layer denoted G_i; to generate the (i+1)-th layer of the pyramid, convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
then, starting from the input original image G_0, the above process is iterated to finally generate the whole image pyramid;
2) preserving feature map resolution using hole convolution
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension; standard convolution corresponds to the special case of hole rate r = 1;
3) multilevel feature layer-by-layer fusion tuning framework
Generate from the original image an image pyramid containing images of different resolutions; starting from the lowest-resolution image at the top layer, generate a feature map containing local information through several full convolution operations, then upsample this feature map to the same resolution as the adjacent layer's image and stack it with that layer's initial feature map so that it participates in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom; a 1×1 convolution operation and fine-tuning are applied at the last layer in order to obtain the final segmentation map from the feature map;
4) upsampling using bilinear interpolation
Upsample the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features;
5) defining a loss function
A minimized cross-entropy loss function is adopted, computing the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
then a network model is fine-tuned on the segmentation task of the PASCAL VOC 2012 data set by performing stochastic gradient descent on the cross-entropy loss function;
6) optimized output of the Gaussian conditional random field
A Gaussian conditional random field is used as the back end for refinement, its energy function E(x) being:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A+λI)x=B (12)
further, in the present invention, in step 2), the hole convolution allows to adaptively modify the size of the receptive field by changing the hole rate r, and the size of the receptive field can be expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate.
Further, in the present invention, in step 4), the bilinear interpolation method computes the target pixel value from four points of the original image: given the f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2), the value of the unknown function f at the point P = (x, y) is calculated, specifically comprising:
firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
Compared with the prior art, the invention has the following advantages:
1) The invention provides a tuning strategy in which multilevel features are fused layer by layer: an image pyramid performs multi-scale pixel sampling of the original image to generate several images of different resolutions; the lowest-resolution image at the top layer is input to the network front end for coarse extraction of semantic features, and the result is then upsampled until it matches the higher-resolution image of the next layer down, entering that layer to join the new-resolution image as input to the network front end. This layer-by-layer fusion, learning hierarchical abstraction at earlier layers and capturing high-precision information at later layers, effectively increases the model's ability to capture context information.
2) At the back end the invention uses, among the fully connected conditional random fields, the Gaussian conditional random field, which has the obvious advantage of a quadratic energy with a global solution, effectively alleviating slow or even failed convergence during model training.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a basic block diagram of the process of the present invention;
FIG. 2 is a schematic diagram of an image pyramid structure according to the present invention;
FIG. 3(a) is a schematic diagram of a 3×3 hole convolution with hole rate 1 according to the present invention;
FIG. 3(b) is a schematic diagram of a 3×3 hole convolution with hole rate 2 according to the present invention;
FIG. 3(c) is a schematic diagram of a 3×3 hole convolution with hole rate 4 according to the present invention;
FIG. 4 is a schematic diagram of a multi-level feature layer-by-layer fusion architecture according to the present invention;
FIG. 5 is a diagram illustrating bilinear interpolation according to the present invention;
FIG. 6 is a schematic representation of a fully connected conditional random field according to the present invention;
FIG. 7(a) is a schematic diagram of an iterative process for penalizing the regularization term in the present invention;
FIG. 7(b) is a regression loss diagram of the segmentation model in the present invention;
FIG. 7(c) is a schematic of the sum of the regularization and regression losses in the present invention;
FIG. 8 is a graph comparing score against speed on PASCAL VOC 2012 in the present invention;
FIG. 9 is a graph comparing visualized results on the PASCAL VOC 2012 data in the present invention.
Detailed Description
Example:
the method is a specific implementation mode of the image semantic segmentation method based on the multilevel feature fusion and the Gaussian conditional random field, and a basic frame diagram of the method is shown in figure 1. In addition, the output of the front end is further optimized by using the Gaussian conditional random field, more space details are captured, and the object boundary in the segmentation effect graph is more accurate. And finally, outputting the whole framework to obtain the optimal semantic segmentation effect.
The initial feature map of each layer of the image pyramid is computed, concatenated with the upsampled feature map of the layer above, and fed into the remaining full convolution operations of that layer; the final semantic segmentation map is obtained through the last full convolution operation at the bottom of the pyramid. The method comprises: constructing an image pyramid at the front end, keeping the feature-map resolution unchanged using hole convolution, fusing and tuning multilevel features layer by layer, upsampling by bilinear interpolation, defining a loss function, and refining the output with a Gaussian conditional random field at the back end.
1. Extraction of rough semantic features by front-end full convolution network
1) Constructing an image pyramid
The SIFT algorithm acquires feature information at different scales by constructing an image pyramid; this method likewise constructs a four-layer image pyramid for a single-frame image. A pyramid can be imagined as a set of layers: the higher the layer, the smaller its size.
As shown in FIG. 2, each layer is numbered from bottom to top, the i-th layer being denoted G_i; the (i+1)-th layer (denoted G_(i+1)) is therefore smaller than the i-th layer (denoted G_i). To generate the (i+1)-th layer of the image pyramid, we convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
it can be easily noted that the image generated will be one quarter of its predecessors. In inputting the original image G0And iterating the above process to finally generate the whole image pyramid.
2) Preserving feature map resolution using hole convolution
The hole convolution method, originally used in wavelet transform analysis in signal processing, extracts dense features by removing the downsampling operations of the last few layers of a fully convolutional network together with the corresponding upsampling operations. The resolution of the feature maps in the convolutional neural network can thus be controlled effectively without learning extra parameters. This avoids the problem of standard convolutional networks, in which the feature maps shrink progressively and must be upsampled by deconvolution to restore their original size and spatial resolution, inevitably losing information, while the frequent deconvolution operations also cost extra memory and time.
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension, hence the name hole convolution. Standard convolution corresponds to the special case of hole rate r = 1.
The hole convolution allows us to adaptively modify the size of the receptive field by changing the hole rate r, as shown in FIGS. 3(a), 3(b) and 3(c). FIG. 3(a) shows a standard 3×3 convolution kernel applied to the receptive field, as in a conventional convolution operation. FIG. 3(b) shows a 3×3 hole convolution with rate 2; the kernel size is still 3×3, but the receptive field grows to 7×7. FIG. 3(c) shows a 3×3 hole convolution with r = 4; the kernel size is still 3×3, while the receptive field grows to 15×15. The size of the receptive field can be expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate. It can be seen that applying a k_size × k_size kernel to the same input at different hole rates yields features with different receptive fields.
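The sampling behaviour of formula (2) and the receptive-field growth of formula (3) can be illustrated with a one-dimensional toy version; the helper names below are ours, and in practice a framework operation such as TensorFlow's tf.nn.atrous_conv2d would be used instead.

```python
import numpy as np

def hole_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k], the 1-D form of formula (2), valid region only."""
    k = len(w)
    span = r * (k - 1) + 1            # extent actually touched by the dilated kernel
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

def receptive_field(k_size, r_rate):
    """Formula (3): f = (k_size + 1) * r_rate - 1."""
    return (k_size + 1) * r_rate - 1

x = np.arange(20, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(hole_conv1d(x, w, r=1))   # standard convolution, the special case r = 1
print(hole_conv1d(x, w, r=2))   # same kernel, wider sampling span
print(receptive_field(3, 2))    # 7, matching the 7x7 receptive field of FIG. 3(b)
```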
3) Multilevel feature layer-by-layer fusion tuning framework
The parallel pooling idea, currently the most successful, fuses local and global information at each level well, but its final aggregation simply upsamples everything uniformly to the same resolution to form the final feature map. Our method holds that information lost through resolution reduction is better recovered by first fusing feature maps of adjacent sizes, which act as relatively local and relatively global information, and accordingly proposes a new feature fusion method.
With reference to FIG. 4, an image pyramid containing images of different resolutions is generated from the original image; starting from the lowest-resolution image at the top layer, a feature map containing local information is generated through several full convolution operations, then upsampled to the same resolution as the adjacent layer's image and stacked with that layer's initial feature map to participate in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom. This gradual process acquires the strongest possible semantic information on the basis of good detail, and integrates the contexts of different regions more effectively to obtain global context information. The last layer is subjected to a 1×1 convolution operation and fine-tuning in order to obtain the final segmentation map from the feature map. A minimal sketch of this fusion follows.
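The sketch below expresses the top-down fusion in TensorFlow 1.x, the framework used in the experiments below. The channel width, the two-convolution block and all names are our illustrative assumptions; the patent does not fix these details.

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the stated environment

def conv_block(x, channels, name):
    """Stand-in for the 'several full convolution operations' of one pyramid level."""
    with tf.variable_scope(name):
        x = tf.layers.conv2d(x, channels, 3, padding='same', activation=tf.nn.relu)
        return tf.layers.conv2d(x, channels, 3, padding='same', activation=tf.nn.relu)

def fuse_top_down(pyramid_inputs, num_classes):
    """pyramid_inputs: image tensors ordered from the smallest (top) level down."""
    feat = conv_block(pyramid_inputs[0], 64, 'level0')
    for i, img in enumerate(pyramid_inputs[1:], start=1):
        up = tf.image.resize_bilinear(feat, tf.shape(img)[1:3])  # step 4) upsampling
        init = conv_block(img, 64, 'init%d' % i)       # initial features of this level
        feat = conv_block(tf.concat([up, init], -1),   # stack, then fuse
                          64, 'fuse%d' % i)
    # final 1x1 convolution to obtain the segmentation map from the feature map
    return tf.layers.conv2d(feat, num_classes, 1, name='logits')
```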
4) Upsampling using bilinear interpolation
Our method fuses features by restoring the feature map to the same size as the adjacent layer using bilinear interpolation, which keeps the network architecture simple and clear and reduces the computational complexity during training.
Bilinear interpolation is one of the most widely used upsampling methods in current semantic segmentation. It requires no learning, runs fast, and is simple to implement; only fixed parameter values need to be set.
With particular reference to FIG. 5, suppose the value of the unknown function f at the point P = (x, y) is to be calculated, given the known f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2).
Firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
In summary, bilinear interpolation computes the target pixel value from four points of the original image. Our network model upsamples the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features. Using bilinear interpolation for upsampling in place of complex deconvolution operations reduces the complexity of the model.
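For concreteness, formulas (4) to (6) can be transcribed directly; the helper below interpolates one point from scalar corner values (names are illustrative).

```python
def bilinear(x, y, q11, q21, q12, q22, x1, x2, y1, y2):
    """Evaluate f(P) at P = (x, y) from f at Q11, Q21, Q12, Q22 via formulas (4)-(6)."""
    # formulas (4) and (5): linear interpolation in the x direction
    f_r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # formula (6): linear interpolation in the y direction
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2

# the centre of a unit cell is the average of its four corner values
print(bilinear(0.5, 0.5, 10.0, 20.0, 30.0, 40.0, 0.0, 1.0, 0.0, 1.0))  # 25.0
```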
5) Defining a loss function
Like most semantic segmentation models, our model minimizes a cross-entropy loss; the objective function computes the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
the above defined loss function defaults to a uniform distribution of weights for each pixel in the image, so that the learning algorithm favors large regions in the image over small regions. We fine-tune the network model on the partitioning task of the paschaloc 2012 data set by performing a stochastic gradient descent on the cross entropy loss function.
2. Optimized output of back-end conditional random fields
As shown in FIG. 6, a fully connected conditional random field model can be used as an independent post-processing step to refine the segmentation result. The model treats each pixel as a node in a region and measures the relationship between any two pixels no matter how far apart they are, which is why it is also called a dense or fully connected factor graph. With this model, both short-range and long-range pixel interrelationships are taken into account, so the system can consider the detailed information needed during segmentation.
Each pixel i has a class label x_i and a corresponding observed value y_i; each pixel serves as a node and the relationship between pixels as an edge, which constitutes a conditional random field. We then infer the class label x_i corresponding to pixel i from the observed variable y_i.
The energy function E (x) of the fully-connected conditional random field is:
E(x) = Σ_i Ψ_u(x_i) + Σ_(i<j) Ψ_p(x_i, x_j)    (9)
in which the univariate potential function ΣiΨu(xi) I.e. the output from the front-end full convolutional network. And the binary potential function is as follows:
Ψ_p(x_i, x_j) = μ(x_i, x_j) · [ w_1 · exp(−‖p_i − p_j‖²/(2θ_α²) − ‖I_i − I_j‖²/(2θ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2θ_γ²)) ]    (10)
the binary potential function is used for describing the relationship between pixel points and pixel points, similar pixels are encouraged to be assigned with the same label, pixels with larger differences are encouraged to be assigned with different labels, and the definition of the distance is related to the color value and the actual relative distance. So that the conditional random field enables the image to be segmented as much as possible at the boundaries. The difference of the fully connected conditional random field is that the binary potential function describes the relationship of each pixel to all other pixels, so called fully connected.
6) Optimized output of the Gaussian conditional random field
The discrete fully connected conditional random field suffers from a complex formulation and difficult inference, which leads to slow, difficult, or even failed convergence during model training. We therefore use a Gaussian conditional random field at the back end for refinement; its energy function differs from the previous one:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A+λI)x=B (12)
therefore, the quadratic energy function of the Gaussian conditional random field has a definite global minimum, and the solution of the linear equation is much simpler than the solution of the complex energy function optimization.
The data set used for the demonstration experiments of the invention is PASCAL VOC 2012, currently the most widely used data set in the field of image semantic segmentation, on which the superiority of the invention's segmentation results can be verified. The data set contains 20 foreground object classes and 1 background class, with 1464 images for training, 1449 for validation and 1456 for testing. Hariharan et al. provided many additional labels for the data set, adding 10582 training images. The accuracy of our method is measured by the mean intersection-over-union (mIoU) of these 21 pixel classes:
mIoU = (1/(k+1)) · Σ_(i=0..k) p_ii / (Σ_(j=0..k) p_ij + Σ_(j=0..k) p_ji − p_ii)    (13)
where k is the number of foreground classes and p_ij is the number of pixels that belong to class i but are classified into class j.
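This metric can be computed from a confusion matrix; the sketch below uses our own illustrative helper names.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """mIoU of formula (13): mean over classes of p_ii / (row_i + col_i - p_ii)."""
    # confusion matrix: p[i, j] = number of pixels of true class i predicted as class j
    p = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                    minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(p).astype(float)
    union = p.sum(axis=1) + p.sum(axis=0) - inter
    return float(np.mean(inter / np.maximum(union, 1)))

gt = np.tile(np.arange(21), 50)   # toy labels in which every class occurs
print(mean_iou(gt, gt))           # 1.0 for a perfect prediction
```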
Experiment hardware environment: Ubuntu 16.04, Core i7 processor with a 3.6 GHz clock frequency, 48 GB of memory, and an NVIDIA GTX 1080 graphics card. The code execution environment is Python 3.6.5 + TensorFlow 1.8.0.
Our experiment was based on TensorFlow, executing stochastic gradient descent with a mini-batch size of 14 and a momentum of 0.9 for 600000 iterations, with the learning rate following the 'poly' policy:
lr = base_lr · (1 − iter/max_iter)^power    (14)
where power = 0.9.
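The 'poly' schedule of formula (14) with power = 0.9 is a one-liner; the base learning rate here is an illustrative assumption, as the patent does not state it.

```python
def poly_lr(base_lr, step, max_step=600000, power=0.9):
    """lr = base_lr * (1 - iter/max_iter) ** power, the schedule used in training."""
    return base_lr * (1.0 - float(step) / max_step) ** power

print(poly_lr(0.001, 0))        # 0.001 at the start of training
print(poly_lr(0.001, 300000))   # decayed to about 0.00054 halfway through
```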
As shown in FIGS. 7(a), 7(b) and 7(c): FIG. 7(a) shows the iterative process of the regularization penalty term added to prevent overfitting; FIG. 7(b) shows the regression loss of the segmentation model, which gradually converges to a more optimal region after 600000 iterations; and FIG. 7(c) shows the optimization process of the entire objective function, the sum of the regularization loss and the regression loss. It can be seen that the optimization of the objective function is not smooth, and the overall convergence trend appears gradually after a large number of iterations.
The algorithm of the present invention was compared with FCN (Fully Convolutional Networks for Semantic Segmentation, 2015), DeepLab (DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2017) and PSPNet (Pyramid Scene Parsing Network, 2017); the results are presented in FIG. 9. The figure shows that the method segments the target objects well, with relatively accurate pixel localization, and has an obvious advantage over FCN and DeepLab. Because our method improves on PSPNet's idea of parallel-pooling multi-scale fusion, its segmentation quality is close to PSPNet's; still, it is slightly better than PSPNet in a few subtle places and has certain advantages overall.
As can be seen from FIG. 8, the FCN network has the simplest structure and no deconvolution layer, so its processing speed is the fastest; DeepLab adds a conditional random field for back-end refinement on top of FCN, so it is much slower. Our method has a slight advantage in accuracy over PSPNet, and its processing speed is slower than PSPNet's because the Gaussian conditional random field performs secondary back-end processing; however, the quadratic energy function of the Gaussian conditional random field is concise and convenient to compute, making our method faster than DeepLab.
TABLE 1: Scores on the PASCAL VOC 2012 data set
We finally performed the evaluation on the PASCAL VOC 2012 test set. To analyze the segmentation of different objects, Table 1 lists the segmentation accuracy of all object classes in the PASCAL VOC 2012 data set. It can be seen that no method is optimal for the segmentation of all objects, and some special or partially occluded objects are harder to segment than unoccluded ones. Nevertheless, in terms of overall mIoU, the layer-by-layer fusion of multilevel features gives our method certain advantages over many other approaches.
It should be understood that the above-mentioned embodiment is only illustrative of the technical concept and features of the present invention and is intended to enable those skilled in the art to understand and implement the invention, not to limit its scope. All modifications made according to the spirit of the main technical scheme of the invention are covered by its protection scope.

Claims (3)

1. An image semantic segmentation method based on multi-stage feature fusion and a Gaussian conditional random field is characterized by comprising the following steps:
1) constructing an image pyramid
Construct a four-layer image pyramid for a single-frame image, with the layers numbered from bottom to top and the i-th layer denoted G_i; to generate the (i+1)-th layer of the pyramid, convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
then, starting from the input original image G_0, the above process is iterated to finally generate the whole image pyramid;
2) preserving feature map resolution using hole convolution
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension; standard convolution corresponds to the special case of hole rate r = 1;
3) multilevel feature layer-by-layer fusion tuning framework
Generate from the original image an image pyramid containing images of different resolutions; starting from the lowest-resolution image at the top layer, generate a feature map containing local information through several full convolution operations, then upsample this feature map to the same resolution as the adjacent layer's image and stack it with that layer's initial feature map so that it participates in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom; a 1×1 convolution operation and fine-tuning are applied at the last layer in order to obtain the final segmentation map from the feature map;
4) upsampling using bilinear interpolation
Upsample the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features;
5) defining a loss function
A minimized cross-entropy loss function is adopted, computing the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
then a network model is fine-tuned on the segmentation task of the PASCAL VOC 2012 data set by performing stochastic gradient descent on the cross-entropy loss function;
6) optimized output of the Gaussian conditional random field
A Gaussian conditional random field is used as the back end for refinement, its energy function E(x) being:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A + λI)x = B    (12).
2. The image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 2), the hole convolution allows the size of the receptive field to be adaptively modified by changing the hole rate r, the size of the receptive field being expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate.
3. The image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 4), the bilinear interpolation method computes the target pixel value from four points of the original image; given the f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2), the value of the unknown function f at the point P = (x, y) is calculated, specifically comprising:
firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
CN201811218436.2A 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field Active CN109461157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Publications (2)

Publication Number Publication Date
CN109461157A (en) 2019-03-12
CN109461157B (en) 2021-07-09

Family

ID=65607897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811218436.2A Active CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Country Status (1)

Country Link
CN (1) CN109461157B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676284B2 (en) * 2019-03-22 2023-06-13 Nvidia Corporation Shape fusion for image analysis
CN110070022A (en) * 2019-04-16 2019-07-30 西北工业大学 A kind of natural scene material identification method based on image
CN110047047B (en) * 2019-04-17 2023-02-10 广东工业大学 Method for interpreting three-dimensional morphology image information device, apparatus and storage medium
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110246084B (en) * 2019-05-16 2023-03-31 五邑大学 Super-resolution image reconstruction method, system and device thereof, and storage medium
CN110188765B (en) 2019-06-05 2021-04-06 京东方科技集团股份有限公司 Image semantic segmentation model generation method, device, equipment and storage medium
CN110263732B (en) * 2019-06-24 2022-01-21 京东方科技集团股份有限公司 Multi-scale target detection method and device
CN110348447B (en) * 2019-06-27 2022-04-19 电子科技大学 Multi-model integrated target detection method with abundant spatial information
CN110490842B (en) * 2019-07-22 2023-07-04 同济大学 Strip steel surface defect detection method based on deep learning
CN110705344B (en) * 2019-08-21 2023-03-28 中山大学 Crowd counting model based on deep learning and implementation method thereof
CN110633715B (en) * 2019-09-27 2021-09-10 深圳市商汤科技有限公司 Image processing method, network training method and device and electronic equipment
CN110738647B (en) * 2019-10-12 2020-06-12 成都考拉悠然科技有限公司 Mouse detection method integrating multi-receptive-field feature mapping and Gaussian probability model
CN110837811B (en) * 2019-11-12 2021-01-05 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN110969166A (en) * 2019-12-04 2020-04-07 国网智能科技股份有限公司 Small target identification method and system in inspection scene
CN111145196A (en) * 2019-12-11 2020-05-12 中国科学院深圳先进技术研究院 Image segmentation method and device and server
CN111274995B (en) * 2020-02-13 2023-07-14 腾讯科技(深圳)有限公司 Video classification method, apparatus, device and computer readable storage medium
CN111738902A (en) * 2020-03-12 2020-10-02 超威半导体(上海)有限公司 Large convolution kernel real-time approximate fitting method based on bilinear filtering image hierarchy
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN111523546B (en) * 2020-04-16 2023-06-16 湖南大学 Image semantic segmentation method, system and computer storage medium
CN112381020A (en) * 2020-11-20 2021-02-19 深圳市银星智能科技股份有限公司 Video scene identification method and system and electronic equipment
CN113159038B (en) * 2020-12-30 2022-05-27 太原理工大学 Coal rock segmentation method based on multi-mode fusion
CN112837320B (en) * 2021-01-29 2023-10-27 华中科技大学 Remote sensing image semantic segmentation method based on parallel hole convolution
CN113033570B (en) * 2021-03-29 2022-11-11 同济大学 Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN112948952B (en) * 2021-04-08 2023-05-23 郑州航空工业管理学院 Evolution prediction method for cavity behind shield tunnel lining
CN113034371B (en) * 2021-05-27 2021-08-17 四川轻化工大学 Infrared and visible light image fusion method based on feature embedding
CN113744169A (en) * 2021-09-07 2021-12-03 讯飞智元信息科技有限公司 Image enhancement method and device, electronic equipment and storage medium
CN114549913B (en) * 2022-04-25 2022-07-19 深圳思谋信息科技有限公司 Semantic segmentation method and device, computer equipment and storage medium
CN116777768A (en) * 2023-05-25 2023-09-19 珠海移科智能科技有限公司 Robust and efficient scanned document image enhancement method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637045A (en) * 2013-11-14 2015-05-20 重庆理工大学 Image pixel labeling method based on super pixel level features
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637045A (en) * 2013-11-14 2015-05-20 重庆理工大学 Image pixel labeling method based on super pixel level features
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vemulapalli R. et al., "Gaussian Conditional Random Field Network for Semantic Segmentation," 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016-12-12, full text. *

Also Published As

Publication number Publication date
CN109461157A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109461157B (en) Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
Yang et al. Deeperlab: Single-shot image parser
Zhou et al. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction
Bansal et al. Pixelnet: Towards a general pixel-level architecture
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
CN112219223A (en) Generating a displacement map of pairs of input datasets of image or audio data
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
Pfeuffer et al. Semantic segmentation of video sequences with convolutional lstms
Liu et al. TSingNet: Scale-aware and context-rich feature learning for traffic sign detection and recognition in the wild
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN112329801B (en) Convolutional neural network non-local information construction method
CN113361485A (en) Hyperspectral image classification method based on spectral space attention fusion and deformable convolution residual error network
CN111401455A (en) Remote sensing image deep learning classification method and system based on Capsules-Unet model
CN114037640A (en) Image generation method and device
CN110889360A (en) Crowd counting method and system based on switching convolutional network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
Eldesokey et al. Normalized convolution upsampling for refined optical flow estimation
CN115482518A (en) Extensible multitask visual perception method for traffic scene
Liu et al. Focus first: Coarse-to-fine traffic sign detection with stepwise learning
CN115410030A (en) Target detection method, target detection device, computer equipment and storage medium
Jiang et al. Semantic segmentation network combined with edge detection for building extraction in remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant