CN109461157B - Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field - Google Patents

Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Info

Publication number
CN109461157B
Authority
CN
China
Prior art keywords: layer, image, convolution, conditional random, gaussian
Prior art date
Legal status: Active
Application number
CN201811218436.2A
Other languages
Chinese (zh)
Other versions
CN109461157A (en)
Inventor
Zhou Pengcheng (周鹏程)
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201811218436.2A
Publication of CN109461157A
Application granted
Publication of CN109461157B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention discloses an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field, comprising the following steps: 1) constructing an image pyramid; 2) keeping the resolution of the feature maps unchanged by using hole convolution; 3) fusing and tuning multilevel features layer by layer; 4) upsampling by bilinear interpolation; 5) defining a loss function; 6) refining the output with a Gaussian conditional random field. The method constructs an image pyramid to realize a fully convolutional architecture in which multilevel features are fused layer by layer, replacing the popular parallel pooling module with a top-down tuning framework: features at different scales are obtained and fused layer by layer, fusion between adjacent pyramid layers is given priority, and context information is captured to the maximum extent. A Gaussian conditional random field further refines the front-end output, capturing more spatial detail so that object boundaries in the segmentation map are more accurate; the output of the overall architecture finally achieves the optimal semantic segmentation result.

Description

Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
Technical Field
The invention relates to an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field.
Background
Early image segmentation merely divided the content of an image into several coarse regions. As research progressed, such coarse segmentation could no longer meet the requirements of many applications, and semantic segmentation was proposed. The semantics of an image refer to semantic information such as the category of the objects or entities contained in the image or an image region; image segmentation at the semantic level is called semantic segmentation. Image semantic segmentation separates foreground from background in a single frame and identifies the category of each foreground target, which amounts to assigning a semantic label to every pixel. Semantic segmentation is thus a major upgrade of image segmentation in both precision and fineness.
The purpose of semantic segmentation is to classify each pixel of an image into one of a set of predetermined category labels, so that regions are not only segmented but also labeled with their content. In reality, more and more vision applications need to infer knowledge or semantics from imagery, i.e., a concrete-to-abstract process. These applications include autonomous driving, human-computer interaction, computational photography, image search engines, and augmented reality; the real need behind them is an accurate and efficient segmentation technique.
The most advanced image semantic segmentation models currently combine deep learning with probabilistic graphical models. The front end of such a model is based on a deep convolutional neural network originally designed for image classification, with the final fully connected layers replaced by convolutional layers; the result, called a fully convolutional network, can semantically segment images of arbitrary size. At the back end, a fully connected conditional random field refines the coarse semantic features extracted by the convolutional network, enhancing the model's ability to capture spatial detail. Macroscopically, the fully convolutional network is more of a technique than a complete solution, and continues to advance as the performance of the underlying backbone network improves. Deep learning plus a probabilistic graphical model is a trend: deep learning performs feature extraction, while the probabilistic graphical model explains the essential relationships between things from a mathematical perspective.
Specifically, image semantic segmentation methods based on fully convolutional networks use an existing convolutional neural network as one of their modules to generate hierarchical features, converting well-known classification models into fully convolutional ones: all fully connected layers are replaced by convolutional layers, and spatial maps rather than classification scores are output. These maps are upsampled by fractionally strided convolution (deconvolution) to produce dense pixel-level labels. The work of Long et al. is considered a milestone because it showed that convolutional neural networks can be trained end to end on the semantic segmentation problem and can efficiently learn to produce pixel-level label predictions from inputs of arbitrary size. Vijay et al. propose SegNet, which introduces more skip connections and reuses max-pooling indices rather than copying encoder features as in a fully convolutional network, making SegNet more memory-efficient. Fisher et al. propose hole (dilated) convolution as the convolutional layer for pixel-level prediction, which grows the receptive field exponentially without reducing the spatial dimension. Zhao et al. argue that the global scene provides category distribution information for image semantic segmentation and propose the pyramid pooling module, which gathers this information with large-kernel pooling layers.
the method based on the deep convolutional nerve seems to be inherently difficult to obtain between classification performance and positioning precision, the strong invariance of the convolutional network to image space transformation enables the convolutional network to accurately predict the existence and rough position information of a target, the segmentation result is often not harmonious enough, and some small regions may have incorrect labels which are inconsistent with the labels of surrounding pixels, so that the boundary of an object cannot be accurately drawn. Also, the cross-entropy loss function is not the most ideal loss function for semantic segmentation, because the final loss value of the image is just the superposition of the loss values of each pixel, and the cross-entropy loss cannot guarantee the continuity of the pixels. A general method for optimizing the output of a segmentation architecture and enhancing the capture of fine-grained information thereof is to introduce a conditional random field as a back-end processing module thereof, and to use a fully-connected conditional random field model as an independent post-processing step in the flow thereof, so as to optimize the segmentation result. While a fully connected model is generally inefficient, the model may be relatively efficient because it may be approximated by probabilistic reasoning. Conditional random fields exploit the similarity between pixels in the original image.
Modern architectures for image semantic segmentation mainly suffer from two problems:
1. Context information is lost during front-end feature extraction: in current semantic segmentation based on fully convolutional networks, the output image should have the same resolution as the input. However, successive pooling and striding operations shrink the resulting feature maps, which become smaller and smaller, and the feature maps must then be upsampled by deconvolution to restore the original size and spatial resolution. This process inevitably loses information that cannot be recovered, and the frequent deconvolution operations also cost extra memory and time. Semantic segmentation therefore needs to integrate information at multiple scales and to balance local and global information. On the one hand, fine-grained or local information is critical for correct pixel-level labeling; on the other hand, integrating the global context of the image is important for resolving local ambiguities. Standard convolutional architectures handle this balance poorly: pooling layers give the network some spatial invariance at constant computational efficiency but lose global context, while a convolutional network without pooling is also limited, since a neuron's receptive field can only grow linearly with the number of layers.
2. The back-end model is difficult to converge: a conditional random field is introduced at the back end to refine the output of the segmentation framework and enhance its ability to capture detail. However, both the commonly used Gibbs conditional random field and the Markov random field suffer from complex formulations and difficult inference, which often makes model training converge slowly, with difficulty, or not at all.
Disclosure of Invention
The invention aims to provide an image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field.
The technical scheme of the invention is as follows: an image semantic segmentation method based on multi-stage feature fusion and Gaussian conditional random fields comprises the following steps:
1) constructing an image pyramid
Construct a four-layer image pyramid for a single-frame image, with the layers numbered from bottom to top and the i-th layer denoted G_i; to generate the (i+1)-th layer of the pyramid, convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
then, starting from the input original image G_0, the above process is iterated to finally generate the whole image pyramid;
2) preserving feature map resolution using hole convolution
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension; standard convolution corresponds to the special case of hole rate r = 1;
3) multilevel feature layer-by-layer fusion tuning framework
Generate from the original image an image pyramid containing images of different resolutions; starting from the lowest-resolution image at the top layer, generate a feature map containing local information through several full convolution operations, then upsample this feature map to the same resolution as the adjacent layer's image and stack it with that layer's initial feature map so that it participates in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom; a 1×1 convolution operation and fine-tuning are applied at the last layer in order to obtain the final segmentation map from the feature map;
4) upsampling using bilinear interpolation
Upsample the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features;
5) defining a loss function
A minimized cross-entropy loss function is adopted, computing the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
then a network model is fine-tuned on the segmentation task of the PASCAL VOC 2012 data set by performing stochastic gradient descent on the cross-entropy loss function;
6) optimized output of the Gaussian conditional random field
A Gaussian conditional random field is used as the back end for refinement, its energy function E(x) being:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A+λI)x=B (12)
further, in the present invention, in step 2), the hole convolution allows to adaptively modify the size of the receptive field by changing the hole rate r, and the size of the receptive field can be expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate.
Further, in the present invention, in step 4), the bilinear interpolation method computes the target pixel value from four points of the original image: given the f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2), the value of the unknown function f at the point P = (x, y) is calculated, specifically comprising:
firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
Compared with the prior art, the invention has the following advantages:
1) The invention provides a tuning strategy in which multilevel features are fused layer by layer: an image pyramid performs multi-scale pixel sampling of the original image to generate several images of different resolutions; the lowest-resolution image at the top layer is input to the network front end for coarse extraction of semantic features, and the result is then upsampled until it matches the higher-resolution image of the next layer down, entering that layer to join the new-resolution image as input to the network front end. This layer-by-layer fusion, learning hierarchical abstraction at earlier layers and capturing high-precision information at later layers, effectively increases the model's ability to capture context information.
2) At the back end the invention uses, among the fully connected conditional random fields, the Gaussian conditional random field, which has the obvious advantage of a quadratic energy with a global solution, effectively alleviating slow or even failed convergence during model training.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a basic block diagram of the process of the present invention;
FIG. 2 is a schematic diagram of an image pyramid structure according to the present invention;
FIG. 3(a) is a schematic diagram of a 3×3 hole convolution with hole rate 1 according to the present invention;
FIG. 3(b) is a schematic diagram of a 3×3 hole convolution with hole rate 2 according to the present invention;
FIG. 3(c) is a schematic diagram of a 3×3 hole convolution with hole rate 4 according to the present invention;
FIG. 4 is a schematic diagram of a multi-level feature layer-by-layer fusion architecture according to the present invention;
FIG. 5 is a diagram illustrating bilinear interpolation according to the present invention;
FIG. 6 is a schematic representation of a fully connected conditional random field according to the present invention;
FIG. 7(a) is a schematic diagram of an iterative process for penalizing the regularization term in the present invention;
FIG. 7(b) is a regression loss diagram of the segmentation model in the present invention;
FIG. 7(c) is a schematic of the sum of the regularization and regression losses in the present invention;
FIG. 8 is a graph comparing score against speed on PASCAL VOC 2012 in the present invention;
FIG. 9 is a graph comparing visualized results on the PASCAL VOC 2012 data in the present invention.
Detailed Description
Example:
the method is a specific implementation mode of the image semantic segmentation method based on the multilevel feature fusion and the Gaussian conditional random field, and a basic frame diagram of the method is shown in figure 1. In addition, the output of the front end is further optimized by using the Gaussian conditional random field, more space details are captured, and the object boundary in the segmentation effect graph is more accurate. And finally, outputting the whole framework to obtain the optimal semantic segmentation effect.
The initial feature map of each layer of the image pyramid is computed, concatenated with the upsampled feature map of the layer above, and fed into the remaining full convolution operations of that layer; the final semantic segmentation map is obtained through the last full convolution operation at the bottom of the pyramid. The method comprises: constructing an image pyramid at the front end, keeping the feature-map resolution unchanged using hole convolution, fusing and tuning multilevel features layer by layer, upsampling by bilinear interpolation, defining a loss function, and refining the output with a Gaussian conditional random field at the back end.
1. Extraction of rough semantic features by front-end full convolution network
1) Constructing an image pyramid
The SIFT algorithm acquires feature information at different scales by constructing an image pyramid; this method likewise constructs a four-layer image pyramid for a single-frame image. A pyramid can be imagined as a set of layers: the higher the layer, the smaller its size.
As shown in FIG. 2, each layer is numbered from bottom to top, the i-th layer being denoted G_i; the (i+1)-th layer (denoted G_(i+1)) is therefore smaller than the i-th layer (denoted G_i). To generate the (i+1)-th layer of the image pyramid, we convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
it can be easily noted that the image generated will be one quarter of its predecessors. In inputting the original image G0And iterating the above process to finally generate the whole image pyramid.
2) Preserving feature map resolution using hole convolution
The hole convolution method, originally used in wavelet transform analysis in signal processing, extracts dense features by removing the downsampling operations of the last few layers of a fully convolutional network together with the corresponding upsampling operations. The resolution of the feature maps in the convolutional neural network can thus be controlled effectively without learning extra parameters. This avoids the problem of standard convolutional networks, in which the feature maps shrink progressively and must be upsampled by deconvolution to restore their original size and spatial resolution, inevitably losing information, while the frequent deconvolution operations also cost extra memory and time.
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension, hence the name hole convolution. Standard convolution corresponds to the special case of hole rate r = 1.
The hole convolution allows us to adaptively modify the size of the receptive field by changing the hole rate r, as shown in FIGS. 3(a), 3(b) and 3(c). FIG. 3(a) shows a standard 3×3 convolution kernel applied to the receptive field, as in a conventional convolution operation. FIG. 3(b) shows a 3×3 hole convolution with rate 2; the kernel size is still 3×3, but the receptive field grows to 7×7. FIG. 3(c) shows a 3×3 hole convolution with r = 4; the kernel size is still 3×3, while the receptive field grows to 15×15. The size of the receptive field can be expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate. It can be seen that applying a k_size × k_size kernel to the same input at different hole rates yields features with different receptive fields.
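The sampling behaviour of formula (2) and the receptive-field growth of formula (3) can be illustrated with a one-dimensional toy version; the helper names below are ours, and in practice a framework operation such as TensorFlow's tf.nn.atrous_conv2d would be used instead.

```python
import numpy as np

def hole_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k], the 1-D form of formula (2), valid region only."""
    k = len(w)
    span = r * (k - 1) + 1            # extent actually touched by the dilated kernel
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

def receptive_field(k_size, r_rate):
    """Formula (3): f = (k_size + 1) * r_rate - 1."""
    return (k_size + 1) * r_rate - 1

x = np.arange(20, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(hole_conv1d(x, w, r=1))   # standard convolution, the special case r = 1
print(hole_conv1d(x, w, r=2))   # same kernel, wider sampling span
print(receptive_field(3, 2))    # 7, matching the 7x7 receptive field of FIG. 3(b)
```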
3) Multilevel feature layer-by-layer fusion tuning framework
The parallel pooling idea, currently the most successful, fuses local and global information at each level well, but its final aggregation simply upsamples everything uniformly to the same resolution to form the final feature map. Our method holds that information lost through resolution reduction is better recovered by first fusing feature maps of adjacent sizes, which act as relatively local and relatively global information, and accordingly proposes a new feature fusion method.
With reference to FIG. 4, an image pyramid containing images of different resolutions is generated from the original image; starting from the lowest-resolution image at the top layer, a feature map containing local information is generated through several full convolution operations, then upsampled to the same resolution as the adjacent layer's image and stacked with that layer's initial feature map to participate in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom. This gradual process acquires the strongest possible semantic information on the basis of good detail, and integrates the contexts of different regions more effectively to obtain global context information. The last layer is subjected to a 1×1 convolution operation and fine-tuning in order to obtain the final segmentation map from the feature map. A minimal sketch of this fusion follows.
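The sketch below expresses the top-down fusion in TensorFlow 1.x, the framework used in the experiments below. The channel width, the two-convolution block and all names are our illustrative assumptions; the patent does not fix these details.

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the stated environment

def conv_block(x, channels, name):
    """Stand-in for the 'several full convolution operations' of one pyramid level."""
    with tf.variable_scope(name):
        x = tf.layers.conv2d(x, channels, 3, padding='same', activation=tf.nn.relu)
        return tf.layers.conv2d(x, channels, 3, padding='same', activation=tf.nn.relu)

def fuse_top_down(pyramid_inputs, num_classes):
    """pyramid_inputs: image tensors ordered from the smallest (top) level down."""
    feat = conv_block(pyramid_inputs[0], 64, 'level0')
    for i, img in enumerate(pyramid_inputs[1:], start=1):
        up = tf.image.resize_bilinear(feat, tf.shape(img)[1:3])  # step 4) upsampling
        init = conv_block(img, 64, 'init%d' % i)       # initial features of this level
        feat = conv_block(tf.concat([up, init], -1),   # stack, then fuse
                          64, 'fuse%d' % i)
    # final 1x1 convolution to obtain the segmentation map from the feature map
    return tf.layers.conv2d(feat, num_classes, 1, name='logits')
```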
4) Upsampling using bilinear interpolation
Our method fuses features by restoring the feature map to the same size as the adjacent layer using bilinear interpolation, which keeps the network architecture simple and clear and reduces the computational complexity during training.
Bilinear interpolation is one of the most widely used upsampling methods in current semantic segmentation. It requires no learning, runs fast, and is simple to implement; only fixed parameter values need to be set.
With particular reference to FIG. 5, suppose the value of the unknown function f at the point P = (x, y) is to be calculated, given the known f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2).
Firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
In summary, bilinear interpolation computes the target pixel value from four points of the original image. Our network model upsamples the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features. Using bilinear interpolation for upsampling in place of complex deconvolution operations reduces the complexity of the model.
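For concreteness, formulas (4) to (6) can be transcribed directly; the helper below interpolates one point from scalar corner values (names are illustrative).

```python
def bilinear(x, y, q11, q21, q12, q22, x1, x2, y1, y2):
    """Evaluate f(P) at P = (x, y) from f at Q11, Q21, Q12, Q22 via formulas (4)-(6)."""
    # formulas (4) and (5): linear interpolation in the x direction
    f_r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # formula (6): linear interpolation in the y direction
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2

# the centre of a unit cell is the average of its four corner values
print(bilinear(0.5, 0.5, 10.0, 20.0, 30.0, 40.0, 0.0, 1.0, 0.0, 1.0))  # 25.0
```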
5) Defining a loss function
Like most semantic segmentation models, our model minimizes a cross-entropy loss; the objective function computes the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
the above defined loss function defaults to a uniform distribution of weights for each pixel in the image, so that the learning algorithm favors large regions in the image over small regions. We fine-tune the network model on the partitioning task of the paschaloc 2012 data set by performing a stochastic gradient descent on the cross entropy loss function.
2. Optimized output of back-end conditional random fields
As shown in FIG. 6, a fully connected conditional random field model can be used as an independent post-processing step to refine the segmentation result. The model treats each pixel as a node in a region and measures the relationship between any two pixels no matter how far apart they are, which is why it is also called a dense or fully connected factor graph. With this model, both short-range and long-range pixel interrelationships are taken into account, so the system can consider the detailed information needed during segmentation.
Each pixel i has a class label x_i and a corresponding observed value y_i; each pixel serves as a node and the relationship between pixels as an edge, which constitutes a conditional random field. We then infer the class label x_i corresponding to pixel i from the observed variable y_i.
The energy function E (x) of the fully-connected conditional random field is:
E(x) = Σ_i Ψ_u(x_i) + Σ_(i<j) Ψ_p(x_i, x_j)    (9)
in which the univariate potential function ΣiΨu(xi) I.e. the output from the front-end full convolutional network. And the binary potential function is as follows:
Ψ_p(x_i, x_j) = μ(x_i, x_j) · [ w_1 · exp(−‖p_i − p_j‖²/(2θ_α²) − ‖I_i − I_j‖²/(2θ_β²)) + w_2 · exp(−‖p_i − p_j‖²/(2θ_γ²)) ]    (10)
the binary potential function is used for describing the relationship between pixel points and pixel points, similar pixels are encouraged to be assigned with the same label, pixels with larger differences are encouraged to be assigned with different labels, and the definition of the distance is related to the color value and the actual relative distance. So that the conditional random field enables the image to be segmented as much as possible at the boundaries. The difference of the fully connected conditional random field is that the binary potential function describes the relationship of each pixel to all other pixels, so called fully connected.
6) Optimized output of the Gaussian conditional random field
The discrete fully connected conditional random field suffers from a complex formulation and difficult inference, which leads to slow, difficult, or even failed convergence during model training. We therefore use a Gaussian conditional random field at the back end for refinement; its energy function differs from the previous one:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A+λI)x=B (12)
therefore, the quadratic energy function of the Gaussian conditional random field has a definite global minimum, and the solution of the linear equation is much simpler than the solution of the complex energy function optimization.
The data set used for the demonstration experiments of the invention is PASCAL VOC 2012, currently the most widely used data set in the field of image semantic segmentation, on which the superiority of the invention's segmentation results can be verified. The data set contains 20 foreground object classes and 1 background class, with 1464 images for training, 1449 for validation and 1456 for testing. Hariharan et al. provided many additional labels for the data set, adding 10582 training images. The accuracy of our method is measured by the mean intersection-over-union (mIoU) of these 21 pixel classes:
mIoU = (1/(k+1)) · Σ_(i=0..k) p_ii / (Σ_(j=0..k) p_ij + Σ_(j=0..k) p_ji − p_ii)    (13)
where k is the number of foreground classes and p_ij is the number of pixels that belong to class i but are classified into class j.
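This metric can be computed from a confusion matrix; the sketch below uses our own illustrative helper names.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """mIoU of formula (13): mean over classes of p_ii / (row_i + col_i - p_ii)."""
    # confusion matrix: p[i, j] = number of pixels of true class i predicted as class j
    p = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                    minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(p).astype(float)
    union = p.sum(axis=1) + p.sum(axis=0) - inter
    return float(np.mean(inter / np.maximum(union, 1)))

gt = np.tile(np.arange(21), 50)   # toy labels in which every class occurs
print(mean_iou(gt, gt))           # 1.0 for a perfect prediction
```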
Experiment hardware environment: Ubuntu 16.04, Core i7 processor with a 3.6 GHz clock frequency, 48 GB of memory, and an NVIDIA GTX 1080 graphics card. The code execution environment is Python 3.6.5 + TensorFlow 1.8.0.
Our experiment was based on TensorFlow, executing stochastic gradient descent with a mini-batch size of 14 and a momentum of 0.9 for 600000 iterations, with the learning rate following the 'poly' policy:
lr = base_lr · (1 − iter/max_iter)^power    (14)
where power = 0.9.
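The 'poly' schedule of formula (14) with power = 0.9 is a one-liner; the base learning rate here is an illustrative assumption, as the patent does not state it.

```python
def poly_lr(base_lr, step, max_step=600000, power=0.9):
    """lr = base_lr * (1 - iter/max_iter) ** power, the schedule used in training."""
    return base_lr * (1.0 - float(step) / max_step) ** power

print(poly_lr(0.001, 0))        # 0.001 at the start of training
print(poly_lr(0.001, 300000))   # decayed to about 0.00054 halfway through
```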
As shown in FIGS. 7(a), 7(b) and 7(c): FIG. 7(a) shows the iterative process of the regularization penalty term added to prevent overfitting; FIG. 7(b) shows the regression loss of the segmentation model, which gradually converges to a more optimal region after 600000 iterations; and FIG. 7(c) shows the optimization process of the entire objective function, the sum of the regularization loss and the regression loss. It can be seen that the optimization of the objective function is not smooth, and the overall convergence trend appears gradually after a large number of iterations.
The algorithm of the present invention was compared with FCN (Fully Convolutional Networks for Semantic Segmentation, 2015), DeepLab (DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2017) and PSPNet (Pyramid Scene Parsing Network, 2017); the results are presented in FIG. 9. The figure shows that the method segments the target objects well, with relatively accurate pixel localization, and has an obvious advantage over FCN and DeepLab. Because our method improves on PSPNet's idea of parallel-pooling multi-scale fusion, its segmentation quality is close to PSPNet's; still, it is slightly better than PSPNet in a few subtle places and has certain advantages overall.
As can be seen from FIG. 8, the FCN network has the simplest structure and no deconvolution layer, so its processing speed is the fastest; DeepLab adds a conditional random field for back-end refinement on top of FCN, so it is much slower. Our method has a slight advantage in accuracy over PSPNet, and its processing speed is slower than PSPNet's because the Gaussian conditional random field performs secondary back-end processing; however, the quadratic energy function of the Gaussian conditional random field is concise and convenient to compute, making our method faster than DeepLab.
TABLE 1: Scores on the PASCAL VOC 2012 data set
We finally performed the evaluation on the PASCAL VOC 2012 test set. To analyze the segmentation of different objects, Table 1 lists the segmentation accuracy of all object classes in the PASCAL VOC 2012 data set. It can be seen that no method is optimal for the segmentation of all objects, and some special or partially occluded objects are harder to segment than unoccluded ones. Nevertheless, in terms of overall mIoU, the layer-by-layer fusion of multilevel features gives our method certain advantages over many other approaches.
It should be understood that the above-mentioned embodiment is only illustrative of the technical concept and features of the present invention and is intended to enable those skilled in the art to understand and implement the invention, not to limit its scope. All modifications made according to the spirit of the main technical scheme of the invention are covered by its protection scope.

Claims (3)

1. An image semantic segmentation method based on multi-stage feature fusion and a Gaussian conditional random field is characterized by comprising the following steps:
1) constructing an image pyramid
Construct a four-layer image pyramid for a single-frame image, with the layers numbered from bottom to top and the i-th layer denoted G_i; to generate the (i+1)-th layer of the pyramid, convolve G_i with a Gaussian kernel k_Gaussian and then delete every even-numbered row and column, the Gaussian kernel k_Gaussian being expressed as formula (1):
k_Gaussian = (1/256) · | 1  4  6  4  1 |
                       | 4 16 24 16  4 |
                       | 6 24 36 24  6 |
                       | 4 16 24 16  4 |
                       | 1  4  6  4  1 |    (1)
then, starting from the input original image G_0, the above process is iterated to finally generate the whole image pyramid;
2) preserving feature map resolution using hole convolution
Viewing two-dimensional signals such as images from the perspective of a filter, a hole convolution with filter w is applied to the input feature map x to obtain the output y at position i:
y[i] = Σ_k x[i + r·k] · w[k]    (2)
where the hole rate r corresponds to the stride with which the input signal is sampled, equivalent to convolving the input x with an upsampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension; standard convolution corresponds to the special case of hole rate r = 1;
3) multilevel feature layer-by-layer fusion tuning framework
Generate from the original image an image pyramid containing images of different resolutions; starting from the lowest-resolution image at the top layer, generate a feature map containing local information through several full convolution operations, then upsample this feature map to the same resolution as the adjacent layer's image and stack it with that layer's initial feature map so that it participates in the layer's subsequent full convolution operations, i.e., new local features are obtained by fusion, refining step by step from top to bottom; a 1×1 convolution operation and fine-tuning are applied at the last layer in order to obtain the final segmentation map from the feature map;
4) upsampling using bilinear interpolation
Upsample the low-resolution feature map containing local features by bilinear interpolation so that it fuses into the high-resolution feature map containing relatively global features;
5) defining a loss function
A minimized cross-entropy loss function is adopted, computing the sum of the cross-entropy terms at every spatial position of the convolutional network's output map, that is, the distance between each pixel's predicted probability distribution and its true probability distribution:
loss = L(θ) + λ·R(θ)    (7)
where L is the cross-entropy loss function over the misclassified labels and R is a regularization term; the function L typically decomposes as the sum of per-pixel losses, with y_(i,c) the ground-truth indicator and ŷ_(i,c) the predicted probability of class c at pixel i:
L(θ) = −Σ_i Σ_c y_(i,c) · log ŷ_(i,c)    (8)
then a network model is fine-tuned on the segmentation task of the PASCAL VOC 2012 data set by performing stochastic gradient descent on the cross-entropy loss function;
6) optimized output of the Gaussian conditional random field
A Gaussian conditional random field is used as the back end for refinement, its energy function E(x) being:
E(x) = (1/2)·x^T·(A + λI)·x − B^T·x    (11)
when A + λ I is a symmetric positive definite matrix, solving for the minimum of E (x) is equivalent to solving the equation:
(A + λI)x = B    (12).
2. The image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 2), the hole convolution allows the size of the receptive field to be adaptively modified by changing the hole rate r, the size of the receptive field being expressed as:
f = (k_size + 1) · r_rate − 1    (3)
where k_size denotes the actual size of the convolution kernel and r_rate denotes the hole rate.
3. The image semantic segmentation method based on multistage feature fusion and a Gaussian conditional random field according to claim 1, characterized in that: in step 4), the bilinear interpolation method computes the target pixel value from four points of the original image; given the f-function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2), the value of the unknown function f at the point P = (x, y) is calculated, specifically comprising:
firstly, linear interpolation is carried out in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (4)

f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)    (5)
where R1 = (x, y1), R2 = (x, y2);
second, linear interpolation is carried out in the y direction, using the R1 and R2 computed in the first step to interpolate the point P:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (6)
where P = (x, y).
CN201811218436.2A 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field Active CN109461157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811218436.2A CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Publications (2)

Publication Number Publication Date
CN109461157A (en) 2019-03-12
CN109461157B (en) 2021-07-09

Family

ID=65607897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811218436.2A Active CN109461157B (en) 2018-10-19 2018-10-19 Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field

Country Status (1)

Country Link
CN (1) CN109461157B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676284B2 (en) * 2019-03-22 2023-06-13 Nvidia Corporation Shape fusion for image analysis
CN110070022A (en) * 2019-04-16 2019-07-30 西北工业大学 A kind of natural scene material identification method based on image
CN110047047B (en) * 2019-04-17 2023-02-10 广东工业大学 Method for interpreting three-dimensional morphology image information device, apparatus and storage medium
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110246084B (en) * 2019-05-16 2023-03-31 五邑大学 Super-resolution image reconstruction method, system and device thereof, and storage medium
CN110188765B (en) 2019-06-05 2021-04-06 京东方科技集团股份有限公司 Image semantic segmentation model generation method, device, equipment and storage medium
CN110263732B (en) * 2019-06-24 2022-01-21 京东方科技集团股份有限公司 Multi-scale target detection method and device
CN110348447B (en) * 2019-06-27 2022-04-19 电子科技大学 Multi-model integrated target detection method with abundant spatial information
CN110490842B (en) * 2019-07-22 2023-07-04 同济大学 Strip steel surface defect detection method based on deep learning
CN110705344B (en) * 2019-08-21 2023-03-28 中山大学 Crowd counting model based on deep learning and implementation method thereof
CN110633715B (en) * 2019-09-27 2021-09-10 深圳市商汤科技有限公司 Image processing method, network training method and device and electronic equipment
CN110738647B (en) * 2019-10-12 2020-06-12 成都考拉悠然科技有限公司 Mouse detection method integrating multi-receptive-field feature mapping and Gaussian probability model
CN110837811B (en) * 2019-11-12 2021-01-05 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN110969166A (en) * 2019-12-04 2020-04-07 国网智能科技股份有限公司 Small target identification method and system in inspection scene
CN111145196A (en) * 2019-12-11 2020-05-12 中国科学院深圳先进技术研究院 Image segmentation method and device and server
CN111274995B (en) * 2020-02-13 2023-07-14 腾讯科技(深圳)有限公司 Video classification method, apparatus, device and computer readable storage medium
CN111738902A (en) * 2020-03-12 2020-10-02 超威半导体(上海)有限公司 Large convolution kernel real-time approximate fitting method based on bilinear filtering image hierarchy
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN111523546B (en) * 2020-04-16 2023-06-16 湖南大学 Image semantic segmentation method, system and computer storage medium
CN112381020A (en) * 2020-11-20 2021-02-19 深圳市银星智能科技股份有限公司 Video scene identification method and system and electronic equipment
CN113159038B (en) * 2020-12-30 2022-05-27 太原理工大学 Coal rock segmentation method based on multi-mode fusion
CN112837320B (en) * 2021-01-29 2023-10-27 华中科技大学 Remote sensing image semantic segmentation method based on parallel hole convolution
CN113033570B (en) * 2021-03-29 2022-11-11 同济大学 Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN112948952B (en) * 2021-04-08 2023-05-23 郑州航空工业管理学院 Evolution prediction method for cavity behind shield tunnel lining
CN113034371B (en) * 2021-05-27 2021-08-17 四川轻化工大学 Infrared and visible light image fusion method based on feature embedding
CN113744169A (en) * 2021-09-07 2021-12-03 讯飞智元信息科技有限公司 Image enhancement method and device, electronic equipment and storage medium
CN114549913B (en) * 2022-04-25 2022-07-19 深圳思谋信息科技有限公司 Semantic segmentation method and device, computer equipment and storage medium
CN116777768A (en) * 2023-05-25 2023-09-19 珠海移科智能科技有限公司 Robust and efficient scanned document image enhancement method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637045A (en) * 2013-11-14 2015-05-20 重庆理工大学 Image pixel labeling method based on super pixel level features
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637045A (en) * 2013-11-14 2015-05-20 重庆理工大学 Image pixel labeling method based on super pixel level features
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vemulapalli R. et al., "Gaussian Conditional Random Field Network for Semantic Segmentation," 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016-12-12, full text. *

Also Published As

Publication number Publication date
CN109461157A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109461157B (en) Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
Yang et al. Deeperlab: Single-shot image parser
Zhou et al. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction
Bansal et al. Pixelnet: Towards a general pixel-level architecture
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
CN112219223A (en) Generating a displacement map of pairs of input datasets of image or audio data
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
Pfeuffer et al. Semantic segmentation of video sequences with convolutional lstms
Liu et al. TSingNet: Scale-aware and context-rich feature learning for traffic sign detection and recognition in the wild
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN112329801B (en) Convolutional neural network non-local information construction method
CN113361485A (en) Hyperspectral image classification method based on spectral space attention fusion and deformable convolution residual error network
CN111401455A (en) Remote sensing image deep learning classification method and system based on Capsules-Unet model
CN114037640A (en) Image generation method and device
CN110889360A (en) Crowd counting method and system based on switching convolutional network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
Eldesokey et al. Normalized convolution upsampling for refined optical flow estimation
CN115482518A (en) Extensible multitask visual perception method for traffic scene
Liu et al. Focus first: Coarse-to-fine traffic sign detection with stepwise learning
CN115410030A (en) Target detection method, target detection device, computer equipment and storage medium
Jiang et al. Semantic segmentation network combined with edge detection for building extraction in remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant