CN110472653B - Semantic segmentation method based on maximized region mutual information - Google Patents


Info

Publication number
CN110472653B
CN110472653B (application CN201910585061.1A)
Authority
CN
China
Prior art keywords
picture
mutual information
dimensional
segmentation
lower bound
Prior art date
Legal status
Active
Application number
CN201910585061.1A
Other languages
Chinese (zh)
Other versions
CN110472653A (en)
Inventor
赵帅
蔡登
王阳
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority claimed from CN201910585061.1A
Publication of CN110472653A
Application granted
Publication of CN110472653B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method based on maximized region mutual information, which comprises the following steps: (1) inputting a real scene picture into a segmentation model to obtain a predicted segmentation picture; (2) constructing high-dimensional distributions of the predicted picture and the label picture; (3) calculating an approximation of the posterior variance of the high-dimensional distribution of the label picture, given the high-dimensional distribution of the predicted picture; (4) calculating a lower bound of the mutual information between the two high-dimensional distributions; (5) updating the weight parameters of the segmentation model according to the obtained lower bound, maximizing the mutual information of the two distributions and thereby the similarity between the predicted picture and the label picture; (6) repeating steps (1) to (5), finishing training after a preset number of training iterations, and applying the trained model to semantic segmentation. By maximizing the region mutual information between the model's segmentation output and the label picture, the invention enhances the segmentation quality of the segmentation model.

Description

Semantic segmentation method based on maximized region mutual information
Technical Field
The invention belongs to the field of image semantic segmentation in computer vision, and particularly relates to a semantic segmentation method based on maximized region mutual information.
Background
Image semantic segmentation is a basic problem in the field of computer vision, and aims to assign corresponding semantic labels to each pixel in an image, wherein the semantic labels represent object categories to which pixel points belong, such as sky, people, vehicles, buildings and the like. Semantic segmentation has wide application scenes in the fields of automatic driving, medical image analysis, robot vision and the like; the semantic segmentation research also has great heuristic significance to other computer vision problems. In practice, image semantic segmentation is usually treated as a pixel-by-pixel multi-classification problem. In recent years, the problem of semantic segmentation has gained dramatic progress with the development of convolutional neural networks and the introduction of various deep learning models that are well-designed for the task of semantic segmentation. In general, training and optimization of the segmentation model are accomplished by minimizing the average classification loss of pixels. Among these, the most commonly used semantic segmentation loss function is the softmax cross-entropy loss function:
$$\mathcal{L}_{\mathrm{CE}}(y,p)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}$$
where p ∈ [0, 1] is the classification probability predicted by the model, y ∈ {0, 1} is the real object class label, N is the number of pixels in the picture, and C is the number of object classes. Minimizing the cross-entropy loss between y and p is equivalent to minimizing the relative entropy, i.e. the Kullback-Leibler (KL) divergence, between them.
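As an illustration, the pixel-wise loss above can be sketched in a few lines of NumPy. This is only a minimal sketch for clarity, not the patented implementation; the helper name `pixelwise_cross_entropy` is invented for this example.

```python
import numpy as np

def pixelwise_cross_entropy(p, y, eps=1e-12):
    """Mean cross-entropy over N pixels: -(1/N) * sum_i sum_c y_ic * log(p_ic)."""
    return float(-np.mean(np.sum(y * np.log(p + eps), axis=1)))

# Two pixels, three classes: the loss depends only on the probability
# assigned to the true class of each pixel.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
loss = pixelwise_cross_entropy(p, y)   # -(log 0.7 + log 0.8) / 2
```

Note that each pixel contributes independently, which is exactly the property the invention's region loss is designed to complement.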
As the above formula shows, the cross-entropy loss is computed pixel by pixel and therefore ignores the relationships between pixels. However, strong dependencies exist between the pixels of a picture, and the structural information of objects is implied in these dependencies. Because a pixel-wise loss function (such as the cross-entropy loss above) ignores these relationships, a semantic segmentation model supervised and trained with it cannot reliably identify the pixels of objects with weak foreground cues or small spatial structures, and the segmentation quality of the model suffers.
Some previous methods also exploit the correlations between pixels in the picture to enhance segmentation. The article "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials" in the proceedings of the Conference on Neural Information Processing Systems, and the article "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs" in IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 40, pages 834-848, 2018, use a Conditional Random Field (CRF) to fit the relationships between the pixels in the picture. However, the CRF typically requires a time-consuming iterative inference process and is sensitive to the appearance of objects in the image. The article "Learning Affinity via Spatial Propagation Networks" at the 31st Conference on Neural Information Processing Systems in 2017 and the article "Adaptive Affinity Fields for Semantic Segmentation" at the European Conference on Computer Vision in 2018 enhance segmentation using the similarity relationships between points, but obtaining these pixel similarities usually requires additional network branches, and the resulting similarity matrix requires additional memory to store.
Disclosure of Invention
The invention provides a semantic segmentation method based on maximized region mutual information, which is used for enhancing the segmentation effect of a segmentation model by maximizing the region mutual information between a model segmentation picture and a label picture.
A semantic segmentation method based on maximized region mutual information comprises the following steps:
(1) inputting a real scene picture into a segmentation model to obtain a prediction segmentation picture;
(2) constructing high-dimensional distribution of a prediction picture and a label picture;
(3) calculating an approximate value of the posterior variance of the high-dimensional distribution of the label pictures under the condition of the high-dimensional distribution of the given prediction pictures;
(4) calculating the lower bound of mutual information of two high-dimensional distributions of the prediction picture and the label picture according to the obtained approximate value of the posterior variance;
(5) updating the weight parameters of the segmentation model according to the lower bound of the obtained mutual information, and maximizing the mutual information of two high-dimensional distributions so as to maximize the similarity of the predicted picture and the label picture;
(6) repeating steps (1) to (5), finishing training after the preset number of training iterations is reached, and applying the trained model to semantic segmentation.
In step (2), the random variables corresponding to the high-dimensional distributions of the predicted picture and the label picture are:

$$P=[p_1,p_2,\dots,p_d]^T,\qquad Y=[y_1,y_2,\dots,y_d]^T$$

where P ∈ R^d is the high-dimensional variable of the predicted picture and Y ∈ R^d that of the label picture; each p_i lies in the interval [0, 1], each y_i is 0 or 1, and d is the dimension of these vectors, which is also the area of the square region used to construct them.
The probability density functions (PDFs) of the random variables P and Y are f(p) and f(y), respectively, and the PDF of their joint distribution is f(y, p). The distribution of P can also be regarded as the joint distribution of the variables p_1, p_2, ..., p_d, i.e. f(p) = f(p_1, p_2, ..., p_d). The invention maximizes the similarity between P and Y by maximizing their mutual information, so that the predicted picture and the real label picture achieve a higher-level consistency than with a pixel-wise loss alone. The mutual information I(Y; P) of P and Y is defined as follows:
$$I(Y;P)=\int_{\mathcal{Y}}\int_{\mathcal{P}} f(y,p)\,\log\frac{f(y,p)}{f(y)\,f(p)}\;\mathrm{d}p\,\mathrm{d}y$$

where $\mathcal{Y}$ and $\mathcal{P}$ are the support sets of Y and P, respectively. The goal is now to maximize the mutual information I(Y; P) between Y and P so as to maximize the similarity between them, and thus achieve high consistency between the predicted pictures and the real labels.
To calculate I(Y; P), one straightforward way is to estimate the PDFs mentioned above. However, the random variables p_1, p_2, ..., p_d are correlated, which makes their joint probability density f(p) difficult to analyze. The article "Image Similarity Using Mutual Information of Regions" at the 2004 European Conference on Computer Vision demonstrated that, for grayscale images, the distributions of Y and P tend to be Gaussian when d is large enough. In the segmentation setting, however, Y and P only tend toward a Gaussian distribution once the dimension d is 900 or more. This implies a significant consumption of computing resources, which is unacceptable in most cases. This method of constructing a Gaussian distribution is therefore theoretically feasible but impractical.
Since the high-dimensional distributions of Y and P are unknown and directly computing the exact value of their mutual information is infeasible, the invention derives a lower bound of the mutual information and maximizes the true mutual information I(Y; P) by maximizing this lower bound.
By the properties of mutual information, I(Y; P) = H(Y) − H(Y|P), where H(Y) is the entropy of the random variable Y and H(Y|P) is the conditional entropy of the random variable pair (Y, P). Meanwhile, among all distributions with a given covariance, the Gaussian distribution has the largest entropy, and a Gaussian with covariance matrix Σ ∈ R^{d×d} has entropy

$$H = \frac{1}{2}\log\big((2\pi e)^d \det(\Sigma)\big)$$

where e is Euler's number and det(·) denotes the determinant of a matrix. Thus a lower bound of the mutual information I(Y; P) is obtained:

$$I(Y;P) \ge H(Y) - \frac{1}{2}\log\big((2\pi e)^d \det(\Sigma_{Y|P})\big)$$

Here Σ_{Y|P} is the posterior covariance matrix of Y given P, a symmetric positive semi-definite matrix, and d is the dimension of the random vectors Y and P. Imitating the common cross-entropy loss, the constant part of the objective that does not depend on the model parameters is omitted, yielding a simplified mutual-information lower bound I_l(Y; P) for optimization:

$$I_l(Y;P) = -\frac{1}{2}\log\det(\Sigma_{Y|P})$$
In step (3), since the specific distributions of Y and P are unknown, the posterior covariance matrix cannot be computed directly. The invention therefore computes an approximation of the posterior variance to obtain an approximation of the mutual-information lower bound.
For random variables Y and P, E(Y) is the mean of Y (also denoted μ_Y), Var(Y) is the variance of Y (also denoted Σ_Y), and Cov(Y, P) is the covariance of Y and P. The symbol Y ⊥₂ P denotes that Y and P are second-order independent, meaning that for any value p of P, E(Y|P = p) = E(Y) and Var(Y|P = p) = Var(Y). Second-order independence is a weaker constraint than strict mutual independence. Furthermore, the regression matrix of Y on P is

$$A_{yp} = \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}$$

By computing the correlation coefficient between Y − A_{yp}P and P, it is easily seen that the two are uncorrelated. To obtain an approximation of the posterior variance, assume that

$$(Y - A_{yp}P)\ \perp_2\ P$$

This assumption implies that the posterior variance Var(Y − A_{yp}P | P = p) is independent of the value p of P. By the properties of the covariance matrix and the definition of second-order independence, the approximate formula for the posterior variance is:

$$\mathrm{Var}(Y\,|\,P=p) \approx \Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T$$
where P ∈ R^d and Y ∈ R^d are the constructed high-dimensional variables of the predicted picture and the label picture, respectively; Σ_Y is the variance of Y and Σ_P^{-1} is the inverse of the variance of P; Cov(Y, P) is the covariance matrix between Y and P and Cov(Y, P)^T its transpose; Var(Y|P = p) is the posterior variance of Y given P; and A_{yp} = Cov(Y, P) Σ_P^{-1} is the regression matrix of Y on P.
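The approximation above can be sketched directly from sample matrices with NumPy. This is a minimal illustration under the stated formula, not the patented implementation; the helper name `posterior_var_approx` is invented here.

```python
import numpy as np

def posterior_var_approx(Y, P):
    """Sigma_Y - Cov(Y,P) Sigma_P^{-1} Cov(Y,P)^T from sample matrices.

    Y, P : (d, n) arrays holding n sampled d-dimensional points each.
    """
    d = Y.shape[0]
    joint = np.cov(np.vstack([Y, P]))          # (2d, 2d) joint covariance
    sigma_y = joint[:d, :d]
    sigma_p = joint[d:, d:]
    cov_yp = joint[:d, d:]
    return sigma_y - cov_yp @ np.linalg.inv(sigma_p) @ cov_yp.T

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 50))
Y = 2.0 * P + 1.0                  # Y fully determined by P ...
V = posterior_var_approx(Y, P)     # ... so the posterior variance vanishes
```

The sanity check reflects the intuition: when Y is a linear function of P, knowing P leaves no residual variance in Y.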
In step (4), the lower bound of the mutual information of the two high-dimensional distributions is calculated as:

$$I_l(Y;P) = -\frac{1}{2}\log\det\!\big(\Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T\big)$$

where I_l(Y; P) denotes the lower bound of the mutual information between the random variables Y and P; Σ_Y is the variance of Y and Σ_P^{-1} the inverse of the variance of P; Cov(Y, P) is the covariance matrix between Y and P and Cov(Y, P)^T its transpose; det(·) denotes the determinant of the matrix in brackets; and log(·) denotes the logarithm with base the natural number e.
For brevity, define

$$\mathcal{M} = \Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T$$

i.e. the approximation of the posterior variance from step (3). When the lower bound is computed by the formula above, one finds

$$I_l(Y;P) = -\frac{1}{2}\log\det(\mathcal{M}) = -\frac{1}{2}\sum_{i=1}^{d}\log\lambda_i$$

where the λ_i are the eigenvalues of the matrix M. This shows that the magnitude of I_l(Y; P) scales with the dimension of the matrix. The value of I_l(Y; P) is therefore further divided by the dimension d to eliminate the influence of the matrix dimension on the value of the lower bound:

$$I_l(Y;P) = -\frac{1}{2d}\log\det(\mathcal{M})$$

where d is the dimension of the random variables Y and P.
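The normalized bound can be sketched with NumPy's stable log-determinant. A minimal sketch for illustration only; the helper name `mi_lower_bound` is invented here.

```python
import numpy as np

def mi_lower_bound(M):
    """Normalized lower bound -(1/2d) * log det(M) for a d x d
    posterior-variance approximation M (symmetric positive definite)."""
    d = M.shape[0]
    sign, logdet = np.linalg.slogdet(M)   # numerically stable log-determinant
    assert sign > 0, "M must be positive definite"
    return -0.5 * logdet / d

M = np.e * np.eye(3)          # log det(M) = 3, d = 3
bound = mi_lower_bound(M)     # -0.5 * 3 / 3
```

Using `slogdet` instead of `det` avoids the underflow that the patent later addresses with double precision and a Cholesky factorization.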
In step (5), after the lower bound of the mutual information is obtained, the total loss function for optimizing the segmentation model is:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}}(y,p) - \lambda\,\frac{1}{BC}\sum_{b=1}^{B}\sum_{c=1}^{C} I_l^{(b,c)}(Y;P)$$

where L_total is the total loss for training the model; L_CE(y, p) is the cross-entropy loss between the label picture y and the predicted picture p; B is the number of training pictures in a random training batch; C is the number of channels of the predicted picture, i.e. the number of object classes in the real picture; I_l^{(b,c)}(Y; P) is the mutual-information lower bound of channel c of the b-th training picture; and λ is a weighting factor, set to 0.5 by default. The maximization of mutual information is thus converted into the minimization of its negative.
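The combination of the two terms can be sketched as follows. This is a hedged sketch of the weighting described above, not the patented training code; the helper name `total_loss` is invented here.

```python
import numpy as np

def total_loss(ce_loss, mi_bounds, lam=0.5):
    """Combine pixel-wise CE with the region term: L = CE - lam * mean(I_l).

    mi_bounds : (B, C) array of lower bounds, one per picture and channel;
    maximizing mutual information becomes minimizing its negative.
    """
    return float(ce_loss - lam * np.mean(mi_bounds))

bounds = np.array([[0.2, 0.4],
                   [0.6, 0.8]])      # B = 2 pictures, C = 2 channels
loss = total_loss(1.0, bounds)       # 1.0 - 0.5 * mean = 1.0 - 0.5 * 0.5
```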
Compared with the prior art, the invention has the following beneficial effects:
1. The region mutual information loss function proposed by the invention provides an intuitive way to fit the relationships between pixels in a picture and to measure the structural similarity between two pictures; it requires only a small amount of computing resources and is easy to use.
2. With the proposed method the model is easy to train, and no additional inference step or network structure is needed. Extensive experiments show that segmentation models trained with the proposed method outperform the baseline algorithm and other methods of the same type.
Drawings
FIG. 1 is a general framework and flow diagram of the present invention;
FIG. 2 is a diagram illustrating the construction of high-dimensional distribution of predicted pictures and labeled pictures according to the present invention;
fig. 3 shows the result of qualitative segmentation on the PASCAL VOC 2012 validation set according to an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a semantic segmentation method based on maximized regional mutual information constructs high-dimensional distributions of a predicted picture and a real tag picture after obtaining the predicted picture output by a segmentation model, and calculates a posterior variance matrix of the real tag distribution given the high-dimensional distribution of the predicted picture; and then the lower bound of the mutual information of the two high-dimensional distributions can be calculated, and the segmentation model can maximize the value of the real mutual information between the segmentation picture and the real label picture by maximizing the lower bound of the mutual information, so that the similarity between the segmentation picture and the real label picture is maximized. The segmentation model trained in the way has better segmentation performance than the model trained by only using the cross entropy loss of each pixel.
As shown in fig. 2, for a predicted picture or a real label picture, each pixel is represented by itself together with several surrounding pixels, so that the pixel becomes a high-dimensional point; sliding this construction from pixel to pixel over the picture yields many high-dimensional points. A picture can then be represented by this cloud of high-dimensionally distributed points. By maximizing the similarity of these two high-dimensional distributions, the predicted picture and the label picture achieve a higher consistency than with the pixel-wise cross-entropy loss alone. In other words, the segmentation quality of the model improves.
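The sliding construction above can be sketched with NumPy's windowing utility. A minimal sketch for illustration, not the patented implementation; the helper name `region_points` is invented here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def region_points(img, r):
    """Collect every r-by-r neighbourhood of a 2-D map as one d-dimensional
    point (d = r * r), giving a cloud of points that represents the picture."""
    windows = sliding_window_view(img, (r, r))   # (H-r+1, W-r+1, r, r)
    return windows.reshape(-1, r * r)            # one point per position

img = np.arange(16, dtype=float).reshape(4, 4)
pts = region_points(img, 3)          # four 9-dimensional points from a 4x4 map
```

Applying this to the prediction and the label yields the two point clouds whose mutual information the loss maximizes.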
Constructing the high-dimensional distribution in this way may incur excessive memory overhead and increase the number of floating-point operations. For example, if 9-dimensional points are constructed, the storage required for the high-dimensional distribution is 9 times that of the original picture. In current large-scale deep learning this means a memory overhead of several GB or even tens of GB, and the accompanying increase in floating-point computation is likewise unaffordable. To reduce the consumption of computing resources, the invention first down-samples the predicted picture and the label picture before constructing the high-dimensional distribution, which keeps the computing overhead within an acceptable range. The down-sampling has a certain negative effect on model performance, but it gives the proposed loss function for maximizing the region mutual information real practical applicability.
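The down-sampling step can be sketched as simple average pooling (the variant Table 2 later reports as best). A minimal sketch under that assumption; the helper name `avg_pool` is invented here.

```python
import numpy as np

def avg_pool(img, k):
    """k x k average pooling, i.e. down-sampling by factor k
    (assumes H and W are divisible by k)."""
    H, W = img.shape
    return img.reshape(H // k, k, W // k, k).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
small = avg_pool(img, 2)      # 4x4 -> 2x2, each entry the mean of a 2x2 block
```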
When calculating the lower bound of the mutual information, the determinant of the covariance matrix M must be evaluated, which can cause numerical underflow. This is because the probabilities given by the segmentation model come from softmax or sigmoid operations and can be very small, while the number of pixels in one picture can be very large. For example, a common training image size is 513 × 513, which means approximately 263000 points per picture. If the covariance matrix is computed by the formula Cov(Y, Y) = E((Y − μ_Y)(Y − μ_Y)^T), the values of some of its elements are very likely to be tiny, so in practical application the lower-bound formula is rewritten as:

$$I_l(Y;P) = -\frac{1}{2d}\,\mathrm{Tr}\big(\log(\mathcal{M})\big),\qquad \mathcal{M} = \Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T$$

Here Tr(·) indicates the trace of the matrix and log(M) the matrix logarithm; since log det(M) = Tr(log M), the determinant never has to be formed explicitly.
in addition, since M is a symmetric semi-positive definite matrix, in practical application, a small positive number is added to the element on the diagonal of M, and M + ξ I is formed. The influence on the optimal solution of the system is extremely small, but the operation speed of the matrix operation can be accelerated by carrying out the Kariski decomposition on the new symmetric positive definite matrix M. In practice ξ -1 e-6 is set. In order to ensure the accuracy of calculation, the invention uses double-precision floating point number in the operation related to the lower bound of the calculation mutual information. It is also noted that log (det (M)) is a concave function with respect to matrix M when matrix M is a positive definite matrix. This property makes the system easy to optimize.
The overall objective function for optimizing the segmentation model in practice is:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}}(y,p) - \lambda\,\frac{1}{BC}\sum_{b=1}^{B}\sum_{c=1}^{C} I_l^{(b,c)}(Y;P)$$

where λ is a weighting factor, L_CE(y, p) is the cross-entropy loss between y and p, and B is the number of training pictures in a random training batch; the problem of maximizing the lower bound of the mutual information is converted into a minimization problem. In practice λ is set to 0.5. The conventional cross-entropy loss measures the similarity between the pixel intensities of the two pictures, while the region mutual information measures the structural similarity between them.
In practice, when the model outputs the final predicted probability values, the invention uses the sigmoid operation instead of the common softmax operation. This is because the lower bound of the region mutual information is computed separately for each channel, whereas the softmax operation explicitly introduces a very strong correlation between the channels, which may cause unpredictable results. Experiments show that models trained with softmax cross-entropy loss and with sigmoid cross-entropy loss perform approximately the same, so the choice between these two operations has a negligible impact on model performance.
In fig. 3, the segmentation effect of the segmentation model trained with the algorithm of the present invention and with the conventional method is shown. It can be seen that the segmentation model trained by the algorithm of the invention has better performance on segmentation details, such as segmentation effects of animal legs and human limbs, compared with the segmentation model trained by the conventional method; the overall visual effect of the segmented picture is greatly improved. This qualitatively demonstrates the effectiveness of the proposed algorithm.
In order to embody the technical effects and advantages of the present invention, the method proposed by the present invention is applied to practical examples, and compared with other methods of the same type.
The segmentation models adopted by the invention are the DeepLabv3 and DeepLabv3+ semantic segmentation models at the current leading edge, and the performance of the segmentation models can be compared by using the method provided by the invention and other methods.
The invention was tested on two public data sets, PASCAL VOC 2012 and CamVid. The PASCAL VOC 2012 data set is divided into three parts: training, validation and test sets with 1464, 1449 and 1456 pictures, respectively. Training used the augmented PASCAL VOC 2012 data set, which contains 10582 pictures. The CamVid data set is a street-scene data set whose training, validation and test sets contain 367, 101 and 233 pictures, respectively. The segmentation model is trained on the training and validation sets of CamVid and its effect is evaluated on the test set.
The evaluation index used is the mean intersection-over-union (mIoU) score, i.e. the ratio of the intersection to the union of objects in the predicted and ground-truth segmentation pictures. The effectiveness of the algorithm is first verified on the PASCAL VOC 2012 validation set; the results are shown in Table 1.
TABLE 1
(Table 1 is reproduced as an image in the original document: mIoU and inference-time comparison of CE, BCE, CRF, Affinity and the proposed method with DeepLabv3/DeepLabv3+ on the PASCAL VOC 2012 validation set.)
In Table 1, CE and BCE denote the conventional softmax and sigmoid cross-entropy losses, respectively, and the CE row reports the numbers published in the DeepLabv3 and DeepLabv3+ papers.
CRF and Affinity are algorithms of the same type as the proposed method, aiming to fit the relationships between pixels in the picture. CRF-X means that X iterative inference steps are used with the CRF. The inference time is the time the method needs to output a segmentation picture for one input picture; the table shows that the CRF has a time-consuming iterative inference process. It is also clear that the segmentation model trained with the proposed algorithm outperforms the conventional method and several methods of the same type.
Furthermore, the invention performs control variable experiments on the PASCAL VOC 2012 validation set to test the influence of different elements in the image semantic segmentation system proposed by the invention on the final segmentation result, and the results are shown in table 2.
TABLE 2
(Table 2 is reproduced as an image in the original document: ablation results on the PASCAL VOC 2012 validation set for different down-sampling modes, down-sampling factors and square-region side lengths.)
In Table 2, the down-sampling modes include interpolation (Int.), maximum pooling (Max.) and mean pooling (Avg.). The down-sampling factor is the ratio of the original picture size to the size of the down-sampled picture. The side length of the square region determines the dimension of the constructed high-dimensional distributions of the predicted picture and the label picture. As Table 2 shows, average pooling, smaller down-sampling factors and larger square regions (higher-dimensional distributions) lead to relatively better performance.
Furthermore, the validity of the proposed algorithm is verified on the CamVid data set.
TABLE 3
(Table 3 is reproduced as an image in the original document: per-class and overall segmentation results on the CamVid test set.)
As shown in Table 3, when comparing both the per-class and the overall segmentation results on the CamVid data set, the proposed algorithm still outperforms the baseline algorithm and several methods of the same type. This demonstrates the generality and superiority of the proposed loss function and image semantic segmentation system based on maximized region mutual information.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (4)

1. A semantic segmentation method based on maximized regional mutual information is characterized by comprising the following steps:
(1) inputting a real scene picture into a segmentation model to obtain a prediction picture;
(2) constructing high-dimensional distribution of a prediction picture and a label picture;
(3) calculating an approximation of the posterior variance of the high-dimensional distribution of the label picture given the high-dimensional distribution of the predicted picture, the approximate formula of the posterior variance being:

$$\mathrm{Var}(Y\,|\,P=p) \approx \Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T$$

where P ∈ R^d and Y ∈ R^d are the constructed high-dimensional variables of the predicted picture and the label picture, respectively; Σ_Y is the variance of Y and Σ_P^{-1} is the inverse of the variance of P; Cov(Y, P) is the covariance matrix between Y and P and Cov(Y, P)^T its transpose; Var(Y|P = p) is the approximation of the posterior variance of Y given P; and A_{yp} = Cov(Y, P) Σ_P^{-1} is the regression matrix of Y on P;
(4) calculating the lower bound of the mutual information of the two high-dimensional distributions of the predicted picture and the label picture from the obtained approximation of the posterior variance, the lower bound of the mutual information being:

$$I_l(Y;P) = -\frac{1}{2}\log\det\!\big(\Sigma_Y - \mathrm{Cov}(Y,P)\,\Sigma_P^{-1}\,\mathrm{Cov}(Y,P)^T\big)$$

where I_l(Y; P) denotes the lower bound of the mutual information between the random variables Y and P; Σ_Y is the variance of Y and Σ_P^{-1} the inverse of the variance of P; Cov(Y, P) is the covariance matrix between Y and P and Cov(Y, P)^T its transpose; det(·) denotes the determinant of the matrix in brackets; and log(·) denotes the logarithm with base the natural number e;
(5) updating the weight parameters of the segmentation model according to the obtained lower bound of the mutual information, maximizing the mutual information of the two high-dimensional distributions and thereby maximizing the similarity between the predicted picture and the label picture;
after the lower bound of the mutual information is obtained through calculation, the total loss function used to optimize the segmentation model is:

L_total = L_CE(y, p) − λ · (1 / (B·C)) · Σ_{b=1..B} Σ_{c=1..C} I_l^(b,c)(Y; P)

wherein L_total is the total loss used to train the model; L_CE(y, p) is the conventional cross-entropy loss between the label picture y and the predicted picture p; B is the number of training pictures in a random training batch; C is the number of channels of the predicted picture, i.e. the number of object classes in the real picture; I_l^(b,c)(Y; P) is the lower bound of the mutual information for the c-th channel of the b-th training picture; and λ is a weight factor;
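A sketch of combining the two loss terms (illustrative only; binary cross-entropy is used here as a stand-in for the "conventional cross-entropy loss", and the function name and λ default of 0.5, following claim 4, are choices of this sketch):

```python
import numpy as np

def total_loss(y, p, mi_lower_bounds, lam=0.5):
    """Total loss = cross-entropy(y, p) - lam * mean of the MI lower bounds.

    y, p: flat arrays of ground-truth labels (0/1) and predicted
          probabilities in (0, 1) for one batch.
    mi_lower_bounds: array of shape (B, C), one lower bound per
          training picture b and channel c.
    """
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)
    # Per-pixel binary cross-entropy, averaged over the batch
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Subtracting the mean lower bound means that minimizing the total
    # loss maximizes the region mutual information
    return ce - lam * np.mean(mi_lower_bounds)
```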
(6) repeating the steps (1) to (5); training is finished after the preset number of training iterations is reached, and the trained model is applied to semantic segmentation.
2. The semantic segmentation method based on maximized region mutual information as claimed in claim 1, wherein in the step (2) the random variables corresponding to the high-dimensional distributions of the predicted picture and the label picture are:

P = [p_1, p_2, ..., p_d]^T
Y = [y_1, y_2, ..., y_d]^T

wherein P ∈ R^d is the high-dimensional variable of the predicted picture and Y ∈ R^d is the high-dimensional variable of the label picture; each p_i lies in the interval [0, 1] and each y_i is 0 or 1; d is the dimension of these high-dimensional vectors and is also the area of the region used to construct them.
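A sketch of building such d-dimensional vectors from image regions (illustrative only; the 3×3 region size, the overlapping sliding window, and the name `region_vectors` are assumptions of this sketch, not specified by the claim):

```python
import numpy as np

def region_vectors(img, size=3):
    """Unfold an (H, W) map into vectors of all size x size regions.

    Returns an array of shape (num_regions, d) with d = size * size,
    one row per overlapping square region, so that d equals the area
    of the region used to build each high-dimensional vector.
    """
    h, w = img.shape
    rows = []
    for i in range(h - size + 1):
        for j in range(w - size + 1):
            rows.append(img[i:i + size, j:j + size].ravel())
    return np.array(rows)
```

Applied to a prediction map this yields samples of P, and applied to the label map it yields the matching samples of Y.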
3. The semantic segmentation method based on maximized region mutual information as claimed in claim 1, wherein, when the lower bound of the mutual information of the two high-dimensional distributions is calculated, the influence of the matrix dimension on the magnitude of the lower bound is eliminated by normalizing with the dimension:

I_l(Y; P) = −(1/(2d)) log det( Σ_Y − Cov(Y, P) Σ_P^(-1) Cov(Y, P)^T )

where d is the dimension of the random variables Y and P.
4. The semantic segmentation method based on maximized region mutual information as claimed in claim 1, wherein the value of λ is set to 0.5.
CN201910585061.1A 2019-07-01 2019-07-01 Semantic segmentation method based on maximized region mutual information Active CN110472653B (en)

Publications (2)

Publication Number | Publication Date
CN110472653A | 2019-11-19
CN110472653B | 2021-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant