CN110136062B - Super-resolution reconstruction method combining semantic segmentation


Info

Publication number
CN110136062B
CN110136062B (application CN201910389111.9A)
Authority
CN
China
Prior art keywords
resolution
semantic
network
super
reconstruction
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910389111.9A
Other languages
Chinese (zh)
Other versions
CN110136062A (en)
Inventor
向炟
陈军
杨玉红
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910389111.9A priority Critical patent/CN110136062B/en
Publication of CN110136062A publication Critical patent/CN110136062A/en
Application granted granted Critical
Publication of CN110136062B publication Critical patent/CN110136062B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The invention provides an image super-resolution reconstruction method combined with semantic segmentation, which uses the intermediate and final results produced when a low-quality image is semantically segmented to guide super-resolution reconstruction, and yields realistic results even at large magnification factors. Because the high-level semantic information of an image is inherent to the image and contains a large number of class priors at the pixel level, it can serve as constraint information during super-resolution reconstruction to improve the quality of the result. The method combines image super-resolution reconstruction, a low-level computer vision problem, with image semantic segmentation, a high-level problem: the various kinds of information produced by semantically segmenting an image are used to constrain and enhance the reconstruction process. This addresses the lack of realism when reconstructing low-resolution images at large scaling factors and brings a marked improvement in subjective quality evaluation.

Description

Super-resolution reconstruction method combining semantic segmentation
Technical Field
The invention relates to the technical field of image processing, in particular to a method for reconstructing super-resolution of an image by utilizing semantic segmentation.
Background
Image super-resolution reconstruction refers to converting a low-resolution image into a high-resolution image by various technical means, recovering more high-frequency information so that the image has clearer texture and detail. Since it was first proposed, image super-resolution reconstruction has developed for half a century, and the many existing methods can be roughly classified into three categories according to their principles: interpolation-based methods, reconstruction-based methods, and learning-based methods.
Interpolation-based methods link the super-resolution reconstruction problem with the image interpolation problem and are the most direct approach to super-resolution reconstruction. Common interpolation methods include nearest-neighbor interpolation, bilinear interpolation, and bicubic interpolation. The core idea is that, for each point in the target image, the related points in the source image are found according to the scaling relationship, and the pixel value of the target point is then obtained by interpolating the pixel values of those related points. Interpolation-based methods are simple, intuitive, and fast, but their adaptability is relatively poor: prior information about the image is hard to incorporate, extra noise is easily introduced, and the reconstructed image tends to lack detail and to exhibit blurring, jagged edges, and similar artifacts.
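As a concrete illustration of this interpolation baseline, the short Python sketch below upscales an image four-fold with the three kernels just mentioned; the file names are placeholder assumptions.

```python
from PIL import Image

# Minimal sketch of interpolation-based upscaling (file names are placeholders).
lr = Image.open("low_res.png")
scale = 4
target_size = (lr.width * scale, lr.height * scale)

# Each call maps target pixels back into the source grid and interpolates
# neighboring source pixels, as described above.
nearest = lr.resize(target_size, Image.NEAREST)    # nearest-neighbor
bilinear = lr.resize(target_size, Image.BILINEAR)  # bilinear
bicubic = lr.resize(target_size, Image.BICUBIC)    # bicubic

bicubic.save("bicubic_x4.png")
```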
Reconstruction-based methods have received the most extensive attention and research. They assume that a low-resolution image is produced from a high-resolution image through some motion transformation, blurring, and noise, and they convert the super-resolution reconstruction problem into optimizing a cost function under constraints. The key idea is to start from the degradation model of the image, extract the key information in the low-resolution image using regularization and similar techniques, and constrain the generation of the super-resolution image with prior knowledge about the unknown high-resolution image. Only a few local prior assumptions are needed during reconstruction, which alleviates the blurring and jagged artifacts produced by interpolation to some extent; however, when the magnification factor is too large, the degradation model can no longer provide the prior knowledge required for reconstruction, and the result lacks high-frequency information.
Learning-based methods have been a hot research direction for super-resolution algorithms in recent years. The basic idea is to learn a joint system model from a training set containing pairs of high-resolution and low-resolution images, and then apply the learned model to similar low-resolution images to raise their resolution. Learning-based methods make full use of the prior knowledge of images, can recover more of the high-frequency information in a low-resolution image, and obtain better reconstruction results than the other two classes of methods. Among all learning-based methods, super-resolution reconstruction based on deep learning has achieved excellent performance in recent years.
Although current single-image super-resolution reconstruction technology has made breakthroughs in accuracy and speed by means of deep learning, its effectiveness drops when processing more complex low-resolution images. For example, when the low-resolution image contains many objects that largely overlap and occlude one another, existing methods cannot delineate the boundaries between the overlapped and occluded objects well, so the reconstruction lacks texture detail and may even merge several overlapping objects into one.
Disclosure of Invention
In order to solve these problems, the invention provides a brand-new super-resolution reconstruction method combining semantic segmentation. Semantic segmentation is one of the basic tasks in computer vision; its purpose is to classify visual input into different semantically interpretable categories, that is, for an image, to assign each pixel to one of several classes. Because semantic segmentation classifies pixels, a super-resolution reconstruction method combined with it can better handle a low-resolution image containing many overlapping and occluded objects.
Aiming at the defects of the prior art, the invention provides a method for performing super-resolution reconstruction on a low-resolution image, which comprises the following steps:
step 1, constructing a low-resolution semantic segmentation data set, wherein the low-resolution semantic segmentation data set comprises a low-resolution image and a corresponding semantic layout;
step 2, training a semantic segmentation network by using a low-resolution semantic segmentation data set;
step 3, constructing a data set for training the super-resolution reconstruction network, wherein the data set for training the super-resolution reconstruction network comprises a semantic layout chart and a semantic feature chart of a low-resolution image and a corresponding high-resolution image, and the semantic layout chart and the semantic feature chart of the low-resolution image are obtained by inputting the low-resolution image into the semantic segmentation network trained in the step 2;
step 4, taking the semantic layout map and the semantic feature map as input, taking a high-resolution image corresponding to the semantic layout map as a true value, and training a super-resolution reconstruction network to output a corresponding high-resolution reconstruction result according to the input semantic layout map;
and step 5, inputting a low-resolution picture to be reconstructed into the semantic segmentation network trained in step 2 to obtain its semantic layout map and semantic feature map, inputting the semantic layout map and semantic feature map into the super-resolution reconstruction network trained in step 4, and finally obtaining the reconstructed high-resolution image.
Further, the low-resolution semantic segmentation data set in step 1 is obtained by down-sampling the high-resolution image and the semantic layout map in the normal semantic segmentation data set with the same scaling factor, and the obtained low-resolution image and the semantic layout map form the low-resolution semantic segmentation data set.
Further, the semantic segmentation network in step 2 is a fully convolutional network obtained by changing the fully connected layers in VGG16 into convolutional layers, and the specific network structure is: convolutional layer × 2 + pooling layer + convolutional layer × 3 + pooling layer + convolutional layer × 2 + deconvolution layer, wherein the convolution kernel size of the convolutional layers is 3 × 3 and the pooling layers use max pooling.
Further, the weights of the fully convolutional network are initialized to the weights of a pre-trained VGG16; the loss function optimized during training is the sum of the deviations of the pixel-wise predictions of the last network layer; the specific training parameters are: a batch size of 20, optimization with the Adam algorithm using a momentum of 0.9 and a decay rate of 10⁻⁴, and a network learning rate of 10⁻⁴.
Further, the super-resolution reconstruction network in step 4 is a cascaded reconstruction network composed of a series of cascaded reconstruction modules operating at increasing resolutions, wherein each reconstruction module consists of 3 network layers: the first layer is a feature fusion layer for fusing the input semantic layout map and semantic feature map with the output of the previous module; the latter two layers are convolutional layers with 3 × 3 kernels, layer normalization, and leaky rectified linear units (LReLU), which serve to reconstruct the fused features.
Further, the specific operation relationship among the reconstruction modules in the super-resolution reconstruction network is as follows,
The first reconstruction module takes the semantic layout map and the semantic feature map down-sampled to its resolution as input and outputs a result at that resolution, which can be regarded as a feature map obtained after fusion and convolution. Each later reconstruction module takes the result of the previous module together with the correspondingly down-sampled semantic layout map and semantic feature map as input and outputs a new result; after several such stages, the final result output by the last reconstruction module is the super-resolution reconstruction result. The mathematical description of this process is:

O_1 = F(L_1 ⊕ f_1)

O_i = F(O_{i-1} ⊕ L_i ⊕ f_i), i > 1

wherein O_i represents the output of the i-th reconstruction module, F represents the convolution and other operations inside the module, L_i and f_i represent the semantic layout map and the semantic feature map down-sampled to the resolution of the i-th module, and ⊕ represents feature fusion.
Furthermore, the loss function used when training the super-resolution reconstruction network in step 4 is

Loss(θ) = Σ_l λ_l ‖Φ_l(I) - Φ_l(f(L; θ))‖

wherein I is the high-resolution image representing the true value, f is the cascaded reconstruction network to be trained, θ is the set of parameters of f, L is the input semantic layout, Φ is a trained visual perception network (a VGG network), Φ_l denotes a convolutional layer of the visual perception network, and λ_l are hyper-parameters controlling the weights, whose values are adjusted as training proceeds.
Further, when training the super-resolution reconstruction network, the specific settings are: the total number of iterations is 200 epochs; the learning rate of the model is 10⁻⁴ and is halved every 100 epochs; and optimization uses the Adam algorithm with a momentum of 0.9 and a decay rate of 10⁻⁴.
Compared with the prior art, the invention has the following advantages and positive effects:
because the high-level semantic information of the image is used as the inherent information of the image and contains a large amount of class priors on the pixel level, the high-level semantic information can be used as constraint information in the super-resolution reconstruction process to improve the quality of the reconstruction result. The method combines the computer vision low-level problem of image super-resolution reconstruction and the image semantic segmentation as a high-level problem, utilizes various information generated after the image is subjected to semantic segmentation to constrain and enhance the super-resolution reconstruction process, solves the problem that the reconstruction of a low-resolution image lacks authenticity under the condition of a large zoom factor, and has higher improvement on subjective quality evaluation.
Drawings
Fig. 1 is a network structure diagram of a full convolutional network in an embodiment of the present invention.
Fig. 2 is a block diagram of a cascaded reconstruction network according to an embodiment of the present invention.
Fig. 3 is an overall flow chart of the present invention.
FIG. 4 is a comparison of the visual effects of the present invention and the comparison methods, wherein (a) is Bicubic, (b) is SRCNN, (c) is SRDenseNet, (d) is SRGAN, and (e) is the present invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are merely illustrative and explanatory of the invention and do not restrict it.
The invention combines the characteristics of two computer vision tasks, image semantic segmentation and image super-resolution reconstruction, taking the features generated when an image is semantically segmented as prior information for super-resolution reconstruction, and provides an image super-resolution reconstruction method combined with semantic segmentation. The overall flow of the method is shown in FIG. 3. The method can be implemented with computer software technology, and the embodiment illustrates the flow of the invention with the training of the networks as its main content, as follows:
Step 1, constructing a low-resolution semantic segmentation data set comprising low-resolution pictures and their corresponding semantic layouts. A general semantic segmentation data set comprises high-resolution pictures and the semantic layouts corresponding to them; uniformly down-sampling both yields a low-resolution semantic segmentation data set.
In a specific implementation, image processing software reads all the high-resolution pictures and their corresponding semantic layout maps and unifies the picture sizes, after which all the high-resolution pictures are down-sampled by bicubic interpolation with a scaling factor of 4. The corresponding semantic layouts are then down-sampled to the same resolution. In this way, a low-resolution semantic segmentation data set composed of low-resolution images and corresponding semantic layouts is obtained.
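A minimal Python sketch of this construction follows, assuming Cityscapes-style folders of pictures and label maps; the paths and the unified size are placeholder assumptions, and the label map is resized with nearest-neighbor (rather than bicubic) so that class indices remain valid.

```python
import glob
import os
from PIL import Image

# Minimal sketch of step 1, assuming Cityscapes-style folders of pictures and
# label maps; all paths and the unified size are placeholder assumptions.
SCALE = 4
UNIFIED_SIZE = (2048, 1024)  # (width, height) before down-sampling

def build_lr_dataset(img_dir, label_dir, out_img_dir, out_label_dir):
    os.makedirs(out_img_dir, exist_ok=True)
    os.makedirs(out_label_dir, exist_ok=True)
    lr_size = (UNIFIED_SIZE[0] // SCALE, UNIFIED_SIZE[1] // SCALE)
    for img_path in glob.glob(os.path.join(img_dir, "*.png")):
        name = os.path.basename(img_path)
        img = Image.open(img_path).resize(UNIFIED_SIZE, Image.BICUBIC)
        lbl = Image.open(os.path.join(label_dir, name)).resize(UNIFIED_SIZE, Image.NEAREST)
        # Bicubic down-sampling for the picture, as in the embodiment; the label
        # map uses nearest-neighbor so that class indices remain valid.
        img.resize(lr_size, Image.BICUBIC).save(os.path.join(out_img_dir, name))
        lbl.resize(lr_size, Image.NEAREST).save(os.path.join(out_label_dir, name))
```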
Step 2, training the semantic segmentation network with the low-resolution semantic segmentation data set. Ordinary semantic segmentation networks process high-resolution pictures; training the network with the low-resolution data set obtained in step 1 enables it to output an accurate corresponding semantic layout when a low-resolution image is input.
in the present embodiment, the semantic segmentation network is described by taking a Full Convolution Network (FCN) as an example. The full convolutional network is a convolutional neural network without a full connection layer, can predict and classify each pixel in a picture to obtain a semantic segmentation result, and has a network structure shown in fig. 1. In particular, the full convolutional network improvement in the present embodiment is obtained by changing the full link layer in the VGG16 to the convolutional layer from the VGG16 classification network. In a full convolutional network, note xijIs a data vector, y, of a certain layer (i, j) position of the networkijData vector, y, for the (i, j) position of the next network layerijFrom xijThis can be obtained by the following equation:
yij=fks({xsi+i,sj+j}0≤i,j≤k)
wherein k represents a convolution kernelS represents the step size of the convolution kernel or the down-sampling factor, si, sj represents the change of the position coordinate of the data vector of the original network layer (i, j) position after the convolution or pooling operation, which is related to s, and i, j represents the space displacement generated in the convolution or pooling process, usually caused by the zero padding operation. f. ofksThe type of network layer is determined, which may be a matrix multiplication for convolution or pooling, or a spatial maximization for maximal pooling, or an elemental nonlinear mapping of the activation function. For a full convolutional network, the functions implemented by each network layer can be summarized by the above formula.
The specific implementation of training the full convolutional network is as follows:
1. and constructing a network. In this embodiment, the full convolution network main body is composed of VGG16, and its network structure is: convolutional layer × 2+ pooling layer + convolutional layer × 3+ pooling layer + convolutional layer × 2+ deconvolution layer. The convolution kernel size of the convolution layer is 3 multiplied by 3, the pooling layer adopts maximum pooling, and the data size becomes smaller and the channels become more as the convolution layer goes deeper.
2. Initializing the weights of the network. Unlike the usual random initialization, the weights in this embodiment are initialized to those of a pre-trained VGG16.
3. Training the network. The loss function optimized during training is the sum of the deviations of the pixel-wise predictions of the last network layer. In this embodiment, the specific training parameters are: a batch size of 20, optimization with the Adam algorithm using a momentum of 0.9 and a decay rate of 10⁻⁴, and a network learning rate of 10⁻⁴.
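For illustration, here is the minimal PyTorch sketch promised above of this construction and training setup. The channel widths, the number of classes, and the deconvolution parameters are assumptions for illustration, the mapping of the stated momentum and decay rate onto Adam's hyper-parameters is likewise an assumption, and a standard per-pixel cross-entropy stands in for the "sum of deviations" loss.

```python
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 19  # assumption (e.g. Cityscapes classes)

class FCNSeg(nn.Module):
    """Conv x2 + pool + conv x3 + pool + conv x2 + deconv, per the embodiment."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        def convs(cin, cout, n):
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers
        self.features = nn.Sequential(
            *convs(3, 64, 2), nn.MaxPool2d(2),    # conv x2 + pool
            *convs(64, 128, 3), nn.MaxPool2d(2),  # conv x3 + pool
            *convs(128, 256, 2),                  # conv x2
        )
        # Deconvolution layer restoring the input resolution (x4 after two pools).
        self.deconv = nn.ConvTranspose2d(256, num_classes, 8, stride=4, padding=2)

    def forward(self, x):
        return self.deconv(self.features(x))  # per-pixel class scores

net = FCNSeg()

# Initialize from pre-trained VGG16 wherever the shapes line up (schematic).
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
vgg_convs = [m for m in vgg.features if isinstance(m, nn.Conv2d)]
for src, dst in zip(vgg_convs, [m for m in net.features if isinstance(m, nn.Conv2d)]):
    if src.weight.shape == dst.weight.shape:
        dst.load_state_dict(src.state_dict())

# Batch size 20, Adam with lr 1e-4; "momentum 0.9" is mapped onto Adam's beta1
# and "decay rate 1e-4" onto weight decay (both mappings are assumptions).
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()  # per-pixel loss standing in for the
                                   # patent's "sum of deviations"
```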
Step 3, constructing a data set for training the super-resolution reconstruction network, comprising the semantic layouts and semantic feature maps of low-resolution pictures together with the corresponding high-resolution pictures. Each low-resolution picture is input into the semantic segmentation network obtained in step 2 to obtain its semantic segmentation result, the semantic layout. In addition, the intermediate results generated during segmentation, the semantic feature maps, can be collected. The semantic layouts, the corresponding feature maps, and the corresponding high-resolution pictures form a new data set for training the super-resolution reconstruction network. After an image is input into the semantic segmentation network, the final output of the network is the semantic layout, while the semantic feature maps must be extracted from different layers of the network. In this embodiment, the selected semantic feature maps are the features of the convolutional layers immediately before the pooling layers of the fully convolutional network.
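One way to collect those feature maps is with forward hooks on the layers just before each pooling layer, as in this sketch; it assumes the FCNSeg module sketched above.

```python
import torch
import torch.nn as nn

def extract_layout_and_features(net, lr_image):
    """Run the segmentation network once, returning the semantic layout and the
    feature maps of the layers that directly precede each pooling layer."""
    feats, hooks = [], []
    modules = list(net.features)
    for i, m in enumerate(modules):
        if isinstance(m, nn.MaxPool2d):
            # Hook the output of the layer right before this pooling layer.
            hooks.append(modules[i - 1].register_forward_hook(
                lambda mod, inp, out: feats.append(out.detach())))
    with torch.no_grad():
        logits = net(lr_image)
    for h in hooks:
        h.remove()
    layout = logits.argmax(dim=1, keepdim=True)  # semantic layout map
    return layout, feats
```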
Step 4, taking the semantic layout map and the semantic feature map as input, taking a high-resolution image corresponding to the semantic layout map as a true value, and training a super-resolution reconstruction network to output a corresponding high-resolution reconstruction result according to the input semantic layout map;
in this embodiment, a cascaded reconstruction network composed of a series of cascaded reconstruction modules is selected as the super-resolution reconstruction network, and the structure of the cascaded reconstruction network is shown in fig. 2. Each reconstruction module operates at a different resolution, the resolution of the first module is set to 8 x 16, the resolution of the following modules is doubled in turn, and after 5 reconstruction modules, the final output resolution is 256 x 512. The first reconstruction module takes the semantic layout and the feature map which are down-sampled to the current resolution as input, and outputs a result of the current resolution, wherein the result can be regarded as the feature map after combination and convolution. The later reconstruction module takes the result of the former module, the semantic layout map after down sampling and the feature map as input and outputs a new result. After a plurality of processes, the final result output by the reconstruction module is the super-resolution reconstruction result. The mathematical description of this process is as follows:
O_1 = F(L_1 ⊕ f_1)

O_i = F(O_{i-1} ⊕ L_i ⊕ f_i), i > 1

wherein O_i represents the output of the i-th reconstruction module, F represents the convolution and other operations inside the module, L_i and f_i represent the semantic layout map and the semantic feature map down-sampled to the resolution of the i-th module, and ⊕ represents feature fusion.
Each reconstruction module consists of 3 network layers: the first layer is a feature fusion layer for fusing the input semantic layout map and semantic feature map with the output of the previous module; the latter two layers are convolutional layers with 3 × 3 kernels, layer normalization, and leaky rectified linear units (LReLU), which serve to reconstruct the fused features. Except for the last one, every reconstruction module has the same structure, but each module emphasizes a different aspect of the reconstruction, because its input feature maps contain information at a different level.
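A minimal PyTorch sketch of one reconstruction module and the cascade described above follows. The channel widths, fusion by channel-wise concatenation, the bilinear/nearest resampling choices, and the final 1 × 1 output convolution are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconModule(nn.Module):
    """Feature fusion layer + two 3x3 conv layers with layer norm and LReLU."""
    def __init__(self, in_ch, out_ch, resolution):
        super().__init__()
        h, w = resolution
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm1 = nn.LayerNorm([out_ch, h, w])
        self.norm2 = nn.LayerNorm([out_ch, h, w])

    def forward(self, layout, feat, prev=None):
        size = layout.shape[-2:]
        parts = [layout, feat]
        if prev is not None:  # fuse the previous module's (upsampled) output
            parts.append(F.interpolate(prev, size=size, mode='bilinear'))
        x = torch.cat(parts, dim=1)  # feature fusion by concatenation
        x = F.leaky_relu(self.norm1(self.conv1(x)), 0.2)
        return F.leaky_relu(self.norm2(self.conv2(x)), 0.2)

class CascadedReconNet(nn.Module):
    def __init__(self, layout_ch=1, feat_ch=64, width=64, base=(8, 16), n_modules=5):
        super().__init__()
        self.resolutions = [(base[0] * 2 ** i, base[1] * 2 ** i) for i in range(n_modules)]
        self.blocks = nn.ModuleList(
            ReconModule(layout_ch + feat_ch + (0 if i == 0 else width), width, res)
            for i, res in enumerate(self.resolutions))
        self.to_rgb = nn.Conv2d(width, 3, 1)  # assumed output head

    def forward(self, layout, feat):
        out = None
        for res, block in zip(self.resolutions, self.blocks):
            l = F.interpolate(layout.float(), size=res, mode='nearest')
            f = F.interpolate(feat, size=res, mode='bilinear')
            out = block(l, f, out)
        return self.to_rgb(out)
```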
The cascaded reconstruction network takes the semantic layout as a frame and reconstructs the details of the image using the various kinds of information contained in the feature maps, so the loss function used in training differs from that of general super-resolution reconstruction methods. Unlike a conventional mean-squared-error loss function, which compares the reconstruction result with the original high-definition image pixel by pixel, the cascaded reconstruction network uses a so-called perceptual loss, whose aim is to compare the feature differences between the reconstruction result and the true value inside a visual perception network. It is defined as:

Loss(θ) = Σ_l λ_l ‖Φ_l(I) - Φ_l(f(L; θ))‖

wherein I is the high-resolution image representing the true value, f is the cascaded reconstruction network to be trained, θ is the set of parameters of f, L is the input semantic layout, and λ_l are hyper-parameters controlling the weights, whose values can be adjusted as training proceeds; Φ is a trained visual perception network and Φ_l denotes one of its convolutional layers. The visual perception network is an image classification network trained on a large amount of data and able to classify the objects in an input image correctly; the publicly released VGG series networks, available from the official website, are commonly used. Trained with this perceptual loss function, the cascaded reconstruction network can produce more realistic reconstruction results.
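A sketch of such a perceptual loss built on torchvision's pre-trained VGG16 follows; the selected layers, the L1 feature distance, and the equal layer weights λ_l are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """Compare reconstruction and ground truth in VGG feature space."""
    def __init__(self, layer_ids=(3, 8, 15, 22), weights=None):
        super().__init__()
        vgg = torchvision.models.vgg16(
            weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # the perception network stays fixed
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.max_id = max(layer_ids)
        self.weights = weights or {i: 1.0 for i in layer_ids}  # the lambda_l

    def forward(self, reconstruction, target):
        loss, x, y = 0.0, reconstruction, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + self.weights[i] * nn.functional.l1_loss(x, y)
            if i >= self.max_id:
                break
        return loss
```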
The specific implementation scheme for training the super-resolution reconstruction network is as follows:
1. and constructing a network. The super-resolution reconstruction network in this embodiment is formed by cascading a series of reconstruction modules, and the structure of each cascaded module is consistent. The built reconstruction module consists of three layers of networks, wherein the first layer of network fuses input features, the second layer of network is a convolution layer, the size of a convolution kernel of the convolution layer is 3 multiplied by 3, and the convolution layer is provided with layer regularization and LRELU activation functions.
2. Initializing the weights of the network: the weights are initialized randomly.
3. Training the network. The function optimized during training is the perceptual loss. The specific training settings are: the total number of iterations is 200 epochs; the learning rate of the model is 10⁻⁴ and is halved every 100 epochs; and optimization uses the Adam algorithm with a momentum of 0.9 and a decay rate of 10⁻⁴. (A sketch of this schedule follows.)
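These settings map naturally onto an Adam optimizer with a step scheduler, as in the brief sketch below; mapping "momentum 0.9" onto Adam's beta1 is an assumption, and CascadedReconNet is the hypothetical module sketched earlier.

```python
import torch

recon_net = CascadedReconNet(layout_ch=1, feat_ch=64)  # from the sketch above
optimizer = torch.optim.Adam(recon_net.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
# Halve the learning rate every 100 epochs, for 200 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(200):
    # ... iterate over (layout, feature map, HR image) triples, minimizing
    #     the perceptual loss between recon_net(layout, feat) and the HR image ...
    scheduler.step()
```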
Step 5, performing super-resolution reconstruction with the trained networks. The specific implementation is as follows: the low-resolution picture to be reconstructed is input into the semantic segmentation network obtained in step 2 to obtain its semantic layout map and semantic feature map; these are then input into the super-resolution reconstruction network trained in step 4, finally yielding the reconstructed high-resolution image.
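Putting the pieces together, inference reduces to two forward passes, as sketched below with the hypothetical helpers defined earlier (extract_layout_and_features and the two network sketches).

```python
import torch

@torch.no_grad()
def super_resolve(seg_net, recon_net, lr_image):
    """lr_image: a (1, 3, H, W) tensor. Returns the reconstructed HR image."""
    seg_net.eval()
    recon_net.eval()
    # Step 2 network: semantic layout plus intermediate semantic feature maps.
    layout, feats = extract_layout_and_features(seg_net, lr_image)
    # Step 4 network: reconstruction guided by the layout and (here) the first feature map.
    return recon_net(layout, feats[0])
```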
To verify the technical effect of the invention, the Cityscapes urban scene data set is used for verification. The Cityscapes data set has 2975 high-resolution images with corresponding fine semantic maps. Of the 2975 images, 1000 are used to train the semantic segmentation network and the remaining 1975 to train the super-resolution reconstruction network. The methods used for comparison include bicubic interpolation (Bicubic); the super-resolution convolutional neural network SRCNN (C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295-307, 2016); the dense super-resolution network SRDenseNet (T. Tong, G. Li, X. Liu, and Q. Gao. Image super-resolution using dense skip connections. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 4809-4817); and the generative adversarial network SRGAN (C. Ledig, L. Theis, F. Huszár, J. Caballero, et al. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017).
Table 1 shows the objective and subjective evaluation indexes of each method at a scaling factor of 4, namely PSNR (peak signal-to-noise ratio), SSIM (structural similarity), and MOS (mean opinion score). As can be seen from Table 1, the method of the present invention achieves a stable improvement in the subjective quality of the restored images.
TABLE 1 Objective and subjective Scoring for each method
(Table 1 is reproduced as an image in the original publication; its PSNR, SSIM, and MOS values are not recoverable from the text.)
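For reference, the objective metrics in Table 1 can be computed with scikit-image as sketched below; the random arrays are placeholders for a ground-truth image and a reconstruction, and MOS, being a human rating, has no closed-form implementation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder arrays standing in for a ground-truth HR image and a reconstruction.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, (256, 512, 3), dtype=np.uint8)
sr = rng.integers(0, 256, (256, 512, 3), dtype=np.uint8)

psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```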
As shown in FIG. 4, the comparison shows that the details reconstructed by the method of the invention are more vivid and concrete than those of the other methods, that the results as a whole have a stronger sense of realism and visual persuasiveness, and that the subjective evaluation index improves considerably while the objective evaluation indexes remain essentially unchanged.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the embodiments is somewhat detailed and is not to be taken as limiting the scope of patent protection of the invention. Those of ordinary skill in the art may make substitutions and modifications under the teaching of the invention without departing from the scope of the appended claims, and such substitutions and modifications all fall within the protection scope of the invention.

Claims (8)

1. A super-resolution reconstruction method combining semantic segmentation is characterized by comprising the following steps:
step 1, constructing a low-resolution semantic segmentation data set, wherein the low-resolution semantic segmentation data set comprises a low-resolution image and a corresponding semantic layout;
step 2, training a semantic segmentation network by using a low-resolution semantic segmentation data set;
step 3, constructing a data set for training the super-resolution reconstruction network, wherein the data set for training the super-resolution reconstruction network comprises a semantic layout chart and a semantic feature chart of a low-resolution image and a corresponding high-resolution image, and the semantic layout chart and the semantic feature chart of the low-resolution image are obtained by inputting the low-resolution image into the semantic segmentation network trained in the step 2;
step 4, taking the semantic layout map and the semantic feature map as input, taking a high-resolution image corresponding to the semantic layout map as a true value, and training a super-resolution reconstruction network to output a corresponding high-resolution reconstruction result according to the input semantic layout map;
and step 5, inputting a low-resolution picture to be reconstructed into the semantic segmentation network trained in step 2 to obtain its semantic layout map and semantic feature map, inputting the semantic layout map and semantic feature map into the super-resolution reconstruction network trained in step 4, and finally obtaining the reconstructed high-resolution image.
2. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 1, wherein: the low-resolution semantic segmentation data set in the step 1 is obtained by down-sampling a high-resolution image and a semantic layout in a common semantic segmentation data set by the same scaling factor, and the obtained low-resolution image and the semantic layout form the low-resolution semantic segmentation data set.
3. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 1, wherein: the semantic segmentation network in step 2 is a fully convolutional network obtained by changing the fully connected layers in VGG16 into convolutional layers, and the specific network structure is: convolutional layer × 2 + pooling layer + convolutional layer × 3 + pooling layer + convolutional layer × 2 + deconvolution layer, wherein the convolution kernel size of the convolutional layers is 3 × 3 and the pooling layers use max pooling.
4. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 3, wherein: the weights of the fully convolutional network are initialized to the weights of a pre-trained VGG16; the loss function optimized during training is the sum of the deviations of the pixel-wise predictions of the last network layer; the specific training parameters are: a batch size of 20, optimization with the Adam algorithm using a momentum of 0.9 and a decay rate of 10⁻⁴, and a network learning rate of 10⁻⁴.
5. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 1, wherein: the super-resolution reconstruction network in step 4 is a cascaded reconstruction network composed of a series of cascaded reconstruction modules operating at increasing resolutions, wherein each reconstruction module consists of 3 network layers: the first layer is a feature fusion layer for fusing the input semantic layout map and semantic feature map with the output of the previous module; the latter two layers are convolutional layers with 3 × 3 kernels, layer normalization, and leaky rectified linear units (LReLU), which serve to reconstruct the fused features.
6. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 5, wherein: the specific operational relationship between reconstruction modules in a super-resolution reconstruction network is as follows,
the first reconstruction module takes the semantic layout map and the semantic feature map down-sampled to its resolution as input and outputs a result at that resolution, which can be regarded as a feature map obtained after fusion and convolution; each later reconstruction module takes the result of the previous module together with the correspondingly down-sampled semantic layout map and semantic feature map as input and outputs a new result; after several such stages, the final result output by the last reconstruction module is the super-resolution reconstruction result; the mathematical description of this process is:

O_1 = F(L_1 ⊕ f_1)

O_i = F(O_{i-1} ⊕ L_i ⊕ f_i), i > 1

wherein O_i represents the output of the i-th reconstruction module, F represents the convolution and other operations inside the module, L_i and f_i represent the semantic layout map and the semantic feature map down-sampled to the resolution of the i-th module, and ⊕ represents feature fusion.
7. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 5, wherein: the loss function used when training the super-resolution reconstruction network in step 4 is

Loss(θ) = Σ_l λ_l ‖Φ_l(I) - Φ_l(f(L; θ))‖

wherein I is the high-resolution image representing the true value, f is the cascaded reconstruction network to be trained, θ is the set of parameters of f, L is the input semantic layout, Φ is a trained visual perception network (a VGG network), Φ_l denotes a convolutional layer of the visual perception network, and λ_l are hyper-parameters controlling the weights, whose values are adjusted as training proceeds.
8. The super-resolution reconstruction method based on semantic segmentation as claimed in claim 5, wherein: when training the super-resolution reconstruction network, the specific settings are: the total number of iterations is 200 epochs; the learning rate of the model is 10⁻⁴ and is halved every 100 epochs; and optimization uses the Adam algorithm with a momentum of 0.9 and a decay rate of 10⁻⁴.
CN201910389111.9A 2019-05-10 2019-05-10 Super-resolution reconstruction method combining semantic segmentation Expired - Fee Related CN110136062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910389111.9A CN110136062B (en) 2019-05-10 2019-05-10 Super-resolution reconstruction method combining semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910389111.9A CN110136062B (en) 2019-05-10 2019-05-10 Super-resolution reconstruction method combining semantic segmentation

Publications (2)

Publication Number Publication Date
CN110136062A CN110136062A (en) 2019-08-16
CN110136062B true CN110136062B (en) 2020-11-03

Family

ID=67573253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910389111.9A Expired - Fee Related CN110136062B (en) 2019-05-10 2019-05-10 Super-resolution reconstruction method combining semantic segmentation

Country Status (1)

Country Link
CN (1) CN110136062B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570355B (en) * 2019-09-12 2020-09-01 杭州海睿博研科技有限公司 Multi-scale automatic focusing super-resolution processing system and method
CN110991485B (en) * 2019-11-07 2023-04-14 成都傅立叶电子科技有限公司 Performance evaluation method and system of target detection algorithm
CN111145202B (en) * 2019-12-31 2024-03-08 北京奇艺世纪科技有限公司 Model generation method, image processing method, device, equipment and storage medium
CN111461990B (en) * 2020-04-03 2022-03-18 华中科技大学 Method for realizing super-resolution imaging step by step based on deep learning
CN112288627B (en) * 2020-10-23 2022-07-05 武汉大学 Recognition-oriented low-resolution face image super-resolution method
CN112634160A (en) * 2020-12-25 2021-04-09 北京小米松果电子有限公司 Photographing method and device, terminal and storage medium
CN112546463B (en) * 2021-02-25 2021-06-01 四川大学 Radiotherapy dose automatic prediction method based on deep neural network
CN113160234B (en) * 2021-05-14 2021-12-14 太原理工大学 Unsupervised remote sensing image semantic segmentation method based on super-resolution and domain self-adaptation
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113781488A (en) * 2021-08-02 2021-12-10 横琴鲸准智慧医疗科技有限公司 Tongue picture image segmentation method, apparatus and medium
CN114782255B (en) * 2022-06-16 2022-09-02 武汉大学 Semantic-based noctilucent remote sensing image high-resolution reconstruction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335306A (en) * 2018-02-28 2018-07-27 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature
CN109064399A (en) * 2018-07-20 2018-12-21 广州视源电子科技股份有限公司 Image super-resolution rebuilding method and system, computer equipment and its storage medium
CN109191392A (en) * 2018-08-09 2019-01-11 复旦大学 A kind of image super-resolution reconstructing method of semantic segmentation driving

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10803378B2 (en) * 2017-03-15 2020-10-13 Samsung Electronics Co., Ltd System and method for designing efficient super resolution deep convolutional neural networks by cascade network training, cascade network trimming, and dilated convolutions
CN109087274B (en) * 2018-08-10 2020-11-06 哈尔滨工业大学 Electronic device defect detection method and device based on multi-dimensional fusion and semantic segmentation
CN109544555B (en) * 2018-11-26 2021-09-03 陕西师范大学 Tiny crack segmentation method based on generation type countermeasure network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335306A (en) * 2018-02-28 2018-07-27 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature
CN109064399A (en) * 2018-07-20 2018-12-21 广州视源电子科技股份有限公司 Image super-resolution rebuilding method and system, computer equipment and its storage medium
CN109191392A (en) * 2018-08-09 2019-01-11 复旦大学 A kind of image super-resolution reconstructing method of semantic segmentation driving

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fully Convolutional Networks for Semantic Segmentation; Jonathan Long et al.; CVPR 2015; 2015-12-31; pp. 3431-3440 *

Also Published As

Publication number Publication date
CN110136062A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
Wang et al. Deep video super-resolution using HR optical flow estimation
CN109741260B (en) Efficient super-resolution method based on depth back projection network
Wang et al. Multi-memory convolutional neural network for video super-resolution
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111709895A (en) Image blind deblurring method and system based on attention mechanism
CN111242846B (en) Fine-grained scale image super-resolution method based on non-local enhancement network
CN109035146B (en) Low-quality image super-resolution method based on deep learning
Chen et al. Cross parallax attention network for stereo image super-resolution
Chadha et al. iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN111861886B (en) Image super-resolution reconstruction method based on multi-scale feedback network
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
Singla et al. A review on Single Image Super Resolution techniques using generative adversarial network
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
US20230153946A1 (en) System and Method for Image Super-Resolution
Guan et al. Srdgan: learning the noise prior for super resolution with dual generative adversarial networks
Li et al. DLGSANet: lightweight dynamic local and global self-attention networks for image super-resolution
CN116563100A (en) Blind super-resolution reconstruction method based on kernel guided network
Esmaeilzehi et al. UPDResNN: A deep light-weight image upsampling and deblurring residual neural network
CN113962905A (en) Single image rain removing method based on multi-stage feature complementary network
CN113592715A (en) Super-resolution image reconstruction method for small sample image set
CN112598604A (en) Blind face restoration method and system
Yu et al. MagConv: Mask-guided convolution for image inpainting
Albluwi et al. Super-resolution on degraded low-resolution images using convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201103

Termination date: 20210510