CN116524048A - Natural image compressed sensing method based on potential diffusion model - Google Patents
Natural image compressed sensing method based on potential diffusion model
- Publication number: CN116524048A
- Application number: CN202310476361.2A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T9/002—Image coding using neural networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0475—Generative networks
- G06N3/048—Activation functions
- G06N3/094—Adversarial learning
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
Abstract
The invention belongs to the technical field of signal processing and specifically relates to a natural image compressed sensing method based on a latent diffusion model. The invention discloses a two-stage deep-learning image reconstruction method that exploits the discreteness and compressibility of natural images. In the first stage, a vector-quantized generative adversarial network trained on a large dataset learns deep features of images and produces discretized codes in a low-dimensional latent space, eliminating the redundant information of natural images. In the second stage, conditioned on the compressed measurements of an image, a diffusion model infers the corresponding latent code, from which the original image is reconstructed. Experimental results show that, compared with existing methods, the proposed image compressed sensing method greatly improves the visual quality of reconstructed images at low sampling rates.
Description
[ field of technology ]
The invention discloses a natural image compressed sensing method based on a latent diffusion model, and belongs to the technical field of signal processing.
[ background Art ]
Images are important carriers of information and underpin numerous applications in the big-data era. Conventional image acquisition systems typically sample every pixel in the field of view to form an original image and then apply complex compression algorithms to remove redundant information before storage and transmission. This demands substantial computing power and energy from the image sensor, which may be unsuitable for energy-constrained, low-cost sensing applications such as large-scale wireless Internet of Things deployments.
Compressed sensing (CS) breaks the limitation of the Nyquist-Shannon sampling theorem and enables reconstruction of a signal from a small number of measurements. Because sampling and compression are combined into a single linear projection that directly yields the measurements, CS reduces both the computational requirements and the energy consumption of the sensor, making it an efficient image acquisition paradigm. Image CS is now widely used in fields such as MRI, single-pixel cameras, fast imaging, and holographic imaging.
The CS procedure can be expressed as:

y = Φx

where x ∈ R^N is the original image, y ∈ R^M is the measurement vector, and Φ ∈ R^{M×N} (M ≪ N) is the measurement matrix. CS aims to use the given y and Φ to solve this underdetermined linear system and obtain an approximate solution for the original image x.
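As a hedged numerical sketch of the measurement model y = Φx above (the sizes N and M are illustrative assumptions, not values taken from the patent):

```python
import numpy as np

# Sketch of CS measurement: project an N-dim signal to M << N measurements
# with a Gaussian measurement matrix. Sizes here are assumed for illustration.
rng = np.random.default_rng(0)
N, M = 3072, 307                                 # e.g. ~0.1 sampling rate
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # measurement matrix
x = rng.standard_normal(N)                       # stand-in for a vectorized image
y = Phi @ x                                      # compressed measurements
```

Reconstruction then means recovering an approximation of `x` from `y` and `Phi`, which is underdetermined without a prior.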
From the perspective of probability theory, the reconstruction algorithm aims to determine x by maximizing the posterior probability p(x|y). According to Bayes' theorem:

p(x|y) = p(y|x) p(x) / p(y)

where, for a given y, p(y) is a constant. The likelihood p(y|x) is determined by the observation model and is typically handled by minimizing a fidelity term. The prior p(x) derives from prior knowledge of the image and can be handled by constructing regularization terms; common image priors include sparsity in a transform domain, low rank, and non-local self-similarity. Bayesian algorithms solve the CS problem by embedding prior knowledge into likelihood inference, and the main research difficulty is choosing an appropriate prior.
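The fidelity-plus-regularizer view can be illustrated with a minimal sketch, assuming a simple l2 regularizer standing in for a real image prior and plain gradient descent (the patent's learned modules are not modeled here):

```python
import numpy as np

# Minimize ||y - Phi x||^2 / 2 + (lam/2)||x||^2 by gradient descent.
# The l2 term is a placeholder for the image priors named in the text.
rng = np.random.default_rng(0)
N, M, lam, lr = 64, 32, 0.1, 0.01
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = rng.standard_normal(N)
y = Phi @ x_true

x = np.zeros(N)
for _ in range(5000):
    grad = Phi.T @ (Phi @ x - y) + lam * x       # fidelity + prior gradients
    x -= lr * grad
# x converges to the regularized least-squares solution
# (Phi^T Phi + lam I)^{-1} Phi^T y, not to x_true exactly.
```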
In recent years, researchers have proposed deep-learning-based algorithms to solve the CS reconstruction problem. In contrast to conventional algorithms, these methods automatically learn image priors through neural networks and directly model the mapping from measurements to images by training on datasets, achieving higher accuracy and faster inference. Furthermore, in deep CS algorithms the sampling matrix is learned rather than designed manually, which better suits specific applications. These algorithms fall into two broad categories: data-driven algorithms and deep unfolding algorithms. The former learn an image prior by stacking convolution layers and are black-box models; the latter design networks to replace certain modules in model-based approaches, such as sparse transforms and denoisers.
Purely data-driven algorithms cannot explain the reconstruction process, and convolution layers focus on local image features while ignoring global information. Deep unfolding algorithms still rely in essence on known image priors, and finding more suitable prior information remains an open problem. Most existing image CS reconstruction methods aim to improve peak signal-to-noise ratio, but this metric can disagree with human judgment; a major drawback is that optimizing such indices tends to produce over-smoothed reconstructions. How to improve the perceptual quality of reconstructed images therefore remains an open question.
Meanwhile, diffusion models have shown strong capability in image generation. By defining a Markov chain of diffusion steps, they transform an unknown data distribution into a known one, e.g. a standard Gaussian, which is equivalent to modeling p(x). The latent diffusion model (Latent Diffusion Model, LDM) introduces an autoencoder and runs the diffusion model in a low-dimensional latent space, which removes redundant information from the image, improves computational efficiency, and is better suited to likelihood-based generative modeling.
[ invention ]
The invention aims to apply a latent diffusion model to the compressed sensing reconstruction of natural images and provides a natural image compressed sensing method based on a latent diffusion model.
The invention uses the measurement y as the condition guiding the generative process of the LDM to obtain a high-quality reconstructed image x̂.
Assume the original image observed in the CS procedure is x, its latent code is z, and the observed measurement is y. The goal of the invention is to train a parameterized reconstruction network approximating p(x|y). Introducing the latent code z, the posterior can be decomposed as:

p(x|y) ∝ p(y|x) p(x|z) p(z|y)

For a given measurement y, p(y) is constant, because the measurement is related only to the original image and the sampling matrix. The reconstruction process is decomposed into the following three sub-problems:
(1) Maximizing p(y|x): under the linear observation model, y is determined by x, i.e. y = Φx. The invention addresses this likelihood-estimation problem with a parameterized sampling module;
(2) Maximizing p(x|z): this is modeled by the decoder of a pre-trained vector-quantized generative adversarial network.
(3) Maximizing p(z|y): this is the focus of the invention, solved by training a conditional LDM.
The invention solves the problem of deep compressed sensing reconstruction of natural images in energy-limited systems through the following technical scheme:
the present invention uses a block compressed sensing based learnable sampling matrix to establish a linear mapping from image space to measurement space. Specifically, the natural image is first divided into sizesNon-overlapping blocks with channel 3, giveAn input. Then use the operationExpanding each block intoDimension vector and use of a learnable measurement matrixSampling to obtainAnd (5) measuring values. For example, for. At the time of the sampling rate of 0.1,the sampling module will giveAnd outputs. The sampling module is expressed as:
The reconstruction module of the present invention includes three sub-modules, namely, conditional encoding, latent variable reconstruction, and decoding.
In an LDM, to allow the concatenation operation, the condition is always resized to the same spatial size as the latent variable, although the number of channels may differ. The invention uses a learnable initial reconstruction matrix and a downsampling network to encode the measurements into a condition usable by the LDM. First, an initial reconstruction matrix R maps the measurements to initial reconstruction vectors; then a reshape operation remolds the vectors into an image; finally, a downsampling network encodes the initial reconstructed image into the condition required by the LDM. Specifically, assume the image has size H×W×3 and the latent variable has size h×w×c. For the measurement y of each sampled block, the initial reconstruction matrix R ∈ R^{3B²×M} first maps the measurement to a 3B²-dimensional vector; the reshape operation then remolds the reconstructed vectors into an H×W×3 image; finally a convolutional downsampling network E encodes the image into a condition of size h×w×c'. The conditional-encoding submodule is expressed as:

c = E(reshape(R y))
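A hedged sketch of this conditional-encoding submodule, assuming a 96×96 image assembled from a 3×3 grid of 32×32 blocks and with simple average pooling standing in for the learned downsampling network (all names and sizes are illustrative assumptions):

```python
import numpy as np

# Map block measurements back to blocks with a reconstruction matrix R,
# reassemble the blocks into an image, then downsample to the latent size.
def encode_condition(y, R, B=32, grid=(3, 3), pool=4):
    blocks = (R @ y.T).T.reshape(grid[0], grid[1], B, B, 3)
    img = blocks.transpose(0, 2, 1, 3, 4).reshape(grid[0] * B, grid[1] * B, 3)
    H, W, C = img.shape
    # average-pool by `pool` in each spatial dimension to reach the latent size
    return img.reshape(H // pool, pool, W // pool, pool, C).mean(axis=(1, 3))

rng = np.random.default_rng(0)
M, B = 307, 32
R = rng.standard_normal((3 * B * B, M)) / M       # initial reconstruction matrix
y = rng.standard_normal((9, M))                   # 9 block measurements
cond = encode_condition(y, R)                     # 24x24x3 condition map
```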
The LDM module aims to predict the latent code of the original image from the condition c using a diffusion process. Assume the latent code of the original image x is z₀. The forward diffusion process repeatedly adds noise to z₀ to obtain a sequence z₁, …, z_T. For any step t:

z_t = √(ᾱ_t) z₀ + √(1 − ᾱ_t) ε, ε ∼ N(0, I)

where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. The latent-variable reconstruction module can be described as a conditional reverse diffusion process:

p_θ(z_{t−1} | z_t, c)

Specifically, the invention predicts the noise added at each step with a UNet equipped with attention mechanisms and denoises T times to obtain ẑ₀. The focus of the reconstruction is predicting the noise of each step. The invention uses a UNet with cross-attention to model the conditional time-step denoising autoencoder ε_θ. At step t, the concatenation of z_t and c is used as the UNet input for the current step, so the input has shape h×w×(c + c'). The UNet obtains feature maps of different sizes through 4 downsampling stages and preserves details through skip connections between same-size stages. After sufficient training, the reconstruction module takes random noise z_T and the condition c as input and predicts the latent code of the original image:

ẑ₀ = LDM(z_T, c), z_T ∼ N(0, I)
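The forward noising step above can be sketched directly; the noise schedule below is a common linear assumption, not one specified by the patent:

```python
import numpy as np

# Forward diffusion: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps,
# with abar_t the cumulative product of alpha_t = 1 - beta_t.
rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)          # linear schedule (assumed)
alpha_bar = np.cumprod(1.0 - beta)

z0 = rng.standard_normal((8, 8, 4))        # toy latent code
eps = rng.standard_normal(z0.shape)
t = T - 1
zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
# At t = T-1, alpha_bar[t] is near zero, so z_T is approximately
# standard Gaussian, as the text requires.
```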
the present invention uses a pre-trained decoderThe original image is reconstructed. Vector quantization generates discretized encoding of potential features of the image against the network, which better conforms to the modality of nature. The decoder will firstIs replaced with the nearest code in the code table. Then reconstructed using a decoder:
the invention relates to an end-to-end CS structure which can measureAn image is reconstructed for the input. The sampling module and the reconstruction module are combined in the training processAnd (5) optimizing. All the parameters are set as. The training goal is to minimize the denoising score matching loss:
[ advantages and positive effects of the invention ]
Compared with the prior art, the invention has the following advantages and positive effects:
First, the invention designs a sampling paradigm and a corresponding encoding module for perceptually enhanced CS. Unlike conventional methods that use a manually designed sampling matrix, the sampling matrix of the invention is jointly optimized with the reconstruction network. In addition, the invention designs an initialization encoding network that encodes the measurements into conditions to guide the latent diffusion model in reconstructing the low-dimensional latent code of the original image;
Second, the invention designs a CS reconstruction model based on latent diffusion and explains its working principle. The image reconstruction work is transferred from the complex image space to a low-dimensional latent space; a conditional diffusion model simultaneously addresses the likelihood-estimation and data-prior problems in the Bayesian CS formulation and reconstructs the low-dimensional representation of the image;
Third, the invention provides an end-to-end two-stage image CS architecture. The architecture uses a pre-trained autoencoder to build a mapping from image space to a low-dimensional latent space, uses a diffusion model to derive latent vectors conditioned on the measurements, and reconstructs the image. Experiments show that it reconstructs high-quality images at low sampling rates while retaining key details.
[ description of the drawings ]
Fig. 1 is a flow chart of the deep compressed sensing image reconstruction method applied to a wireless sensor network;
FIG. 2 compares reconstructed samples of a natural image at a sampling rate of 0.04 between the invention and existing advanced algorithms;
[ detailed description ] of the invention
In order to make the above embodiments and advantages of the invention more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and examples.
(1) The sampling module first cuts the original color natural image x into blocks of size B×B, each image block containing 3B² inputs, and flattens each block into a vector; a learnable measurement matrix Φ ∈ R^{M×3B²} then projects each 3B²-dimensional input into an M-dimensional measurement y for transmission and storage. During training, the measurement matrix Φ is initialized as a Gaussian random matrix;
(2) The initial encoding module first uses a learnable reconstruction matrix R to project the measurement y into a 3B²-dimensional vector, which is reshaped and stitched into an initial reconstructed image; the initial reconstruction is then encoded by a learnable downsampling network into the condition c that can be input to the diffusion module. The downsampling network uses two 2×2 average-pooling layers to obtain a condition with compression ratio 4 and 5 channels, and at each scale the feature maps are processed by two convolution layers with ReLU activation functions, the feature maps having 64 channels;
(3) The latent diffusion module comprises a forward process and a backward process. The forward process first uses the encoder of a pre-trained vector-quantized generative adversarial network to obtain the latent code z₀ of the original image, then adds noise of increasing intensity to obtain a series of noisy codes {z_t}, where z_T approaches standard Gaussian noise. The backward process takes random noise z_T and the condition c as input, predicts step by step with a UNet the noise added to the latent code, and finally recovers the latent code ẑ₀ of the original image. The UNet has 4 levels; the feature maps of the levels have 160, 320, 640, and 1280 channels, respectively, and an attention mechanism is used on the smallest-size feature maps;
(4) Steps (1)-(3) are iteratively optimized using gradient descent to obtain the optimal sampling matrix Φ, reconstruction matrix R, and the parameters of the initial encoding network and the UNet; the loss function is:

L(θ) = E_{z₀, ε, t} [ ‖ε − ε_θ(z_t, t, c)‖² ]
(5) After training is completed, the inference process of the LDM is accelerated with denoising diffusion implicit models (Denoising Diffusion Implicit Models, DDIM), taking the measurement y as input and obtaining the predicted code ẑ₀;
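A hedged sketch of one deterministic DDIM update (the eta = 0 variant), showing how a noise estimate lets inference jump from step t to an earlier step s without fresh noise; the alpha-bar values and the use of the exact noise are illustrative assumptions:

```python
import numpy as np

# One deterministic DDIM step: predict z_0 from z_t and the noise estimate,
# then re-noise to the earlier step s.
def ddim_step(zt, eps_hat, a_t, a_s):
    z0_hat = (zt - np.sqrt(1 - a_t) * eps_hat) / np.sqrt(a_t)  # predicted z_0
    return np.sqrt(a_s) * z0_hat + np.sqrt(1 - a_s) * eps_hat

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 8, 4))
eps = rng.standard_normal(z0.shape)
a_t, a_s = 0.5, 0.9                        # cumulative alphas at steps t > s
zt = np.sqrt(a_t) * z0 + np.sqrt(1 - a_t) * eps
zs = ddim_step(zt, eps, a_t, a_s)
# With the exact noise, the update lands on the exact forward sample at s.
```

In practice the network's noise prediction replaces `eps`, and a short sequence of such jumps replaces the full T-step reverse chain, which is the acceleration the text refers to.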
(6) The decoding module is the decoder of the pre-trained vector-quantized generative adversarial network. Taking the predicted code ẑ₀ of the latent diffusion module as input, it first finds for each vector of ẑ₀ the closest code in the pre-trained codebook and replaces the corresponding element with it, then performs several convolutions and upsamplings to obtain the reconstructed image x̂. The decoder consists of two upsampling networks with magnification 2 and an output layer: the first upsampling network comprises two residual blocks with 128 channels and a bilinear interpolation operation, the second comprises two residual blocks with 64 channels and a bilinear interpolation operation, and the output layer is a single-layer convolutional neural network with 64 input channels and 3 output channels.
The hardware configuration of the simulation experiments is: Intel(R) Core(TM) i5-13600KF CPU; 32.0 GB DDR4 memory; NVIDIA RTX 3090 GPU.
The software configuration of the simulation experiments is: Ubuntu 20.04 operating system, Python as the simulation language, and PyTorch 1.7.0 as the software library.
In the simulation experiments, the training set was constructed from BSD500. All images were randomly cropped into patches of a fixed size, and the applied data augmentations include random rotation and horizontal and vertical flipping. The resulting training set contains 10000 image patches. The following widely used natural image test sets were used: Set14, BSD100, Urban100, and DIV2K. The proposed natural image compressed sensing method based on the latent diffusion model was compared with existing advanced algorithms, including ISTA-Net, OPINE-Net, AMP-Net, TransCS, and TCS-Net. The comparison metric is Learned Perceptual Image Patch Similarity (LPIPS), a widely used low-level-vision metric that matches human perception better than peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR) and structural similarity (Structural Similarity, SSIM). Lower LPIPS indicates better reconstruction.
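LPIPS itself requires a pre-trained network, so as a hedged illustration here is only the PSNR baseline that the text contrasts it with; a high PSNR (low MSE) can still favor over-smoothed reconstructions:

```python
import numpy as np

# PSNR in dB for images with the given peak value.
def psnr(ref, rec, peak=1.0):
    mse = np.mean((ref - rec) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

ref = np.zeros((16, 16))
rec = ref + 0.1                 # uniform error of 0.1 -> MSE = 0.01
# psnr(ref, rec) evaluates to 20.0 dB
```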
TABLE 1 Comparison of LPIPS between the invention and existing advanced methods at different sampling rates
In the experiments the sampling rates were set to 0.01, 0.04, and 0.1. Table 1 shows the average LPIPS of reconstructed images on the different data sets for the invention and existing advanced methods. The invention outperforms the comparison methods on all data sets and all sampling rates in the table, and its advantage grows as the sampling rate decreases. For example, on the Urban100 dataset at a sampling rate of 0.1, the LPIPS of the invention is 0.0815, which is 0.0393 lower than the second-best TransCS; when the sampling rate is reduced to 0.01, the LPIPS of the invention is 0.1874, which is 0.1465 lower than the second-best TCS-Net. The invention is therefore better suited to image compressed sensing at low sampling rates and has strong application prospects.
Fig. 2 shows reconstructed samples of the Lena image at a sampling rate of 0.04 for the invention and the comparison methods. The reconstruction of the invention is clearer, contains richer details, and has better visual quality.
Claims (1)
1. A natural image compressed sensing method based on a latent diffusion model, comprising a sampling module, an initial encoding module, a latent diffusion module and a decoding module, characterized in that:
(1) The sampling module first cuts the original color natural image x into blocks of size B×B, each image block containing 3B² inputs, and flattens each block into a vector; a learnable measurement matrix Φ ∈ R^{M×3B²} then projects each 3B²-dimensional input into an M-dimensional measurement y for transmission and storage, the measurement matrix Φ being initialized as a Gaussian random matrix during training;
(2) The initial encoding module first uses a learnable reconstruction matrix R to project the measurement y into a 3B²-dimensional vector, which is reshaped and stitched into an initial reconstructed image; a learnable downsampling network then encodes the initial reconstruction into the condition c that can be input to the diffusion module, wherein the downsampling network uses two 2×2 average-pooling layers to obtain a condition with compression ratio 4 and 5 channels, and at each scale two convolution layers with ReLU activation functions process the feature maps, the feature maps having 64 channels;
(3) The latent diffusion module comprises a forward process and a backward process: the forward process first uses the encoder of a pre-trained vector-quantized generative adversarial network to obtain the latent code z₀ of the original image, then adds noise of increasing intensity to obtain a series of noisy codes {z_t}, where z_T approaches standard Gaussian noise; the backward process takes random noise z_T and the condition c as input, predicts step by step with a UNet the noise added to the latent code, and finally recovers the latent code ẑ₀ of the original image, wherein the UNet has 4 levels whose feature maps have 160, 320, 640 and 1280 channels, respectively, and an attention mechanism is used on the smallest-size feature maps;
(4) The decoding module is the decoder of the pre-trained vector-quantized generative adversarial network; taking the predicted code ẑ₀ of the latent diffusion module as input, it first finds for each vector of ẑ₀ the closest code in the pre-trained codebook and replaces the corresponding element with it, then performs several convolutions and upsamplings to obtain the reconstructed image x̂, wherein the decoder consists of two upsampling networks with magnification 2 and an output layer, the first upsampling network comprising two residual blocks with 128 channels and a bilinear interpolation operation, the second comprising two residual blocks with 64 channels and a bilinear interpolation operation, and the output layer being a single-layer convolutional neural network with 64 input channels and 3 output channels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310476361.2A CN116524048A (en) | 2023-04-28 | 2023-04-28 | Natural image compressed sensing method based on potential diffusion model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310476361.2A CN116524048A (en) | 2023-04-28 | 2023-04-28 | Natural image compressed sensing method based on potential diffusion model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524048A true CN116524048A (en) | 2023-08-01 |
Family
ID=87397093
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310476361.2A Pending CN116524048A (en) | 2023-04-28 | 2023-04-28 | Natural image compressed sensing method based on potential diffusion model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524048A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116980541A (en) * | 2023-09-22 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Video editing method, device, electronic equipment and storage medium
CN116980541B (en) * | 2023-09-22 | 2023-12-08 | 腾讯科技(深圳)有限公司 | Video editing method, device, electronic equipment and storage medium
CN117437152A (en) * | 2023-12-21 | 2024-01-23 | 之江实验室 | PET iterative reconstruction method and system based on diffusion model
CN117437152B (en) * | 2023-12-21 | 2024-04-02 | 之江实验室 | PET iterative reconstruction method and system based on diffusion model
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |