CN112929664A

CN112929664A - Interpretable video compressed sensing reconstruction method

Info

Publication number: CN112929664A
Application number: CN202110082588.XA
Authority: CN
Inventors: 范益波; 黄博文
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2021-06-08

Abstract

The invention belongs to the technical field of video compressed sensing reconstruction, and particularly relates to an interpretable video compressed sensing reconstruction method. The method simulates the iteration process of the traditional algorithm by constructing the video compression sensing reconstruction neural network, and maps the traditional iteration optimization algorithm into the feedforward reasoning neural network; the video compressed sensing reconstruction neural network structure comprises a primary reconstruction module, a motion prediction module and a residual reconstruction module which are sequentially cascaded; the preliminary reconstruction module performs preliminary reconstruction on the signal subjected to compressed sensing sampling; the motion prediction module takes the adjacent frames which are stored in the cache and have completed the primary reconstruction as reference frames to carry out multi-hypothesis motion prediction; and the residual error reconstruction module reconstructs the difference between the sampling value and the network input sampling value to obtain a residual error reconstruction result. The invention can effectively improve the reconstruction effect of video compressed sensing and reduce the time required by reconstruction, thereby meeting the real-time reconstruction requirement of video signals.

Description

Interpretable video compressed sensing reconstruction method

Technical Field

The invention belongs to the technical field of video compressed sensing reconstruction, and particularly relates to an interpretable video compressed sensing reconstruction method.

Background

Conventional image or video compression methods, such as JPEG or H265, compress a signal after sampling it; the compressed sensing theory proposed by cans, Tao and Donoho in 2006 enables sensing and compression of signals to be performed simultaneously, i.e. only part of the signal is sampled and the original signal is restored by a reconstruction algorithm. It is theorized that if the original signal is sparse in some transform domains, it can be compressively sampled at a frequency lower than that required by nyquist's sampling law, and then restored by a specific reconstruction algorithm. Although the traditional compression method has higher compression rate and reconstruction quality at present, the characteristic of compressing and sensing to simultaneously perform sampling and compression makes the method have great application value in specific fields, such as medical image processing, high-speed photography and the like.

After the compressed sensing theory is proposed, the scholars propose various reconstruction algorithms to improve the reconstruction quality. The traditional algorithm usually completes reconstruction based on iterative optimization, and the aim of optimizing reconstruction quality is achieved by manually designing various prior conditions and transformation domains. The traditional algorithm based on iterative optimization has the advantages of clear algorithm concept, flexible processing size and the like, but the methods also have the defects of high calculation complexity, overlong calculation time, inaccurate prior condition and transform domain design and the like.

Thanks to the strong learning ability of the neural network, the neural network-based method is also widely applied to the compressed sensing reconstruction task. These methods typically use a network structure of feed forward reasoning, and a data-driven training method, thus greatly reducing computation time while improving reconstruction quality. However, the structure of these neural networks is often established by the experience of designers, and the algorithm concept is not clear, so that targeted optimization cannot be performed.

Therefore, if the two are combined, i.e. the traditional iterative algorithm is mapped to the neural network by the algorithm expansion method, the balance between the two can be well achieved.

Disclosure of Invention

The invention aims to provide an interpretable video compressed sensing reconstruction method, so as to effectively reduce the processing time of a video compressed sensing reconstruction task and improve the reconstruction quality.

The interpretable video compressed sensing reconstruction method provided by the invention simulates the iteration process of the traditional algorithm by constructing the video compressed sensing reconstruction neural network, and maps the traditional iterative optimization algorithm into the feedforward reasoning neural network; the video compressed sensing reconstruction neural network structurally comprises a primary reconstruction module, a motion prediction module and a residual reconstruction module which are sequentially cascaded; the specific reconstruction steps of the input signal are as follows:

(1) firstly, constructing a preliminary reconstruction module, and performing preliminary reconstruction on a signal subjected to compressed sensing sampling; the preliminary reconstruction module comprises a full connection layer, 5 convolution layers and an activation layer (the module parameters can be seen in the following table 1-1); the full connection layer is used for finishing signal size conversion, and the 5 convolution layers and the activation layer are used for finishing characteristic value extraction; in order to preserve the information obtained during reconstruction, the input-output size of each convolutional layer is identical to the original signal. The output signal length of the preliminary reconstruction module is consistent with the original signal.

(2) Then, a motion prediction module is constructed, and the adjacent frames which are stored in the cache and have been subjected to preliminary reconstruction are used as reference frames, and multi-hypothesis motion prediction is carried out through the motion prediction module; the structure of the motion prediction module includes a fully connected layer (the module parameters can be seen in tables 1-2 below); the motion prediction module prediction process is described by the following mathematical formula:

where the subscripts t, i denote the number of frames in the video sequence and the number of image blocks in the frames, x_t，iAnd

i.e. the t-th frame, the i-th image block and theCorresponding multi-reference prediction block, ω_t，iRepresenting K possible reference blocks in a reference frame, with dimension B²A matrix of xK, B being the size of the reference block, K being the number of reference blocks searched in the reference frame; h_t，iCoefficients representing a linear combination of the K reference blocks reflecting their contribution to the final motion predictor; h_t，iThe training learning is carried out, so that the optimal effect is ensured;

(3) then, a residual error reconstruction module is constructed, compressed sensing sampling is carried out on the motion prediction result again, and the motion prediction result is input into the residual error reconstruction module so as to further improve the prediction effect; the residual error reconstruction module reconstructs the difference between the sampling value and the network input sampling value to obtain a residual error reconstruction result, so that the reconstruction effect can be improved, and the training effect of the deep neural network is optimized; here, the residual reconstruction network includes a fully-connected layer for performing signal size transformation, 5 convolutional layers and an active layer (the module parameters can be seen in tables 1 to 3 below), and a plurality of convolutional layers and active layers for performing feature value extraction. In order to retain the information obtained in the reconstruction process, the input and output sizes of each convolution layer are consistent with those of the original signal;

(4) and finally, adding the output value of the residual error reconstruction module and the motion prediction output value, and averaging the output value and the preliminary reconstruction result to obtain a final output result.

The invention improves the quality of video compressed sensing reconstruction through multi-hypothesis motion estimation and residual reconstruction, the feedforward reasoning type neural network structure ensures that the reconstruction time is shorter, and the method of algorithm expansion is applied to ensure that the network structure can be obviously optimized for motion prediction.

Table 1-1: primary reconstruction Module construction parameters (cr configurable compression ratio parameter)

Tables 1-2: multi-reference motion prediction module structure parameter

Name of structure	Structural parameters
		Full connection layer	Inputting: 1024; and (3) outputting: 256

Tables 1 to 3: residual reconstruction module structure parameter (cr is a configurable compression ratio parameter)

Drawings

FIG. 1 is a schematic view of the overall process of the present invention.

Fig. 2 is a schematic diagram of the "preliminary reconstruction module".

Fig. 3 is a schematic diagram of the residual error reconstruction module.

Detailed Description

The invention will be further elucidated with reference to the schematic diagram 1.

The input of the present invention is the signal y after compressed perceptual sampling of one frame of the original video signal. After the network parameter weight is read from the model file, the input signal y is input into the network for reasoning, and the signal y is reconstructed by the first-stage module and then the reconstruction result x is output_out1Then to x_out1Resampling is carried out, a compressed sensing sampling matrix is the same as a matrix for sampling the original signal, and then the resampled signal y₁As input signal to the second stage module, and so onAnd repeating the steps for N times, wherein N is the number of modules in the network. Final output x_outI.e. the result of the reconstruction output of the network.

The model takes as input the compressed perceptual signal of an image block of 16x 16. Taking a 4-fold compression ratio as an example, after being measured by a compressed sensing matrix, the 16 × 16 image blocks are compressed into a signal with a length of 64 and input into a reconstruction network. In each module, the specific calculation steps for reconstructing the signal are as follows.

(1) First, an input signal with a length of 64 is primarily reconstructed by a primary reconstruction module. The preliminary reconstruction module outputs the input signal as a 256-length signal through a full link layer and adjusts it to a 16x16 two-dimensional signal. The adjusted signal passes through 5 convolutional layers and an active layer to extract characteristic values, and finally a preliminary reconstruction signal of 16x16 is output.

(2) Then, the motion prediction module reads the neighboring frame that has completed the reconstruction from the buffer as a reference frame. And the motion prediction module selects a region with the size of 32x32 as a reference block to be input into the motion prediction module by taking the corresponding position on the reference frame as the center according to the position of the current block to be reconstructed. The motion prediction module performs multi-hypothesis motion prediction through a full-link layer, namely, a reference block of 32x32 is divided into 4 sub-reference blocks of 16x16, and the 4 sub-reference blocks are linearly combined to be used as a motion prediction value, and finally the motion prediction value of 16x16 is output.

(3) Then, the motion prediction value is resampled by using the matrix same as the original compressed sensing measurement matrix, a resampled signal with the length of 64 is generated, and the resampled signal is subtracted from the original input signal, so that a residual signal is obtained. And the residual signal is sent to a residual reconstruction module, and the residual reconstruction module outputs the residual signal to be a signal with the length of 256 through a full connection layer and adjusts the signal to be a two-dimensional signal of 16x 16. Then 5 convolutional layers and activation functions are used for extracting characteristic values, and finally a residual reconstruction signal of 16x16 is output.

(4) And finally, adding the output value of the residual error reconstruction module and the output value of the motion prediction module to obtain an optimized predicted value, wherein the size of the optimized predicted value is 16x 16. And averaging the result with the output value of the preliminary reconstruction module, namely the sum of the two weights is 0.5, so as to obtain the final output value of the module, wherein the size of the final output value is 16x 16.

The final output result of the method is the reconstructed image block of the compressed sensing signal of the original image block, and experiments prove that the method can effectively complete the reconstruction task in a short time, and the final result has better image quality.

Claims

1. An interpretable video compressed sensing reconstruction method is characterized in that an iteration process of a traditional algorithm is simulated by constructing a video compressed sensing reconstruction neural network, and the traditional iterative optimization algorithm is mapped into a feedforward reasoning neural network; the video compressed sensing reconstruction neural network structurally comprises a primary reconstruction module, a motion prediction module and a residual reconstruction module which are sequentially cascaded; the specific reconstruction steps of the input signal are as follows:

(1) firstly, constructing a preliminary reconstruction module, and performing preliminary reconstruction on a signal subjected to compressed sensing sampling; the primary reconstruction module comprises a full connection layer, 5 convolution layers and an activation layer; the full connection layer is used for finishing signal size conversion, and the 5 convolution layers and the activation layer are used for finishing characteristic value extraction; in order to retain the information obtained in the reconstruction process, the input and output sizes of each convolution layer are consistent with those of the original signal; the length of an output signal of the preliminary reconstruction module is consistent with that of an original signal;

(2) then, a motion prediction module is constructed, and the adjacent frames which are stored in the cache and have been subjected to preliminary reconstruction are used as reference frames, and multi-hypothesis motion prediction is carried out through the motion prediction module; the structure of the motion prediction module comprises a full connection layer; the motion prediction module prediction process is described by the following mathematical formula:

i.e. the t-th frame, the i-th image block and its corresponding multi-reference motion prediction block, ω_t，iRepresenting K possible reference blocks in a reference frame, with dimension B²A matrix of xK, B being the size of the reference block, K being the number of reference blocks searched in the reference frame; h_t，iCoefficients representing a linear combination of the K reference blocks reflecting their contribution to the final motion predictor; h_t，iThe training learning is carried out, so that the optimal effect is ensured;

(3) then, a residual error reconstruction module is constructed, compressed sensing sampling is carried out on the motion prediction result again, and the motion prediction result is input into the residual error reconstruction module so as to further improve the prediction effect; the residual error reconstruction module reconstructs the difference between the sampling value and the network input sampling value to obtain a residual error reconstruction result, so that the reconstruction effect can be improved, and the training effect of the deep neural network is optimized; the residual error reconstruction network comprises a full connection layer, 5 convolution layers and an active layer, wherein the full connection layer is used for completing signal size transformation, and the convolution layers and the active layer are used for completing characteristic value extraction. In order to retain the information obtained in the reconstruction process, the input and output sizes of each convolution layer are consistent with those of the original signal;

(4) and finally, adding the output value of the residual error reconstruction module and the motion predicted value, and averaging the output value and the preliminary reconstruction result to obtain a final result.

2. The method according to claim 1, wherein the preliminary reconstruction module comprises a full-link layer, 5 convolutional layers and an active layer, and the specific structure parameters are as follows:

full connection layer: inputting: cr × 256, output: 256 of; cr is a configurable compression ratio parameter;

the convolutional layer 1: convolution kernel size: 1x1, step size: 1, filling: 0, input channel: 1, output channel: 128; activation function: a ReLU function;

and (3) convolutional layer 2: convolution kernel size: 1x1, step size: 1, filling: 0, input channel: 1, output channel: 128; activation function: a ReLU function;

and (3) convolutional layer: convolution kernel size: 3x3, step size: 1, filling: 1, input channel: 64, output channel: 32, a first step of removing the first layer; activation function: a ReLU function;

and (4) convolutional layer: convolution kernel size: 3x1, step size: 1, filling: 1, input channel: 32, output channel: 16; activation function: a ReLU function;

and (5) convolutional layer: convolution kernel size: 3x3, step size: 1, filling: 1, input channel: 16, output channel: 1.

3. the method of claim 1, wherein the parameters of the structure of the full-link layer in the structure of the motion prediction module are: inputting: 1024, outputting: 256.

4. the method of claim 1, wherein the residual reconstruction network comprises a full link layer, 5 convolutional layers and an active layer, and the specific structure parameters are as follows:

and (3) convolutional layer 2: convolution kernel size: 1x1, step size: 1, filling: 0, input channel: 128, output channel: 64; activation function: a ReLU function;

and (3) convolutional layer: convolution kernel size: 3x3, step size: 1, filling: 1, input channel: 64, output channel: 32, a first step of removing the first layer;

and (4) convolutional layer: convolution kernel size: 3x3, step size: 1, filling: 1, input channel: 32, output channel: 16;