CN109168002B - Video signal measurement domain estimation method based on compressed sensing and convolutional neural network - Google Patents


Info

Publication number
CN109168002B
CN109168002B (application CN201810831091.1A)
Authority
CN
China
Prior art keywords: estimated, convolutional neural network, measurement, macroblock
Prior art date
Legal status: Active
Application number
CN201810831091.1A
Other languages
Chinese (zh)
Other versions
CN109168002A (en)
Inventor
Guo Jie (郭洁)
Lyu Junmei (吕军梅)
Song Bin (宋彬)
Yao Jipeng (姚继鹏)
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810831091.1A priority Critical patent/CN109168002B/en
Publication of CN109168002A publication Critical patent/CN109168002A/en
Application granted granted Critical
Publication of CN109168002B publication Critical patent/CN109168002B/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video signal measurement-domain estimation method based on compressed sensing and a convolutional neural network, which comprises the following steps: dividing reference frame image data into a plurality of macroblocks; randomly selecting a macroblock to be estimated, and selecting the four macroblocks adjacent to it in a first direction and a second direction; calculating the pixel value and the true measurement value of the macroblock to be estimated from the four adjacent macroblocks; constructing a convolutional neural network model and calculating a predicted measurement value of the macroblock to be estimated; and training the convolutional neural network model on the true and predicted measurement values, obtaining the optimal parameters when the loss function at the output layer of the model falls below a preset threshold. The invention overcomes the inability of the prior art to balance computational complexity and model robustness, can quickly analyze the temporal correlation of video frames, and still achieves real-time, accurate analysis of macroblocks at arbitrary positions even at very low measurement rates and under arbitrarily changing motion vectors.

Description

Video signal measurement domain estimation method based on compressed sensing and convolutional neural network
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video signal measurement domain estimation method based on compressed sensing and a convolutional neural network.
Background
With the rapid development of information technology, the demands placed on multimedia information such as images and video keep growing, as does the pressure on signal acquisition equipment such as cameras and video cameras. The conventional Nyquist sampling theorem states that, to recover an analog signal without distortion, the sampling frequency must be no less than twice the maximum frequency of the signal. Digital signals obtained in this way have a huge data volume and substantial information redundancy, which hinders storage and transmission. Sampling techniques based on compressed sensing (CS) theory lower the required sampling frequency, enable information acquisition with low power consumption and low complexity, and support more effective acquisition, transmission, storage, and processing of information.
To process video efficiently, a compressed video sensing (CVS) codec system based on CS theory has been proposed. It provides an efficient, low-complexity way of processing video information. However, in a CVS system the original pixel-domain signal is converted to the measurement domain after a single sampling pass, and since only the measured values of the blocks are available, the temporal correlation between video frames is difficult to obtain accurately; motion information of the video sequence therefore cannot be obtained directly in the measurement domain. Motion information between frames allows the decoding end to recover the original video signal well while reducing the amount of data the encoding end must transmit, thereby further improving compressed-sampling efficiency.
On the other hand, many tools have emerged for processing visual information, speech, and natural language, and deep learning is widely used as a highly effective one. Among deep learning models, the convolutional neural network (CNN) is the most representative and is well suited to processing image and video data. In 1959, Hubel and Wiesel discovered that cells in the animal visual cortex are responsible for detecting optical signals. In the 1990s, LeCun et al. published the papers that established the modern structure of the CNN. CNNs achieved a striking result in the 2012 ImageNet competition, which directly established their importance. Through continuous improvement of structures and algorithms, the strong feature-extraction and fitting capability of CNNs has led to their wide use in many fields.
In recent years, many researchers have sought effective combinations of CS and CNNs to address the challenges posed by CVS. In the "CS-CNN" method proposed by Y. Shen, T. Han et al., the CS sampling and reconstruction process is implemented directly with a CNN structure, which increases computational complexity and ignores the temporal correlation between video frames. In addition, the article "Estimation of measurement for block-based compressed video sensing" proposes a macroblock-based CVS measurement-domain estimation model; its disadvantages are that the pseudo-inverse of the measurement matrix is approximated through singular value decomposition, the model's robustness is low, and the estimation error becomes very large when a Gaussian matrix is used as the measurement matrix.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a video signal measurement domain estimation method based on compressed sensing and a convolutional neural network. The technical problem to be solved by the invention is realized by the following technical scheme:
the embodiment of the invention provides a video signal measurement domain estimation method based on compressed sensing and a convolutional neural network, which comprises the following steps:
dividing a video sequence into a plurality of image groups, wherein each image group comprises at least two frames of image data, selecting a first frame of each image group as a reference frame, and dividing the reference frame image data into a plurality of macro blocks;
selecting a macroblock to be estimated at an arbitrary position in the reference frame image data, and selecting the four macroblocks adjacent to the macroblock to be estimated in a first direction and a second direction;
calculating the pixel value and the real measured value of the macro block to be estimated according to the four adjacent macro blocks;
constructing a convolutional neural network model, and calculating a predicted measurement value of a macro block to be estimated;
and training the convolutional neural network model according to the real measured value and the predicted measured value, and obtaining an optimal parameter when the loss function of the output layer of the convolutional neural network model is lower than a preset threshold value.
In an embodiment of the present invention, the calculating the pixel value of the macroblock to be estimated includes:
and carrying out modeling operation on the macro block to be estimated and the four adjacent macro blocks to obtain the pixel value of the macro block to be estimated.
In one embodiment of the present invention, the formula of the modeling operation is:
x(B') = Σ_{i=1}^{4} Γ_i x(B_i)

where x(B') represents the pixel value of the macroblock to be estimated, Σ represents the summation operation, Γ_i represents the positional relationship matrix of the four adjacent macroblocks, and x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4.
In an embodiment of the present invention, the calculating the true measurement value of the macroblock to be estimated includes:
calculating the measured values of four adjacent macro blocks according to a preset rule;
and calculating the real measured value of the macroblock to be estimated according to the measured values of the four adjacent macroblocks.
In an embodiment of the present invention, the preset rule is:
y(B_i) = Φx(B_i)
where y(B_i) denotes the measured value of a macroblock, x(B_i) denotes the pixel value of a macroblock, i denotes the index of the macroblock, and Φ denotes the measurement matrix.
In an embodiment of the present invention, the actual measurement values of the macroblock to be estimated are:
y_true = Σ_{i=1}^{4} Λ_i y(B_i) = Σ_{i=1}^{4} Λ_i Φx(B_i)

where y_true represents the true measurement value of the macroblock to be estimated, Σ represents the summation operation, Λ_i represents a weighting-coefficient matrix determined by the motion vector and the measurement matrix, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix.
In one embodiment of the present invention, the convolutional neural network model includes: four convolutional neural networks and one perceptron layer.
In an embodiment of the present invention, the prediction measurement values of the macroblock to be estimated are:
y_pred = Σ_{i=1}^{4} w_i f_i(ΦΓ_i) y(B_i)

where y_pred represents the predicted measurement value of the macroblock to be estimated, Σ represents the summation operation, w_i represents the weights of the four adjacent macroblocks, f_i represents the convolutional neural network corresponding to each adjacent macroblock, ΦΓ_i represents the input of that convolutional neural network, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix.
In one embodiment of the present invention, the loss function is measurement domain dependent noise, and the expression is:
J = ||y_pred - y_true||_2 / ||y_true||_2

where J denotes the measurement-domain correlated noise, y_true represents the true measurement value of the macroblock to be estimated, and y_pred represents the predicted measurement value of the macroblock to be estimated.
In one embodiment of the present invention, the preset threshold is 0.003.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention introduces a convolutional neural network into the video compressed sensing system and uses it to train the pseudo-inverse of the measurement matrix and the block weights. This overcomes the inability of the prior art to balance computational complexity and model robustness; the accuracy and real-time performance of the method are superior to the prior art, accurate measurement-domain estimation of macroblocks at arbitrary positions is achieved, and the method can be used to quickly analyze the temporal correlation of video frames.
2. The method is based on supervised training on a labeled data set. Leveraging the advantages of the convolutional neural network structure, it fully exploits properties of the data such as locality through downsampling and guarantees a degree of invariance to displacement and deformation, so that the model acquires strong generalization capability. The invention overcomes the difficulty that, in the prior art, measurement-rate requirements make multimedia information processing hard under limited resources; even at very low measurement rates and under arbitrarily changing motion vectors, it still achieves accurate, real-time analysis of macroblocks at arbitrary positions and can effectively process image and video information in resource-limited environments.
Drawings
FIG. 1 is a flow chart of a video signal measurement domain estimation method based on compressive sensing and convolutional neural networks according to an embodiment of the present invention;
FIG. 2 is a decomposition diagram of the relative position of the macroblock to be estimated according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a convolutional neural network structure provided in an embodiment of the present invention;
FIG. 4 is an overall framework diagram of a convolutional neural network model provided by an embodiment of the present invention;
FIG. 5 is a comparison of the correlated noise for simulation experiment 1 provided by an embodiment of the present invention;
FIG. 6 is a comparison of the correlated noise and time complexity for simulation experiment 2 provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a flowchart of a video signal measurement domain estimation method based on compressive sensing and convolutional neural network according to an embodiment of the present invention.
The invention provides a video signal measurement domain estimation method based on compressed sensing and a convolutional neural network, which comprises the following steps:
the method comprises the steps of dividing a video sequence into a plurality of image groups, wherein each image group comprises at least two frames of image data, selecting a first frame of each image group as a reference frame, and dividing the reference frame image data into a plurality of macro blocks.
In order to collect enough samples, the first 15 frames of the video sequence are selected for analysis and modeling, and every two frames of image data form an image group.
Firstly, the reference frame image data is divided into n two-dimensional macroblocks, and a two-dimensional to one-dimensional conversion is performed to obtain a column vector x_i for each macroblock, where the macroblocks all have the same size and do not overlap, i denotes the index of a macroblock, and n is an integer greater than 1. The embodiment of the invention divides each frame image into fixed macroblocks of size 8 × 8.
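The blocking step above can be illustrated with a minimal NumPy sketch (the function name and the use of a CIF-sized frame are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def divide_into_macroblocks(frame, block_size=8):
    """Divide a 2-D frame into non-overlapping block_size x block_size
    macroblocks and vectorize each into a column vector x_i."""
    h, w = frame.shape
    assert h % block_size == 0 and w % block_size == 0
    columns = []
    for r in range(0, h, block_size):
        for c in range(0, w, block_size):
            # two-dimensional to one-dimensional conversion of one macroblock
            columns.append(frame[r:r + block_size, c:c + block_size].reshape(-1))
    return np.stack(columns, axis=1)   # shape (block_size**2, n)

# Example: a CIF luma frame is 352 x 288 pixels, giving n = 44 * 36 = 1584 blocks
frame = np.random.randint(0, 256, size=(288, 352)).astype(np.float64)
X = divide_into_macroblocks(frame)     # shape (64, 1584); column i is x_i
```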
Next, a macroblock to be estimated is selected at an arbitrary position in the reference frame image data, and the four macroblocks adjacent to it in the first direction and the second direction are selected.
The decomposition of the macroblock vector to be estimated is further described with reference to fig. 2:
In fig. 2, assuming that the motion vector MV is (m, n), the pixel value of the macroblock B' to be estimated can be decomposed into a combination of the pixel values of the four adjacent macroblocks B_i, i = 1, 2, 3, 4, in the first direction and the second direction. Here, B_1' is the combination in the first direction of the pixel values of macroblocks B_i, i = 1, 2; B_2' is the combination in the first direction of the pixel values of macroblocks B_i, i = 3, 4; and B' is the combination in the second direction of the pixel values of B_i', i = 1, 2.
And calculating the pixel value and the real measurement value of the macroblock to be estimated according to the four adjacent macroblocks.
The calculating the pixel value of the macroblock to be estimated comprises the following steps:
and carrying out modeling operation on the macro block to be estimated and the four adjacent macro blocks to obtain the pixel value of the macro block to be estimated.
The formula of the modeling operation is as follows:
x(B') = Σ_{i=1}^{4} Γ_i x(B_i)

where x(B') represents the pixel value of the macroblock to be estimated, Σ represents the summation operation, Γ_i represents the positional relationship matrix of the four adjacent macroblocks, and x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4.
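For illustration, the following sketch realizes the Γ_i as 0/1 selection matrices for a motion vector MV = (m, n) with 0 ≤ m, n < 8. The 2 × 2 layout of the neighbors (B_1 top-left, B_2 top-right, B_3 bottom-left, B_4 bottom-right) and the explicit construction are assumptions; the patent only states the linear form above.

```python
import numpy as np

def gamma_matrices(mv, b=8):
    """Build the four 0/1 positional relationship matrices Gamma_i for
    motion vector mv = (m, n), 0 <= m, n < b, under the assumed layout:
    the estimated block B' is the b x b window at offset (m, n) inside
    the 2b x 2b composite of the four adjacent blocks."""
    m, n = mv
    gammas = [np.zeros((b * b, b * b)) for _ in range(4)]
    for r in range(b):
        for c in range(b):
            R, C = m + r, n + c                  # position in the composite
            q = 2 * (R >= b) + (C >= b)          # 0: B1, 1: B2, 2: B3, 3: B4
            gammas[q][r * b + c, (R % b) * b + (C % b)] = 1.0
    return gammas

# x(B') = sum_i Gamma_i x(B_i): compose the displaced block from its neighbors
b, mv = 8, (3, 1)
x_blocks = [np.random.rand(b * b) for _ in range(4)]    # x(B_i), i = 1..4
x_est = sum(G @ x for G, x in zip(gamma_matrices(mv, b), x_blocks))
```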
The calculating the true measurement value of the macro block to be estimated comprises the following steps:
and calculating the measured values of the four adjacent macro blocks according to a preset rule.
The preset rule is as follows:
y(B_i) = Φx(B_i)
where y(B_i) denotes the measured value of a macroblock, x(B_i) denotes the pixel value of a macroblock, i denotes the index of the macroblock, and Φ denotes the measurement matrix.
The specific embodiment of the invention adopts a Gaussian matrix and a partial Hadamard matrix, respectively, as the measurement matrix; the ratio of the number of rows to the number of columns of the measurement matrix is the measurement rate MR. The MR range in the specific embodiment is 0.1-0.7, which verifies the good generalization capability of the invention.
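A sketch of the two measurement matrices used in the embodiment and of the preset rule y(B_i) = Φx(B_i); the row selection and normalization conventions are assumptions, since the patent does not specify them:

```python
import numpy as np
from scipy.linalg import hadamard

def gaussian_measurement_matrix(mr, n=64, seed=0):
    """Random Gaussian measurement matrix; MR = rows / columns."""
    m = int(round(mr * n))
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, n)) / np.sqrt(m)

def partial_hadamard_matrix(mr, n=64, seed=0):
    """Randomly selected rows of the n x n Hadamard matrix
    (n must be a power of two, which holds for 8 x 8 macroblocks)."""
    m = int(round(mr * n))
    rng = np.random.default_rng(seed)
    rows = rng.choice(n, size=m, replace=False)
    return hadamard(n)[rows, :] / np.sqrt(n)

# y(B_i) = Phi x(B_i): measure one vectorized 8 x 8 macroblock at MR = 0.3
phi = gaussian_measurement_matrix(mr=0.3)   # shape (19, 64)
x_i = np.random.rand(64)                    # pixel values of one macroblock
y_i = phi @ x_i                             # measured value of the macroblock
```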
And calculating the real measured value of the macro block to be estimated according to the measured values of the four adjacent macro blocks.
The actual measurement values of the macroblock to be estimated are:
y_true = Σ_{i=1}^{4} Λ_i y(B_i) = Σ_{i=1}^{4} Λ_i Φx(B_i)

where y_true represents the true measurement value of the macroblock to be estimated, Σ represents the summation operation, Λ_i represents a weighting-coefficient matrix determined by the motion vector and the measurement matrix, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix.
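Continuing the sketches above, the true measurement can be computed exactly from the neighbors' pixel values; the closed form Λ_i ≈ ΦΓ_iΦ⁺ shown for comparison is our reading of the reference model's pseudo-inverse approximation, an assumption rather than a formula stated in the patent:

```python
import numpy as np

# Reuses gamma_matrices and gaussian_measurement_matrix from the sketches above
b, mv = 8, (3, 1)
phi = gaussian_measurement_matrix(mr=0.3)               # (19, 64)
gammas = gamma_matrices(mv, b)
x_blocks = [np.random.rand(b * b) for _ in range(4)]    # x(B_i)
y_blocks = [phi @ x for x in x_blocks]                  # y(B_i) = Phi x(B_i)

# Exact true measurement: y_true = Phi x(B') = sum_i Phi Gamma_i x(B_i)
y_true = sum(phi @ G @ x for G, x in zip(gammas, x_blocks))

# Pseudo-inverse approximation (assumption): Lambda_i ~ Phi Gamma_i pinv(Phi),
# so y_true ~ sum_i Lambda_i y(B_i), which is the role the CNN branches learn
phi_pinv = np.linalg.pinv(phi)
y_approx = sum(phi @ G @ phi_pinv @ y for G, y in zip(gammas, y_blocks))
```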
And constructing a convolutional neural network model, and calculating a prediction measurement value of the macro block to be estimated.
The convolutional neural network model comprises four convolutional neural networks and one perceptron layer. Each convolutional neural network comprises 3 convolutional layers, each composed of several two-dimensional feature-mapping planes, with the ReLU activation function y = max(0, x); in particular, because the measurement-domain vector dimension is relatively low, the convolutional neural networks contain no pooling layer. The perceptron layer adopts a fully connected network structure, and its activation function is the identity function y = x.
The convolutional neural network in the convolutional neural network model is described below with reference to fig. 3.
The input size of the convolutional neural network in fig. 3 is 64 × 64; each intermediate layer is composed of several feature maps, each feature map is composed of several neural units, and all neural units of the same feature map share one convolution kernel. The size of the first-layer convolution kernel is set to 1 × 1 with stride (1, 3); the size of the second-layer convolution kernel is 1 × 1 with stride (1, 3); the number of feature maps is set to 1 in both the first and second layers. The convolution kernel is translated over the two-dimensional plane of the image; each element of the kernel is multiplied by the corresponding position of the convolved image and the products are summed. The feature output of the next layer is obtained as the kernel moves continuously. The convolution operation can enhance the original signal features and reduce noise.
The overall framework of the convolutional neural network model is described below with reference to fig. 4.
In FIG. 4, ΦΓ_i is input to the corresponding convolutional neural network f_i, and the corresponding output, obtained through the transformations and operations of the convolutional neural network, is denoted f_i(ΦΓ_i), i = 1, 2, 3, 4. The measured values y(B_i) of the four macroblocks adjacent to the macroblock to be estimated in the first and second directions are combined with the outputs f_i(ΦΓ_i) of the corresponding convolutional neural networks, i = 1, 2, 3, 4, through a weighted linear summation with the perceptron-layer weight parameters w_i to obtain the output y_pred of the convolutional neural network model, namely:

y_pred = Σ_{i=1}^{4} w_i f_i(ΦΓ_i) y(B_i)

where y_pred represents the predicted measurement value of the macroblock to be estimated, Σ represents the summation operation, w_i represents the weights of the four adjacent macroblocks, f_i represents the convolutional neural network corresponding to each adjacent macroblock, ΦΓ_i represents the input of that convolutional neural network, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix.
The convolutional neural network thus converts the input ΦΓ_i into a weighting-coefficient matrix Λ_i, which has the same transforming effect as the pseudo-inverse of the measurement matrix Φ.
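A minimal Keras sketch of the model of fig. 3 and fig. 4 at MR = 0.3 (M = 19 measurements per block). The 1 × 1 kernels, (1, 3) strides, single feature maps, ReLU activations, absence of pooling, and identity-activation perceptron layer follow the description above; the third convolutional layer's parameters, the zero-padding of ΦΓ_i to a 64 × 64 input, and the Dense mapping from the convolutional output to the M × M matrix Λ_i are assumptions, since the patent leaves them unspecified:

```python
import tensorflow as tf

M, N = 19, 64   # M = round(0.3 * 64) measurements per block, N = 64 pixels

def build_branch():
    """One branch CNN f_i: three convolutional layers, no pooling layer."""
    inp = tf.keras.Input(shape=(N, N, 1))   # Phi * Gamma_i, zero-padded to 64 x 64
    h = tf.keras.layers.Conv2D(1, (1, 1), strides=(1, 3), activation="relu")(inp)
    h = tf.keras.layers.Conv2D(1, (1, 1), strides=(1, 3), activation="relu")(h)
    h = tf.keras.layers.Conv2D(1, (1, 1), activation="relu")(h)  # assumed params
    h = tf.keras.layers.Flatten()(h)
    lam = tf.keras.layers.Reshape((M, M))(tf.keras.layers.Dense(M * M)(h))
    return tf.keras.Model(inp, lam)          # output plays the role of Lambda_i

mat_inputs = [tf.keras.Input(shape=(N, N, 1)) for _ in range(4)]   # Phi Gamma_i
y_inputs = [tf.keras.Input(shape=(M,)) for _ in range(4)]          # y(B_i)

# f_i(Phi Gamma_i) y(B_i): each branch produces Lambda_i, applied to y(B_i)
branch_outputs = []
for mat_in, y_in in zip(mat_inputs, y_inputs):
    lam = build_branch()(mat_in)
    o = tf.keras.layers.Lambda(lambda t: tf.linalg.matvec(t[0], t[1]))([lam, y_in])
    branch_outputs.append(o)

# Perceptron layer: weighted linear sum with identity activation (weights w_i)
stacked = tf.keras.layers.Lambda(lambda ts: tf.stack(ts, axis=-1))(branch_outputs)
y_pred = tf.keras.layers.Dense(1, use_bias=False)(stacked)   # (batch, M, 1)
y_pred = tf.keras.layers.Reshape((M,))(y_pred)

model = tf.keras.Model(mat_inputs + y_inputs, y_pred)
```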
The convolutional neural network model is trained according to the true measurement value and the predicted measurement value. The specific embodiment of the invention adopts a mini-batch stochastic gradient descent optimization method with a stepwise learning-rate adjustment scheme, and the loss function is defined as the measurement-domain correlated noise J = ||y_pred - y_true||_2 / ||y_true||_2 (a training sketch is given after step (3) below).
Let X denote an input comprising: the measurement matrix Φ, the spatial position relation matrices Γ_i of the four adjacent macroblocks, and the measured values y(B_i) of the four adjacent macroblocks, i = 1, 2, 3, 4; let y denote the label, y = y_true.
The training of the convolutional neural network model according to the real measurement value and the prediction measurement value comprises the following steps:
(1) Forward propagation phase: a batch of samples X is input into the convolutional neural network model to compute the corresponding output. In this phase, information is transformed step by step from the input layer of the model; sufficiently complex nonlinear features are extracted and combined using the weight-sharing and local-connection properties of the convolutional layers and passed to the output layer of the model.
(2) Backward propagation phase: the error between the actual output y_pred of the convolutional neural network model and the sample label y_true is computed, and the error back-propagation algorithm is used to propagate it back layer by layer and adjust the weights of the model, so that the measurement-domain correlated noise decreases continuously.
(3) Steps (1) and (2) are repeated, with the next batch continually fed into the convolutional neural network model for training, until the loss function at the output layer of the model falls below 0.003; training then yields the optimal parameters w_1, w_2, w_3, w_4 and the weights and biases of the convolutional neural networks f_1, f_2, f_3, f_4.
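A hedged sketch of this training stage: the loss J, a callback that stops training once the loss falls below the preset threshold 0.003, and mini-batch SGD with a step-decay learning rate (the exact schedule is an assumption). Synthetic stand-in data replaces the Foreman samples, and `model` is the four-branch network sketched above:

```python
import numpy as np
import tensorflow as tf

def measurement_domain_noise(y_true, y_pred):
    """Loss J = ||y_pred - y_true||_2 / ||y_true||_2, computed per sample."""
    return tf.norm(y_pred - y_true, axis=-1) / tf.norm(y_true, axis=-1)

class StopBelowThreshold(tf.keras.callbacks.Callback):
    """Stop training once the loss drops below the preset threshold."""
    def __init__(self, threshold=0.003):
        super().__init__()
        self.threshold = threshold
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", np.inf) < self.threshold:
            self.model.stop_training = True

# Step-decay learning-rate schedule (assumed form of the gradual adjustment)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.5, staircase=True)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
              loss=measurement_domain_noise)

# Synthetic stand-in data: X = [Phi*Gamma_1..4 padded to 64 x 64, y(B_1)..y(B_4)]
X_train = ([np.random.rand(300, 64, 64, 1).astype("float32") for _ in range(4)]
           + [np.random.rand(300, 19).astype("float32") for _ in range(4)])
y_train = np.random.rand(300, 19).astype("float32")   # label y = y_true

model.fit(X_train, y_train, batch_size=300, epochs=100,
          callbacks=[StopBelowThreshold(0.003)])
```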
Based on the trained model, an average relative error of 0.0038 was measured given MR = 0.3 and MV = (3, 1). As the MR and MV changed, the measured average relative error fluctuated between 0.0039 and 0.0068. The convolutional neural network model thus has stronger modeling capability than the traditional method.
The effects of the present invention can be further explained by the following simulation experiments.
1. Simulation experiment conditions are as follows:
the experimental simulation environment of the invention is as follows:
operating the system: ubuntu 14.04, python2.7
An experiment platform: tensorflow-1.4
A processor: intel Core i7-7700k CPU @4.2GHZ 8
A display card: NVIDIA 1080Ti GPU
Memory: 15.6GB
The data source used in the simulation experiments is the representative Foreman CIF video sequence. The first 15 frames of the Foreman CIF sequence are compressively sampled, for a total of 11424 samples, randomly divided into a training set of 9632 and a test set of 1792. The motion vector is set to MV = (3, 1), where 3 denotes horizontal motion and 1 denotes vertical motion; the measurement rate MR is 0.3.
2. Simulation experiment contents:
the simulation experiment of the invention is divided into two simulation experiments.
Simulation experiment I: in the same simulation environment (fig. 5), with the Gaussian measurement matrix used for both the reference method and the proposed method, the correlated noise of the proposed method is almost one sixtieth that of the reference method. (For a clear comparison on one graph, the logarithmic correlated-noise error of the proposed method with the Gaussian measurement matrix is scaled down by a factor of 15.)

In the same simulation environment (fig. 5), with the partial Hadamard measurement matrix used for both the reference method and the proposed method, the correlated noise of the reference method is 2-3 times that of the proposed method. (For a clear comparison on one graph, the logarithmic correlated-noise errors of both methods are simultaneously scaled down by a factor of 15.)
Simulation experiment II: accurate macroblock estimation under different motion vectors and measurement rates.
Correlation analysis experiments were run with different MVs and MRs using the Gaussian measurement matrix and a batch size of 300. In fig. 6, with the measurement-domain correlated noise (denoted CN) and the time complexity (in seconds) as the evaluation criteria, the proposed method remains greatly superior to the reference method as MV and MR change.
3. Simulation experiment result analysis:
As can be seen from fig. 5, when the measurement matrix is a Gaussian matrix or a partial Hadamard matrix, for a given measurement rate and motion vector, the logarithmic error of the measurement-domain correlated noise of the proposed method has a great advantage over the existing approximate model. As can be seen from fig. 6, as the measurement rate varies over 0.1-0.7 with motion vectors of (2, 2) and (-2, -2) respectively, the proposed method is far superior to the existing approximate model in both the correlated-noise error and the time complexity. In summary, the invention uses a convolutional neural network to train the pseudo-inverse of the measurement matrix and the block weights, so that the estimation model for an arbitrary macroblock in a video frame is markedly improved in accuracy and robustness, in some cases by more than two orders of magnitude. Meanwhile, the proposed model is near real-time: its processing time is almost one eighth that of the existing approximate model, and it can process image and video information in resource-limited environments.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (3)

1. A video signal measurement domain estimation method based on compressed sensing and a convolutional neural network is characterized by comprising the following steps:
dividing a video sequence into a plurality of image groups, wherein each image group comprises at least two frames of image data, selecting a first frame of each image group as a reference frame, and dividing the reference frame image data into a plurality of macro blocks;
selecting a macro block to be estimated at any position in the reference frame image data, and selecting four adjacent macro blocks in a first direction and a second direction with the macro block to be estimated;
calculating the pixel value and the real measured value of the macro block to be estimated according to the four adjacent macro blocks; the calculating the pixel value of the macroblock to be estimated comprises the following steps: modeling operation is carried out on the macro block to be estimated and the four adjacent macro blocks to obtain a pixel value of the macro block to be estimated; the calculating the true measurement value of the macro block to be estimated comprises the following steps: calculating the measured values of four adjacent macro blocks according to a preset rule; calculating the real measured value of the macro block to be estimated according to the measured values of the four adjacent macro blocks;
constructing a convolutional neural network model, and calculating a predicted measurement value of a macro block to be estimated;
training the convolutional neural network model according to the real measured value and the predicted measured value, and obtaining an optimal parameter when a loss function of an output layer of the convolutional neural network model is lower than a preset threshold value;
the formula of the modeling operation is as follows:
x(B') = Σ_{i=1}^{4} Γ_i x(B_i)

where x(B') represents the pixel value of the macroblock to be estimated, Σ represents the summation operation, Γ_i represents the positional relationship matrix of the four adjacent macroblocks, and x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4;
the preset rule is as follows:
y(B_i) = Φx(B_i)
where y(B_i) denotes the measured value of a macroblock, x(B_i) denotes the pixel value of a macroblock, i denotes the index of the macroblock, and Φ denotes the measurement matrix;
the prediction measurement value of the macroblock to be estimated is as follows:
y_pred = Σ_{i=1}^{4} w_i f_i(ΦΓ_i) y(B_i)

where y_pred represents the predicted measurement value of the macroblock to be estimated, Σ represents the summation operation, w_i represents the weights of the four adjacent macroblocks, f_i represents the convolutional neural network corresponding to each adjacent macroblock, ΦΓ_i represents the input of that convolutional neural network, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix;
the actual measurement values of the macroblock to be estimated are:
y_true = Σ_{i=1}^{4} Λ_i y(B_i) = Σ_{i=1}^{4} Λ_i Φx(B_i)

where y_true represents the true measurement value of the macroblock to be estimated, Σ represents the summation operation, Λ_i represents a weighting-coefficient matrix determined by the motion vector and the measurement matrix, y(B_i) represents the measured values of the four adjacent macroblocks, x(B_i) represents the pixel values of the four adjacent macroblocks, i = 1, 2, 3, 4, and Φ represents the measurement matrix;
the loss function is measurement domain correlated noise, and the expression is as follows:
J = ||y_pred - y_true||_2 / ||y_true||_2

where J denotes the measurement-domain correlated noise, y_true represents the true measurement value of the macroblock to be estimated, and y_pred represents the predicted measurement value of the macroblock to be estimated.
2. The method of claim 1, wherein the convolutional neural network model comprises: four convolutional neural networks and one perceptron layer.
3. The method of claim 1, wherein the preset threshold is 0.003.
CN201810831091.1A 2018-07-26 2018-07-26 Video signal measurement domain estimation method based on compressed sensing and convolutional neural network Active CN109168002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810831091.1A CN109168002B (en) 2018-07-26 2018-07-26 Video signal measurement domain estimation method based on compressed sensing and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810831091.1A CN109168002B (en) 2018-07-26 2018-07-26 Video signal measurement domain estimation method based on compressed sensing and convolutional neural network

Publications (2)

Publication Number Publication Date
CN109168002A CN109168002A (en) 2019-01-08
CN109168002B true CN109168002B (en) 2020-06-12

Family

ID=64898202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810831091.1A Active CN109168002B (en) 2018-07-26 2018-07-26 Video signal measurement domain estimation method based on compressed sensing and convolutional neural network

Country Status (1)

Country Link
CN (1) CN109168002B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192334B (en) * 2020-01-02 2023-06-06 苏州大学 Trainable compressed sensing module and image segmentation method
CN111354051B (en) * 2020-03-03 2022-07-15 昆明理工大学 Image compression sensing method of self-adaptive optimization network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546749B (en) * 2013-10-14 2017-05-10 上海大学 Method for optimizing HEVC (high efficiency video coding) residual coding by using residual coefficient distribution features and bayes theorem
US10397498B2 (en) * 2017-01-11 2019-08-27 Sony Corporation Compressive sensing capturing device and method
CN107784676B (en) * 2017-09-20 2020-06-05 中国科学院计算技术研究所 Compressed sensing measurement matrix optimization method and system based on automatic encoder network

Also Published As

Publication number Publication date
CN109168002A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
WO2020037965A1 (en) Method for multi-motion flow deep convolutional network model for video prediction
CN110933429B (en) Video compression sensing and reconstruction method and device based on deep neural network
CN111626245B (en) Human behavior identification method based on video key frame
CN107992938B (en) Space-time big data prediction technique and system based on positive and negative convolutional neural networks
CN104199627B (en) Gradable video encoding system based on multiple dimensioned online dictionary learning
CN107680116A (en) A kind of method for monitoring moving object in video sequences
CN110751649A (en) Video quality evaluation method and device, electronic equipment and storage medium
CN108182694B (en) Motion estimation and self-adaptive video reconstruction method based on interpolation
CN102148987A (en) Compressed sensing image reconstructing method based on prior model and 10 norms
CN109168002B (en) Video signal measurement domain estimation method based on compressed sensing and convolutional neural network
Zhao et al. Image compressive-sensing recovery using structured laplacian sparsity in DCT domain and multi-hypothesis prediction
CN102946539B (en) Method for estimating motion among video image frames based on compressive sensing
CN116152591A (en) Model training method, infrared small target detection method and device and electronic equipment
Hu et al. Optimized spatial recurrent network for intra prediction in video coding
CN110728728A (en) Compressed sensing network image reconstruction method based on non-local regularization
Liu et al. Diverse hyperspectral remote sensing image synthesis with diffusion models
Di et al. Learned compression framework with pyramidal features and quality enhancement for SAR images
Ma High-resolution image compression algorithms in remote sensing imaging
Pei et al. MobileViT-GAN: A Generative Model for Low Bitrate Image Coding
Sun et al. Video snapshot compressive imaging using residual ensemble network
CN105894485B (en) A kind of adaptive video method for reconstructing based on signal correlation
CN108769674A (en) A kind of video estimation method based on adaptive stratification motion modeling
Liu et al. An end-to-end multi-scale residual reconstruction network for image compressive sensing
CN114663307B (en) Integrated image denoising system based on uncertainty network
CN112085779A (en) Wave parameter estimation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant