CN108923984B - Space-time video compressed sensing method based on convolutional network - Google Patents

Space-time video compressed sensing method based on convolutional network

Info

Publication number
CN108923984B
Authority
CN
China
Prior art keywords
video
time
space
network
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810777563.XA
Other languages
Chinese (zh)
Other versions
CN108923984A (en
Inventor
Xie Xuemei
Liu Wan
Zhao Zhifu
Wang Fangyu
Shi Guangming
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810777563.XA
Publication of CN108923984A
Application granted
Publication of CN108923984B
Legal status: Active
Anticipated expiration

Classifications

    • H04L 41/044: Network management architectures or arrangements comprising hierarchical management structures
    • H04L 41/0823: Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04N 19/136: Adaptive coding characterised by incoming video signal characteristics or properties
    • H04N 19/149: Data rate or code amount at the encoder output, estimated by means of a model, e.g. mathematical or statistical model
    • H04N 19/154: Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H04N 19/176: Adaptive coding where the coding unit is an image region, the region being a block, e.g. a macroblock
    • H04N 19/30: Coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N 19/85: Coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression

Abstract

The invention discloses a convolutional-network-based spatio-temporal video compressed sensing method, which mainly solves the prior-art problems of poor spatio-temporal balance in video compression and poor real-time performance of reconstructed video. The scheme is as follows: prepare a training data set; design the network structure of the space-time video compressed sensing method; write training and testing files according to the designed network structure; train the network of the space-time video compressed sensing method; and test the network of the space-time video compressed sensing method. The network adopts an observation technique that compresses time and space simultaneously and a reconstruction technique that uses "space-time blocks" to enhance spatio-temporal correlation. It not only realizes real-time video reconstruction, but also yields reconstructed results with strong spatio-temporal balance, high reconstruction quality, and stability, and can be used for video compression transmission and subsequent video reconstruction.

Description

Space-time video compressed sensing method based on convolutional network
Technical Field
The invention belongs to the technical field of video processing, generally relates to video compressed sensing, and particularly relates to a space-time video compressed sensing method based on a convolutional network, which can be used to realize real-time, high-quality video compressed sensing reconstruction.
Background
Compressed sensing (CS) is a compressive sampling theory that samples a signal below the Nyquist sampling rate and recovers the original signal with a reconstruction algorithm. The theory has been successfully applied in various signal processing fields, such as medical imaging and radar imaging. After the advent and popularization of hardware systems such as the single-pixel camera, compressed sensing was applied to still-image compression and showed excellent potential. Today compressed sensing is no longer limited to still images but has been generalized to video. Compared with a still image, video compression must consider correlation in the time dimension, which makes video compressed sensing (VCS) more complicated.
Methods that apply compressed sensing theory to video fall roughly into spatial video compressed sensing and temporal video compressed sensing, whose observation processes are realized by a spatial multiplexing camera (SMC) and a temporal multiplexing camera (TMC), respectively. In spatial VCS the input video is observed and reconstructed frame by frame, while in temporal VCS multiple consecutive frames of the input video are observed and reconstructed together. In recent years many methods have realized video compressed sensing and achieved good video reconstruction. However, iterative optimization methods have high time complexity and either cannot reconstruct video in real time or have poor real-time performance, so they cannot meet practical requirements.
With the development of deep learning, deep neural networks (DNNs) are widely used in the fields of image and video processing, such as CS, super-resolution, rain removal, denoising, and restoration, and have achieved notable results. Since a DNN is trained offline and tested online, once training is complete the test process requires only a forward pass, which greatly shortens reconstruction time compared with traditional methods.
Among existing DNN-based VCS methods, some adopt fully connected networks for temporal compressed sensing reconstruction; their excessive parameter counts lead to high time complexity, so real-time reconstruction is impossible. Others adopt convolutional neural networks for spatial compressed sensing reconstruction and enhance the relation between video frames on top of an initial recovery, obtaining better reconstruction. Because all of these methods compress (also called observe) in only a single dimension, space or time, the resolution of the obtained observations in the compressed dimension is low. During reconstruction, the correlation between pixels of the result along the compressed dimension is insufficient and the information of the compressed dimension is hard to recover, which degrades the overall video reconstruction.
Disclosure of Invention
The object of the invention is to provide, in view of the defects of the prior art, a convolutional-network-based spatio-temporal video compressed sensing method with balanced spatio-temporal resolution, better reconstruction performance, and stable reconstructed video results.
The invention relates to a space-time video compressed sensing method based on a convolutional network, which is characterized by comprising the following steps of:
1) preparing a training data set: download videos of the required different resolutions and preprocess them all: convert each downloaded video into grayscale video frames in sequence, cut the video frames into small blocks of a fixed size by spatial position, save the small blocks of each spatial position as pictures in separate subfolders, and name the pictures in the order of video time frames, i.e., 1.jpg, 2.jpg, and so on; all subfolders together form one folder, which is used as the training data set; the test data set is any randomly selected video, likewise converted to grayscale video and stored in a folder;
2) designing the network structure of the space-time video compressed sensing method: the network structure comprises an observation part and a reconstruction part; the observation part feeds the input video block into a three-dimensional convolutional layer whose output serves as the observation, and the reconstruction part connects the observation output in sequence to a three-dimensional deconvolution layer, several "space-time blocks", a BN layer, and a three-dimensional convolutional layer to obtain the reconstructed video block; each "space-time block" is formed by connecting four three-dimensional convolutional layers in series, adding a BN layer before each three-dimensional convolutional layer, and then connecting the output of the first three-dimensional convolutional layer to the output of the fourth three-dimensional convolutional layer through a residual connection;
3) compiling training and testing files according to the designed network structure:
3a) establishing a project folder, and creating in it the program files for training, testing, the network structure, the network settings, and functions;
3b) setting and writing the related file contents: set reasonable network hyper-parameters in the network setting file, and write the functions required by the training and testing code in the function file; write the network structure of the space-time video compressed sensing method in the network structure file;
3c) defining the training process of the network in the training file: take several video frames in sequence from a subfolder of the training data set to form a video block that serves as the input video block of the network of the space-time video compressed sensing method; compute the mean-square-error reconstruction loss between the reconstructed video block output by the network and the input video block, and back-propagate the reconstruction loss to update the network parameters;
3d) defining the testing process of the network in the test file: convert the grayscale video of the test data set into video frames, divide the grayscale frames into groups of consecutive frames in time order, and input each group as an input video block into the network of the space-time video compressed sensing method to obtain the corresponding reconstructed video block, which contains the same number of frames as the input video block; arrange the reconstructed video blocks in time order to obtain the reconstructed video, which is the test result of the network of the space-time video compressed sensing method;
4) training the network of the space-time video compressed sensing method: load the network structure and related parameters through the network structure file, initialize all parameters, and load the hyper-parameters through the network setting file; using the stochastic gradient descent algorithm and the training process defined in the training file, train the network on the training data set for multiple passes, updating its parameters, and obtain the final parameter model when training is finished;
5) testing the network of the space-time video compressed sensing method: load the network structure through the network structure file, load the final parameter model as the network parameters, and, following the test process defined in the test file, test the network of the space-time video compressed sensing method with the test data set to obtain real-time, high-quality, highly spatio-temporally balanced reconstructed test video results of spatio-temporal video compressed sensing.
The observation part of the network of the space-time video compressed sensing method designed by the invention adopts a three-dimensional convolutional layer, which yields observations with balanced spatio-temporal resolution. The reconstruction part uses a three-dimensional deconvolution layer and "space-time blocks", which reduces the number of network parameters, removes blocking artifacts, and improves the balance of the reconstructed video across the space and time dimensions, finally realizing real-time, high-quality, highly spatio-temporally balanced video reconstruction.
Compared with the prior art, the invention has the following advantages:
1. The observation process enhances the equality of spatio-temporal compression: in compressed sensing, video observation differs greatly from still-image observation, since it must consider not only the correlation between pixels within a frame in the spatial dimension but also the correlation between frames in the temporal dimension. The invention remedies the defect of existing methods that compress and observe in only the time dimension or only the space dimension, which gives the observations poor spatio-temporal balance and makes the reconstruction quality differ greatly between the time and space dimensions, leading to poor reconstructed video quality.
2. Video can be reconstructed in real time with high quality: existing video observation methods neglect spatio-temporal balance in the observation process, so the reconstructed video quality is poor. In addition, traditional VCS methods adopt iterative optimization, whose computational complexity is very high and which cannot realize real-time video reconstruction. Existing neural-network-based video compressed sensing methods have large numbers of network parameters, so video reconstruction takes long and real-time reconstruction is impossible. The invention is based on a fully convolutional neural network, uses few parameters, and has an extremely short forward-propagation time, so real-time reconstruction is achievable; moreover, the "space-time blocks" in the reconstruction part of the network of the space-time video compressed sensing method enhance spatio-temporal correlation, and combined with observations of strong spatio-temporal balance they greatly improve the video reconstruction effect, yielding high-quality reconstructed video.
3. The invention reconstructs video with high stability: every frame of the reconstructed video has similarly good reconstruction quality, avoiding the visually jarring effect of uneven quality across reconstructed frames, so video files can be compressed and transmitted more soundly and the integrity of the video is guaranteed throughout compression and reconstruction.
Drawings
FIG. 1 is a flow chart of a spatio-temporal video compressed sensing method based on a convolutional network according to the present invention;
FIG. 2 is a network structure diagram of the spatiotemporal video compressed sensing method of the present invention;
FIG. 3 is a schematic diagram of a "space-time block" unit in the reconstruction part of the present invention;
FIG. 4 shows the result of reconstructing video frames when testing the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
Example 1
Today compressed sensing is no longer limited to still images but has been generalized to video. Compared with a still image, video compression must consider correlation in the time dimension, which makes video compressed sensing (VCS) more complicated. Video compressed sensing compressively samples a video, reducing storage space and greatly increasing transmission speed, and the video reconstructed from the transmitted data can support more complex tasks such as target detection and tracking. Existing methods compress (also called observe) in only a single spatial or temporal dimension, so the resolution of the obtained observations in the compressed dimension is very low. During reconstruction, the correlation between pixels of the result along the compressed dimension is insufficient, the information of the compressed dimension is hard to recover, the overall reconstruction quality drops, and the quality of the reconstructed video is unstable. Addressing this phenomenon, the invention explores a better approach and provides a space-time video compressed sensing method based on a convolutional network, shown in FIG. 1, comprising the following steps:
1) preparing a training data set: download videos of the required different resolutions and, to simplify the training process, preprocess them all: convert each downloaded video into grayscale video frames in sequence, cut the frames into small blocks of a fixed size by spatial position, save the small blocks of each spatial position as pictures in separate subfolders, and name the pictures in the order of video time frames, i.e., 1.jpg, 2.jpg, and so on, where n.jpg comes from the n-th frame of the video; all subfolders together form one folder, which is used as the training data set; the test data set is any randomly selected video, likewise converted to grayscale video frames and stored in a folder.
2) Designing the network structure of the space-time video compressed sensing method: the network structure comprises an observation part and a reconstruction part. The observation part feeds the input video block into a three-dimensional convolutional layer whose output serves as the observation; the three-dimensional convolutional layer performs convolution in both the spatial and temporal dimensions, and by separately adjusting the compression ratios of its convolution kernel in space and in time, observation information can be reasonably distributed across the two dimensions to obtain observations with balanced spatio-temporal resolution. Existing methods usually apply a two-dimensional convolutional layer that observes each time frame identically in the spatial dimension, so the correlation in the video's temporal dimension cannot be extracted and the reconstructed video performs poorly in the time dimension. The reconstruction part connects the observation output in sequence to a three-dimensional deconvolution layer, several "space-time blocks", a BN layer, and a three-dimensional convolutional layer to obtain the reconstructed video block. Each "space-time block" is formed by connecting four three-dimensional convolutional layers in series, adding a BN layer before each of them, and then connecting the output of the first three-dimensional convolutional layer to the output of the fourth through a residual connection. The three-dimensional deconvolution layer mirrors the three-dimensional convolutional layer used for observation and raises the dimension of the observation information to obtain an initial recovery result; the purpose of this dimension raising is to map the solution space of the observation information to the solution space of the video, so that the subsequent "space-time blocks" can conveniently add more detail information to the initial recovery. The "space-time blocks" use residual connections to prevent gradient vanishing, deepen the network, add detail information to the video, and enhance the spatio-temporal correlation of the initial recovery result; finally a three-dimensional convolutional layer plays an integrating role, further enhancing the inter-frame relation while guaranteeing the recovery quality of the reconstructed video.
3) Compiling training and testing files according to the designed network structure:
3a) establishing a project folder, and creating in it the program files for training, testing, the network structure, the network settings, and functions;
3b) setting and writing the related file contents: set reasonable network hyper-parameters, including the batch size and the number of epochs, in the network setting file; write the functions required by the training and testing code in the function file; and write the network structure of the space-time video compressed sensing method in the network structure file.
3c) Defining the training process of the network in the training file: take t video frames in sequence from a subfolder of the training data set to form a video block that serves as the input video block of the network of the space-time video compressed sensing method. The value of t is positively correlated with the temporal compression ratio of the observation part of the network; it is usually adjusted according to the spatial compression ratio of the observation part so that the temporal and spatial compression ratios stay in relative balance. The input video block is propagated forward through the network to obtain the reconstructed video block output by the network of the space-time video compressed sensing method. The mean-square-error reconstruction loss is computed between the reconstructed video block and the input video block as the sum of squared distances between each pixel of the input video block and the corresponding pixel of the reconstructed video block. The reconstruction loss is back-propagated and the network parameters are updated with a gradient descent algorithm; updating the network parameters completes one training pass. When the training data set has reached its upper limit of uses, training ends; otherwise training is repeated.
3d) Defining the testing process of the network in the test file: to realize full-image reconstruction, any test video is converted into grayscale frames without cutting during testing. All grayscale frames of the test data set are divided into groups of consecutive frames: starting from the first frame of the video, every t frames form a video block, and frames left over at the end are zero-padded. Each group of t consecutive frames is input as an input video block into the network of the space-time video compressed sensing method to obtain the corresponding reconstructed video block, which contains the same number of frames as the input video block; for the last reconstructed video block, only its first non-zero-padded frames are taken as the reconstruction result. The reconstructed video blocks are arranged in time order to obtain the reconstructed video, which is the test result of the network of the space-time video compressed sensing method.
4) Training the network of the space-time video compressed sensing method: load the network structure and related parameters through the network structure file, initialize all parameters, and load the hyper-parameters through the network setting file; using the stochastic gradient descent algorithm and the training process defined in the training file, train the network on the training data set for multiple passes, updating its parameters, and obtain the final parameter model when training finishes.
5) Testing the network of the space-time video compressed sensing method: load the network structure through the network structure file, load the final parameter model as the network parameters, and, following the test process defined in the test file, test the network of the space-time video compressed sensing method with the test data set to obtain real-time, high-quality, highly spatio-temporally balanced reconstructed test video results of spatio-temporal video compressed sensing.
Existing video observation methods compressively sample in only the time dimension or the space dimension during observation, neglecting spatio-temporal balance, so the reconstructed video quality is poor. In addition, conventional VCS methods adopt iterative optimization, which makes the time complexity high and prevents real-time video reconstruction. Existing neural-network-based video compressed sensing methods have large numbers of network parameters, so video reconstruction takes long and real-time reconstruction is impossible. The convolutional-network-based spatio-temporal video compressed sensing method of the invention uses few parameters and has a very short forward-propagation time, so it can reconstruct video in real time. Meanwhile, the "space-time blocks" in the reconstruction part of the network enhance the spatio-temporal correlation of the recovered video information during reconstruction, and combined with observations of strong spatio-temporal balance they greatly improve the video reconstruction effect, yielding high-quality, stable reconstructed video.
Example 2
The convolutional-network-based spatio-temporal video compressed sensing method is implemented as in Embodiment 1; the network structure of the space-time video compressed sensing method designed in step 2) is shown in FIG. 2 and is built as follows:
2a) Setting of the three-dimensional convolutional layer of the observation part of the convolutional-network-based space-time video compressed sensing method: the convolution kernel size of the three-dimensional convolutional layer is set to T×3×3, where T (e.g., T=16) is the kernel size in the time dimension and 3×3 is the size in the space dimension; no zero padding is used in the convolution, and the stride is 3. The input video block has size T×H×W, where T is the number of frames it contains, H×W is the spatial size of each frame, and H and W are multiples of 3. When the number of convolution kernels is 1, the spatial compression ratio is 9, the temporal compression ratio is T, and the observation rate is

r = 1/(9T).

Setting the number of convolution kernels to N, the observation rate becomes

r = N/(9T).

When N = 9 and T = 16, the observation rate is

r = 9/(9×16) = 1/16.

This observation scheme therefore yields observations that are compressed in both the time and space dimensions with balanced spatio-temporal resolution, which helps the reconstruction part recover the information of both dimensions better.
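To make 2a) concrete, the following minimal sketch (an assumption on our part: the patent publishes no code, and the modern TF2/Keras API is used here rather than the TF1-era TensorFlow named later in the description) builds the observation layer and checks the output shape and observation rate for N = 9, T = 16:

```python
import tensorflow as tf

T, H, W, N = 16, 240, 360, 9          # frames, height, width, number of kernels

observe = tf.keras.layers.Conv3D(
    filters=N,
    kernel_size=(T, 3, 3),            # T in time, 3x3 in space
    strides=(T, 3, 3),                # non-overlapping: compress time by T, space by 9
    padding='valid',
    use_bias=False)

video_block = tf.random.uniform((1, T, H, W, 1))  # (batch, time, H, W, channels)
y = observe(video_block)                          # -> (1, 1, 80, 120, 9)

rate = N / (9 * T)                    # observation rate N/(9T) = 9/(9*16) = 1/16
print(y.shape, rate)
```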
2b) Setting of the three-dimensional deconvolution layer of the reconstruction part, to which the observation output is first connected: the convolution kernel of the three-dimensional deconvolution layer is set according to the symmetry of convolution and deconvolution. When the kernel of the deconvolution has the same size as the kernel of the corresponding convolution, the deconvolution output has the same size as the convolution input; therefore the kernel of the three-dimensional deconvolution layer is set the same as that of the three-dimensional convolutional layer of the observation part, namely size T×3×3 and number N, with no zero padding in the convolution and stride 3.
2c) Design of the several "space-time blocks" of the reconstruction part, to which the three-dimensional deconvolution layer is connected: each "space-time block" is formed by connecting four three-dimensional convolutional layers in series, with a BN layer added before each of them; the input of the second three-dimensional convolutional layer is connected to the output of the fourth through a residual connection, so the second, third, and fourth three-dimensional convolutional layers form a residual block; each "space-time block" is thus a three-dimensional convolutional layer connected in series with a residual block.
2d) Setting of the final three-dimensional convolutional layer of the reconstruction part: its convolution kernels have size 16×1×1 and number 16, with stride 1 and no zero padding in the convolution. This three-dimensional convolutional layer further integrates and enhances the inter-frame information to obtain the final reconstructed video frames.
The designed network structure of the space-time video compressed sensing method adopts a three-dimensional convolutional layer in the observation part that compressively samples the input video block in both the time and space dimensions. Compared with existing methods that sample in only one of the two dimensions, it obtains spatio-temporally balanced samples; and compared with existing methods that use a Gaussian matrix as the observation matrix, its learned convolution performs more reasonable compressive sampling. The reconstruction part of the network consists of a three-dimensional deconvolution layer, several "space-time blocks", a BN layer, and a three-dimensional convolutional layer. The three-dimensional deconvolution layer is the symmetric operation of the observation convolution and raises the dimension of the observation result back to that of the input video block, which facilitates further recovery by the subsequent network. The "space-time blocks" use residual connections, which can add more detail information to the reconstructed video: a residual connection learns the difference between the input and output of the residual block, so it focuses on the detail information in the video. The final three-dimensional convolutional layer plays an integrating role so that the reconstructed video block output by the network has the same size as the input video block.
Example 3
The convolutional-network-based spatio-temporal video compressed sensing method is implemented as in Embodiments 1-2; each "space-time block" in step 2c) is a three-dimensional convolutional layer connected in series with a residual block, as shown in FIG. 3, built as follows:
2c1) Setting of the three-dimensional convolutional layer within each "space-time block": its convolution kernels have size 16×1×1 and number 16, with no zero padding in the convolution and stride 1; because the kernel size in the spatial dimension is 1×1, it integrates the inter-frame information at each spatial position and thus enhances the inter-frame relation.
2c2) Setting of the residual block within each "space-time block": the residual block comprises three three-dimensional convolutional layers, whose convolution kernels have sizes 16×3×3, 64×1×1, and 32×3×3 and numbers 64, 32, and 16 respectively, all with no zero padding in the convolution and stride 1. Because kernels with spatial size 3×3 fuse spatio-temporal information and kernels with spatial size 1×1 integrate inter-frame information, this arrangement enhances the spatio-temporal information. The input of the residual block is connected to its output for a summation, and a Tanh activation layer is added after the summation.
2c3) A BN layer is added before each three-dimensional convolutional layer of 2c1) and 2c2) to accelerate convergence, followed by a PReLU to enhance the nonlinear capability of the network;
2c4) Several "space-time blocks" may be cascaded, that is, the output of one "space-time block" serves as the input of the next, to expand the network capacity; each block has the same structure, built as described in 2c1)-2c3). The invention may use one or several "space-time blocks" according to the actual situation: the number of "space-time blocks" trades off against the video reconstruction time; in other words, one "space-time block" may be adopted when the real-time requirement is strict.
The "space-time block" of the reconstruction part of the network of the space-time video compressed sensing method designed by the invention is a three-dimensional convolutional layer connected in series with a residual block. Its main functional part is the residual block, in which three three-dimensional convolutional layers are connected in series and the input of the first is connected to the output of the third: the input value of the first three-dimensional convolutional layer and the output value of the third are added together as the input of the next layer of the network. The residual block therefore learns only the difference between its input and output, amplifying the small differences between them for processing, so that when reconstructing video the network learns to add detail information to the video.
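A sketch of one "space-time block" under stated assumptions follows: the translated patent fixes the channel counts (16 for the leading layer, 64/32/16 inside the residual block) but leaves the temporal kernel extents and padding ambiguous, so the (3,1,1)/(3,3,3)/(1,1,1) kernel shapes and the 'same' padding (needed for the residual sum to be shape-compatible) are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bn_act_conv(x, filters, kernel_size):
    # BN before each 3D conv (2c3), PReLU as the nonlinearity
    x = layers.BatchNormalization()(x)
    x = layers.PReLU(shared_axes=[1, 2, 3])(x)
    return layers.Conv3D(filters, kernel_size, padding='same')(x)

def space_time_block(x):
    # leading conv (2c1): 1x1 spatial kernel integrates inter-frame information
    x = bn_act_conv(x, 16, (3, 1, 1))
    skip = x
    # residual block (2c2): 3x3 spatial kernels fuse spatio-temporal information
    r = bn_act_conv(x, 64, (3, 3, 3))
    r = bn_act_conv(r, 32, (1, 1, 1))
    r = bn_act_conv(r, 16, (3, 3, 3))
    x = layers.Add()([skip, r])          # residual sum: the block learns only details
    return layers.Activation('tanh')(x)  # Tanh after the summation (2c2)
```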
Example 4
The method for space-time video compressed sensing based on the convolutional network is implemented as in Embodiments 1-3; reasonable network hyper-parameters are set in the network setting file of step 3b) as follows:
The number of input video blocks fed to the network of the space-time video compressed sensing method per training step is set to 20-40 and can be adjusted appropriately according to the depth of the network and the size of the videos in the training data set; the number of times epoch that the training data set is used in the whole training process is 3-7 and can be adjusted appropriately according to the number of videos in the training data set.
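For illustration, hypothetical settings consistent with these ranges might read:

```python
# Hypothetical hyper-parameter values (names are illustrative, not the patent's):
batchsize = 32   # input video blocks per training step, chosen from 20-40
epoch = 5        # passes over the training data set, chosen from 3-7
T = 16           # frames per input video block
```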
Example 5
The convolutional-network-based spatio-temporal video compressed sensing method is implemented as in Embodiments 1-4; the training process of the network defined in the training file in step 3c), referring to FIG. 1, comprises the following steps:
3c1) Processing the training data set into input video blocks: starting from the first subfolder, 16 pictures taken in ascending order of the number in the picture name form an input video block, and successive input video blocks are offset by one picture, i.e., 1.jpg-16.jpg is the first input video block, 2.jpg-17.jpg is the second, and so on. The network of the spatio-temporal video compressed sensing method receives batchsize consecutive video blocks at a time, i.e., input video blocks 1 to batchsize as the first input, input video blocks 2 to batchsize+1 as the second, and so on. If the size of each picture is H×W, each input contains Q = batchsize×T×H×W pixel values. For a large training data set, batchsize can be increased appropriately so that each training step sees more data and the training time is shortened.
3c2) One training pass of the network of the space-time video compressed sensing method: each pixel value X0 in the batchsize consecutive video blocks input to the network is first normalized to [-1, 1] via

X = X0/127.5 - 1,

then X is input into the defined network to obtain the corresponding reconstruction result X', and the mean square error between the reconstruction result X' and the input X is computed as the reconstruction error

Loss = (1/Q) Σ (X' - X)²,

where the sum runs over all Q pixel values of the input. The reconstruction error is back-propagated and the network parameters are updated, completing one training pass of the network (a code sketch of this step is given after these steps).
3c3) Repeat 3c1)-3c2) until the data in the current subfolder is used up, then continue training with the data of the next subfolder; the network model, comprising the network structure and parameters, is saved every 500 iterations; when the data of all subfolders has been used, one pass over the training data set is complete.
3c4) Judge whether epoch passes over the training data set have been completed; if so, training ends, otherwise repeat 3c1)-3c4).
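The training step of 3c2) can be sketched as follows (assumptions: TF2 eager APIs, which postdate the patent, and an illustrative learning rate):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)  # assumed learning rate

@tf.function
def train_step(model, x0):
    x = x0 / 127.5 - 1.0                      # normalize 8-bit pixels to [-1, 1]
    with tf.GradientTape() as tape:
        x_rec = model(x, training=True)       # forward pass -> reconstructed block
        loss = tf.reduce_mean(tf.square(x_rec - x))  # mean-square reconstruction error
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```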
The convolutional-network-based spatio-temporal video compressed sensing method is trained as a neural network rather than with a traditional iterative optimization algorithm; a traditional iterative optimization algorithm must iteratively optimize every video many times when reconstructing it, so its reconstruction time is long and real-time reconstruction is impossible.
Example 6
The method for space-time video compressed sensing based on the convolutional network is implemented as in Embodiments 1-5; the test code written into the test file according to the designed network structure in step 3d), referring to FIG. 1, comprises the following steps:
3d1) Processing the test data set into input video blocks: the test data set is a randomly selected video, converted into grayscale video frames without cutting and stored in a folder. For a video containing P video frames, each of size H0×W0, every T frames form an input video block, e.g., grayscale frames 1 to T are the first input video block and frames T+1 to 2T the second, so the number of complete input video blocks is

n = ⌊P/T⌋,

and the number of remaining video frames is p = P - n×T.
3d2) For the first n×T frames: read the video frames in sequence; when H0 or W0 is not a multiple of 3, zero-pad the last rows or columns so that both H and W of the new spatial size are multiples of 3; whenever the number of accumulated video frames reaches T, an input video block of size T×H×W is obtained and input into the network of the space-time video compressed sensing method to obtain the corresponding reconstructed video block.
3d3) For the last p frames: zero-pad in time and space to assemble a T×H×W video block, input it into the network of the space-time video compressed sensing method to obtain a reconstructed video block, and keep only the first p non-zero-padded frames, cropped to size H0×W0, as the reconstruction result of this video block (see the sketch after these steps).
3d4) Arrange the reconstructed video blocks in time order to obtain the reconstructed video, which is the test result of the network of the space-time video compressed sensing method;
3d5) Process each test video in turn according to 3d1)-3d4).
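A sketch of this test procedure under stated assumptions (NumPy pre/post-processing around a trained Keras model; the helper name reconstruct_video is ours):

```python
import numpy as np

def reconstruct_video(model, frames, T=16):   # frames: list of H0xW0 uint8 arrays
    H0, W0 = frames[0].shape
    H, W = -(-H0 // 3) * 3, -(-W0 // 3) * 3   # round H0, W0 up to multiples of 3
    out = []
    for i in range(0, len(frames), T):
        chunk = frames[i:i + T]
        p = len(chunk)                        # p < T only for the last block
        block = np.zeros((1, T, H, W, 1), np.float32)   # zero-pad in time and space
        for j, f in enumerate(chunk):
            block[0, j, :H0, :W0, 0] = f / 127.5 - 1.0  # normalize to [-1, 1]
        rec = model(block).numpy()[0, :, :H0, :W0, 0]   # crop back to H0 x W0
        out.extend((rec[j] + 1.0) * 127.5 for j in range(p))  # drop padded frames
    return out                                # reconstructed frames in time order
```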
When testing the network, the method for space-time video compressed sensing based on the convolutional network does not need to cut the input grayscale video as it does when training, because the network of the space-time video compressed sensing method is a fully convolutional network: once its training is complete it can process video of any spatial size, which widens the applicable range of the network and allows grayscale video of any size to be compressed and reconstructed.
A more detailed example is given below to further illustrate the invention.
Example 7
The method for space-time video compressed sensing based on the convolutional network is implemented as in Embodiments 1-6 and comprises the following steps:
step 1, preparing a training data set.
1a) Download videos of the required different resolutions and put them into a "video" folder; create an empty folder named "frame" to store grayscale video frames, and an empty folder named "patch" to store the cut grayscale video frames.
1b) Deploy CUDA acceleration on the computer, install Python, and install the third-party libraries TensorFlow and cv2 under Python.
1c) Convert the videos into grayscale video frames by writing code with the cv2 library:
1c1) Enter the "video" folder, access a video in the video set, and acquire the video name.
1c2) A video frame folder of the same name as the video is established under the "frame" folder.
1c3) For the video being accessed, perform the following steps: read the video frames with the video reading function cv2.VideoCapture(); convert each color frame into a grayscale frame with the cv2.cvtColor() function; store the processed video frames into the created video frame folder in time order, e.g., 1.jpg, 2.jpg, and so on.
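A sketch of 1c1)-1c3) using standard OpenCV calls (folder layout as described above; the function name video_to_gray_frames is ours):

```python
import os
import cv2

def video_to_gray_frames(video_path, frame_root="frame"):
    name = os.path.splitext(os.path.basename(video_path))[0]
    out_dir = os.path.join(frame_root, name)   # frame folder named after the video
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    n = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        n += 1
        cv2.imwrite(os.path.join(out_dir, f"{n}.jpg"), gray)  # 1.jpg, 2.jpg, ...
    cap.release()
```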
1d) And (3) cutting the gray video frame:
1d1) Write a frame_to_patch() function to cut small blocks of size 360×240 by spatial position from the grayscale frames in each subfolder of the "frame" folder (see the sketch after 1e));
1d2) save the small blocks of each spatial position as pictures, named in the order of video time frames, i.e., 1.jpg, 2.jpg, and so on; store the pictures into different subfolders under the "patch" folder, each subfolder named "grayscale frame folder name + spatial position number", such as "Horse.avi_1" or "Flower_3"; the "patch" folder serves as the training data set;
1e) the test data set is any randomly selected video, converted into grayscale video and stored in time order into a created folder "test_frame": 1.jpg, 2.jpg, and so on;
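A sketch of the cutting in 1d) (the name frame_to_patch() comes from the text; the tile-traversal details and the discarding of edge remainders are our assumptions):

```python
import os
import cv2

def frame_to_patch(frame_dir, patch_root="patch", pw=360, ph=240):
    video_name = os.path.basename(frame_dir)
    for t, fname in enumerate(sorted(os.listdir(frame_dir),
                                     key=lambda s: int(os.path.splitext(s)[0])), 1):
        img = cv2.imread(os.path.join(frame_dir, fname), cv2.IMREAD_GRAYSCALE)
        h, w = img.shape
        k = 0                                   # spatial position number
        for y in range(0, h - ph + 1, ph):
            for x in range(0, w - pw + 1, pw):
                k += 1
                out_dir = os.path.join(patch_root, f"{video_name}_{k}")
                os.makedirs(out_dir, exist_ok=True)
                cv2.imwrite(os.path.join(out_dir, f"{t}.jpg"),
                            img[y:y + ph, x:x + pw])   # tile named by time index
```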
step 2, writing training and testing codes by using a deep learning framework TensorFlow according to a designed network structure:
2a) Establish a project folder, and create in it the files train.py for training, test.py for testing, config.py for saving the parameter settings, network.py for the network structure, and utils.py for the functions used in the project.
2b) Set reasonable network hyper-parameters in the config.py file, e.g., the number batchsize of frame groups input at each iteration is 16, the number of frames N contained in each frame group is 16, and the number of times epoch that the data set is used during training is 5; write the functions required by the training and testing code, e.g., a function normalizing the input to [-1, 1] and the loss function computing the mean square error, into utils.py.
2c) network.py is written to define the structure of the adopted fully convolutional network, as shown in FIG. 2:
2c1) The input first passes through a three-dimensional convolutional layer for observation. The layer contains 9 three-dimensional convolution kernels of size 3×3×N, thus yielding observations compressed in both the spatial and temporal dimensions with an observation rate of 1/16.
2c2) The observation result serves as the input of a three-dimensional deconvolution layer, which raises the dimension of the low-dimensional observation; the output size of the deconvolution layer equals the input size of the observation convolutional layer, and its kernel size is 3×3×N. This produces an initial recovery result with the same temporal and spatial size as the input.
2c3) The initial recovery result is input into 1 or 2 concatenated "space-time blocks". The structure of the "space-time block" is shown in FIG. 3; it enhances the correlation of the preliminarily reconstructed frames in the time and space dimensions to obtain the final reconstruction result.
In brief, the video is input into a three-dimensional convolutional layer to produce the observation result, which is input into a three-dimensional deconvolution layer for initial recovery and then into the "space-time blocks" for spatio-temporal enhancement, yielding the final reconstructed video result; a code sketch follows.
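Assembling the pieces, a hedged sketch of network.py might look as follows (it assumes the space_time_block function sketched in Embodiment 3 is in scope; the kernel shapes, channel counts, and Keras API are assumptions consistent with the description, not the patent's published code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_network(T=16, N=9, n_blocks=2):
    x_in = layers.Input(shape=(T, None, None, 1))   # any spatial size (fully conv)
    # observation: simultaneous temporal and spatial compression (2c1)
    y = layers.Conv3D(N, (T, 3, 3), strides=(T, 3, 3), padding='valid')(x_in)
    # initial recovery: symmetric deconvolution back to T x H x W (2c2)
    x = layers.Conv3DTranspose(16, (T, 3, 3), strides=(T, 3, 3), padding='valid')(y)
    for _ in range(n_blocks):              # 1 or 2 "space-time blocks" (2c3)
        x = space_time_block(x)            # as sketched in Embodiment 3
    x = layers.BatchNormalization()(x)
    # final conv integrates inter-frame information into 1-channel frames
    x_rec = layers.Conv3D(1, (3, 1, 1), padding='same')(x)
    return tf.keras.Model(x_in, x_rec)
```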
2d) train.py is written to implement the training process of the network:
2d1) Every N acquired frames form a video block; batchsize video blocks serve as the input of the network and are propagated forward through the network defined in 2c) to obtain the reconstruction result.
2d2) With the network input as reference, the reconstruction error between the obtained reconstruction result and the network input is computed as the mean square error; the error is back-propagated and the network parameters are updated, completing one training pass of the network.
2d3) Judge whether epoch passes over the training data set have been completed; if so, training ends, otherwise repeat 2d1)-2d2).
2d4) The final parameter model is saved for model testing.
2e) test.py is written to implement the test process:
2e1) Load the saved network model and process each test video according to 2e2)-2e4).
2e2) Sequentially acquire video frames from the test video frame folder and accumulate them until the frame count reaches N, i.e., one video block; input the block into the network defined in 2c) to obtain a reconstructed video block. Zero-fill the last video frames whose number is less than N, splice them into a video block, and input it into the network defined in 2c) to obtain the reconstruction result. Record the time taken for each reconstruction.
2e3) Whenever a reconstructed video block is obtained in 2e2), immediately write it to a video file with the same name as the test video to realize real-time reconstruction; for the reconstruction result of the last, zero-padded video block, write only the first few non-zero-padded frames to the video file.
2e4) Save each frame of the reconstructed video blocks of 2e2) as a picture in time order, e.g., 1.jpg, 2.jpg, and so on. Compute and store the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of each reconstructed frame against the corresponding input frame.
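The per-frame metrics of 2e4) can be computed as in this sketch (assuming scikit-image is available for SSIM; PSNR follows directly from its definition over 8-bit pixels):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_metrics(ref, rec):           # ref, rec: HxW arrays with values in [0, 255]
    diff = ref.astype(np.float64) - rec.astype(np.float64)
    mse = np.mean(diff ** 2)
    psnr = 10 * np.log10(255.0 ** 2 / mse)   # peak signal-to-noise ratio in dB
    return psnr, ssim(ref, rec, data_range=255)
```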
Step 3, training the network
3a) Replace the training data set path in the training code with the path of the "patch" grayscale video block folder.
3b) Load the network.py file of the network structure and the config.py file of the network settings.
3c) Execute train.py under the project folder until training finishes, obtaining the final parameter model.
Step 4, testing the network
4a) Replace the input video path in the test code with the path of the "test_frame" test video folder.
4b) Load the network.py file of the network structure and the final parameter model.
4c) Execute test.py under the project folder to obtain the reconstructed videos.
The invention is based on a fully convolutional neural network, uses few parameters, and has an extremely short forward-propagation time, so real-time reconstruction is achievable; moreover, the "space-time blocks" in the reconstruction part of the network of the space-time video compressed sensing method enhance spatio-temporal correlation, and combined with the obtained observations of strong spatio-temporal balance they greatly improve the video reconstruction effect, yielding high-quality reconstructed video.
The technical effects of the present invention are explained below through simulation and its data.
Example 8
The spatio-temporal video compressed sensing method based on the convolutional network is implemented as in Embodiments 1-7.
Test conditions:
One network is formed by connecting one "space-time block" in the reconstruction part of the network of the space-time video compressed sensing method, with the other network parameters as described above.
A second network is formed by connecting two "space-time blocks" in the reconstruction part, with the remaining network parameters as described above.
The two networks are trained respectively to obtain the trained networks.
Test experiment content:
"Walk" and "foliage" from Vid4 are selected as test data sets and fed to the two networks trained by the invention to obtain reconstructed videos; FIG. 4 shows a reconstructed video frame of an arbitrary frame from each of the two test videos. In FIG. 4, the first row shows the original input video frames of "walk" and "foliage"; the second row shows the frames reconstructed from the "walk" and "foliage" input frames by the network of the spatio-temporal video compressed sensing method with only one "space-time block" connected; and the third row shows the frames reconstructed by the network with two "space-time blocks" connected.
As can be seen from fig. 4, the grayscale videos have a good visual reconstruction effect after passing through the networks of the spatio-temporal video compressed sensing method. Comparing the second and third rows of fig. 4, the network with one spatio-temporal block can already restore the main content of the original video frame, while the network with two spatio-temporal blocks makes the reconstruction result clearer, which further proves that the spatio-temporal block of the present invention improves the reconstruction effect.
In a second test experiment, the grayscale videos of 'Walk' and 'Foliage' are respectively input into the networks connected with one spatio-temporal block and with two spatio-temporal blocks to obtain reconstructed videos, and the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the first 32 frames of each reconstructed video are calculated; the results are shown in Table 1:
table one: PSNR/SSIM for reconstructing video frames
A block Walk Foliage Two blocks Walk Foliage
Frame 1 25.39/0.809 23.05/0.664 Frame 1 26.07/0.817 23.81/0.716
Frame 2 25.87/0.841 24.39/0.742 Frame 2 26.34/0.839 24.90/0.769
Frame 3 25.96/0.849 24.87/0.766 Frame 3 26.64/0.854 25.02/0.779
Mean value of 25.25/0.839 23.77/0.690 Mean value of 25.80/0.844 24.10/0.722
The subjective visual comparison of reconstructed frames in fig. 4 shows that the spatio-temporal block enhances the spatio-temporal correlation and yields a good video reconstruction effect. The image-index results in Table 1 further illustrate the reconstruction quality of the invention from a quantitative perspective. As can be seen from Table 1, each frame of the video reconstructed by the network of the spatio-temporal video compressed sensing method has good PSNR and SSIM, and the values of different frames are close to each other. This demonstrates not only good reconstruction quality but also good spatio-temporal balance of the result, ensuring that the quality of every reconstructed frame is stable.
The average per-frame reconstruction time over the first 32 frames of the reconstructed 'Walk' and 'Foliage' videos is calculated for the networks with one spatio-temporal block and with two spatio-temporal blocks, respectively:
the network with one spatio-temporal block averages 0.03-0.04 s per frame;
the network with two spatio-temporal blocks averages 0.05-0.06 s per frame.
Therefore, the network of the spatio-temporal video compressed sensing method reconstructs video quickly enough for real-time operation; since each additional spatio-temporal block increases the reconstruction time, spatio-temporal blocks should be added only as far as real-time performance is preserved.
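The per-frame timing above could be measured with a loop of the following form; this is a sketch of the measurement only, and the 0.03-0.06 s figures are the patent's own experimental results, not outputs of this code.

import time
import torch

def mean_time_per_frame(model, blocks, frames_per_block=16, device="cuda"):
    model = model.to(device).eval()
    with torch.no_grad():
        start = time.perf_counter()
        for x in blocks:                     # each x: a (1, 1, 16, H, W) input tensor
            model(x.to(device))
        if device == "cuda":
            torch.cuda.synchronize()         # flush queued GPU work before stopping the clock
        elapsed = time.perf_counter() - start
    return elapsed / (len(blocks) * frames_per_block)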
In short, the invention discloses a convolutional-network-based spatio-temporal video compressed sensing method, which mainly solves the problems of poor spatio-temporal balance in video compression and poor real-time performance of video reconstruction in the prior art. The scheme is as follows: 1) prepare a training data set; 2) design the network structure of the spatio-temporal video compressed sensing method; 3) write training and testing files according to the designed network structure; 4) train the network; 5) test the network. The network adopts an observation technique that compresses space and time simultaneously and a reconstruction technique that enhances the spatio-temporal correlation with spatio-temporal blocks. It not only realizes real-time video reconstruction, but also produces results with strong spatio-temporal balance and high, stable reconstruction quality, and can be used for compressed video transmission and subsequent video reconstruction.

Claims (3)

1. A spatio-temporal video compressed sensing method based on a convolutional network is characterized by comprising the following steps:
1) preparing a training data set: downloading videos of the required different resolutions, preprocessing all downloaded videos by converting them in turn into grayscale video frames, cutting each frame into small blocks according to spatial position, saving the blocks of each spatial position as pictures in a separate subfolder, naming the pictures in the order of the video frames, i.e. 1.jpg, 2.jpg, and so on, and finally gathering all subfolders into one folder that serves as the training data set (a data-preparation sketch in code follows this claim); the test data set is any randomly selected video, converted to a grayscale video and stored in a folder;
2) designing the network structure of the spatio-temporal video compressed sensing method:
the network structure comprises an observation part and a reconstruction part, wherein the observation part feeds the input video block into a three-dimensional convolution layer whose output serves as the observation, and the reconstruction part sequentially connects the observation output to a three-dimensional deconvolution layer, one or more 'spatio-temporal blocks' for reconstructing the video in time, a BN layer and a three-dimensional convolution layer to obtain the reconstructed video block; each 'spatio-temporal block' is formed by connecting four three-dimensional convolution layers in series, adding a BN layer in front of each three-dimensional convolution layer, and then connecting the output end of the first three-dimensional convolution layer to the output end of the fourth through a residual connection;
3) writing training and testing files according to the designed network structure:
3a) establishing a project folder, in which program files for training, testing, the network structure, the network settings and utility functions are respectively created;
3b) setting and writing the related file contents: setting reasonable network hyper-parameters in the network setting file, writing the functions required by the training and testing code in the function file, and writing the network structure of the spatio-temporal video compressed sensing method in the network structure file;
3c) defining the training process of the network in the training file: taking 16 consecutive video frames at a time from a subfolder of the training data set to form a video block that serves as the input video block of the network of the spatio-temporal video compressed sensing method, computing the mean-square-error reconstruction loss between the reconstructed video block output by the network and the input video block, and back-propagating the reconstruction loss to update the network parameters;
3d) defining the testing process of the network in the test file: converting the grayscale video of the test data set into video frames, dividing the grayscale frames into consecutive groups of 16 frames in temporal order, inputting each group of 16 consecutive frames as an input video block into the network of the spatio-temporal video compressed sensing method to obtain a corresponding reconstructed video block containing the same number of frames as the input video block, and arranging the reconstructed video blocks in temporal order to obtain the reconstructed video, which is the test result of the network;
4) training the network of the spatio-temporal video compressed sensing method: loading the network structure and related parameters from the network structure file, initializing all parameters, loading the hyper-parameters from the network setting file, and, using a stochastic gradient descent algorithm, repeatedly training the network and updating its parameters with the training data set according to the training process defined in the training file, the final parameter model being obtained when training finishes;
5) testing the network of the spatio-temporal video compressed sensing method: loading the network structure from the network structure file, loading the final parameter model as the network parameters, and, following the test process defined in the test file, testing the network with the test data set to obtain a real-time, high-quality, highly spatio-temporally balanced reconstructed test video as the result of the spatio-temporal video compressed sensing.
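As flagged in step 1) above, the following is a hedged sketch of the data preparation, assuming OpenCV is available; the patch size, the output folder name 'patch' and the subfolder naming scheme are illustrative choices, not values fixed by the claim.

import os
import cv2

def prepare_patches(video_path, out_root="patch", size=96):
    # convert a video to grayscale frames and cut each frame into spatial blocks;
    # blocks at one spatial position share a subfolder, named 1.jpg, 2.jpg, ...
    cap = cv2.VideoCapture(video_path)
    t = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        h, w = gray.shape
        for i in range(0, h - size + 1, size):
            for j in range(0, w - size + 1, size):
                sub = os.path.join(out_root, f"{os.path.basename(video_path)}_{i}_{j}")
                os.makedirs(sub, exist_ok=True)
                cv2.imwrite(os.path.join(sub, f"{t}.jpg"), gray[i:i + size, j:j + size])
    cap.release()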
2. The spatio-temporal video compressed sensing method based on a convolutional network as claimed in claim 1, wherein designing the network structure of the spatio-temporal video compressed sensing method in step 2) comprises the following steps:
2a) three-dimensional convolution layer setup of the observation part: the convolution kernel of the three-dimensional convolution layer is set to T×3×3, where T (= 16) is the kernel size in the time dimension and 3×3 is the size in the space dimension; no zero padding is used in the convolution, and the stride is 3; the input video block has size T×H×W, where T is the number of frames in the block, H×W is the spatial size of each frame, and H and W are multiples of 3; with 1 convolution kernel, the spatial compression ratio is 9 and the temporal compression ratio is T, giving an observation rate R = 1/(9T); with the number of convolution kernels set to N, the observation rate is R = N/(9T); when N = 9 and T = 16, the observation rate is R = 9/(9×16) = 1/16 (an illustrative code sketch of this observation layer follows this claim);
2b) three-dimensional deconvolution layer setup of the reconstruction part: the convolution kernels of the three-dimensional deconvolution layer are set according to the symmetry of convolution and deconvolution; when the kernel size in the deconvolution is the same as in the corresponding convolution, the deconvolution output has the same size as the convolution input, so the kernels of the three-dimensional deconvolution layer are set the same as those of the three-dimensional convolution layer of the observation part, i.e. size T×3×3, number N, no zero padding in the convolution, and stride 3;
2c) design of the 'spatio-temporal block' of the reconstruction part: each 'spatio-temporal block' is formed by sequentially connecting four three-dimensional convolution layers in series, with a BN layer added in front of each three-dimensional convolution layer; the input end of the second three-dimensional convolution layer is connected to the output end of the fourth in a residual manner, so that the second, third and fourth three-dimensional convolution layers form a residual block; each 'spatio-temporal block' is thus a three-dimensional convolution layer and a residual block connected in series;
2d) three-dimensional convolution layer setup of the reconstruction part: the final three-dimensional convolution layer of the reconstruction part has kernel size 16×1, 16 kernels, stride 1, and no zero padding in the convolution.
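As noted in 2a) above, the observation layer can be sketched as a single strided three-dimensional convolution. The sketch below assumes PyTorch and uses a temporal stride equal to T, which is one consistent choice since the kernel already spans the whole block in time; the 96×96 input size is only an example with H and W multiples of 3.

import torch
import torch.nn as nn

T, N = 16, 9
observe = nn.Conv3d(in_channels=1, out_channels=N,
                    kernel_size=(T, 3, 3), stride=(T, 3, 3), padding=0)

x = torch.randn(1, 1, T, 96, 96)   # one 16-frame, 96x96 grayscale video block
y = observe(x)                     # -> (1, 9, 1, 32, 32): time compressed by T, space by 9
rate = N / (9 * T)                 # observation rate R = N/(9T) = 1/16 for N = 9, T = 16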
3. The convolutional-network-based spatio-temporal video compressed sensing method as claimed in claim 2, wherein connecting a three-dimensional convolution layer and a residual block in series to form each 'spatio-temporal block' in step 2c) comprises the following steps:
2c1) three-dimensional convolution layer setup within each 'spatio-temporal block': the kernel size of the three-dimensional convolution layer is 16×1, the number is 16, no zero padding is used in the convolution, and the stride is 1;
2c2) residual block setup within each 'spatio-temporal block': the residual block comprises three three-dimensional convolution layers with kernel sizes 16×3, 64×1 and 32×3 and kernel numbers 64, 32 and 16, respectively; no zero padding is used in the convolutions, and the stride is 1; the input end of the residual block is connected to its output end for a summation calculation, and a Tanh activation layer is added after the summation;
2c3) a BN layer is added before each three-dimensional convolution layer of 2c1) and 2c2) to accelerate convergence, followed by a PReLU to enhance the nonlinear capability of the network;
2c4) a plurality of 'spatio-temporal blocks' may be cascaded, i.e. the output of one 'spatio-temporal block' serves as the input of the next, to expand the network capacity; the structure of each block is the same.
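A hedged PyTorch sketch of one 'spatio-temporal block' as described in 2c1)-2c4) follows. The channel counts (16 -> 64 -> 32 -> 16), the 1×1×1 kernel of the leading convolution, the BN-convolution-PReLU ordering, and the 'same' padding on the 3×3×3 kernels (needed so the residual sum is shape-consistent) are one plausible reading of the garbled translation, not the patent's exact settings.

import torch
import torch.nn as nn

def bn_conv(in_ch, out_ch, k, pad):
    # a BN layer before each 3D convolution (2c3), followed by a PReLU
    return nn.Sequential(nn.BatchNorm3d(in_ch),
                         nn.Conv3d(in_ch, out_ch, kernel_size=k, padding=pad),
                         nn.PReLU())

class SpatioTemporalBlock(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.head = bn_conv(ch, ch, k=1, pad=0)   # the leading 3D convolution of 2c1)
        self.res = nn.Sequential(                 # the three-convolution residual block of 2c2)
            bn_conv(ch, 64, k=3, pad=1),
            bn_conv(64, 32, k=1, pad=0),
            bn_conv(32, ch, k=3, pad=1))
        self.act = nn.Tanh()                      # Tanh activation after the residual sum (2c2)

    def forward(self, x):
        h = self.head(x)
        return self.act(h + self.res(h))          # residual connection from the block's input to its output

Blocks built this way can be cascaded as in 2c4), e.g. nn.Sequential(SpatioTemporalBlock(), SpatioTemporalBlock()) for the two-block network of Example 8.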
CN201810777563.XA 2018-07-16 2018-07-16 Space-time video compressed sensing method based on convolutional network Active CN108923984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810777563.XA CN108923984B (en) 2018-07-16 2018-07-16 Space-time video compressed sensing method based on convolutional network

Publications (2)

Publication Number Publication Date
CN108923984A CN108923984A (en) 2018-11-30
CN108923984B true CN108923984B (en) 2021-01-12

Family

ID=64411851

Country Status (1)

Country Link
CN (1) CN108923984B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020119891A1 (en) * 2018-12-11 2020-06-18 Telefonaktiebolaget Lm Ericsson (Publ) Technique for user plane traffic quality analysis
CN109859120B (en) * 2019-01-08 2021-03-02 北京交通大学 Image defogging method based on multi-scale residual error network
CN109819256B (en) * 2019-03-06 2022-07-26 西安电子科技大学 Video compression sensing method based on feature sensing
CN110059823A (en) * 2019-04-28 2019-07-26 中国科学技术大学 Deep neural network model compression method and device
CN110166779B (en) * 2019-05-23 2021-06-08 西安电子科技大学 Video compression method based on super-resolution reconstruction
CN110503609B (en) * 2019-07-15 2023-04-28 电子科技大学 Image rain removing method based on hybrid perception model
CN112866763B (en) * 2020-12-28 2023-05-26 网宿科技股份有限公司 Sequence number generation method, server and storage medium of HLS multi-code rate stream slice
CN117292209B (en) * 2023-11-27 2024-04-05 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1938590A2 (en) * 2005-10-17 2008-07-02 QUALCOMM Incorporated Method and apparatus for spatio-temporal deinterlacing aided by motion compensation for field-based video
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN106911930A (en) * 2017-03-03 2017-06-30 深圳市唯特视科技有限公司 It is a kind of that the method for perceiving video reconstruction is compressed based on recursive convolution neutral net

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Compressed-domain video classification with deep neural networks: 'There's way too much information to decode the matrix'"; Aaron Chadha et al.; 2017 IEEE International Conference on Image Processing (ICIP); 20170920; 1832-1836 *
"Compressed video post-processing based on convolutional neural networks" (in Chinese); Hou Jingxuan; China Excellent Master's Theses Full-text Database; 20170615; I138-1341 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant