CN108923984A - Space-time video compressive sensing method based on convolutional network - Google Patents
- Publication number
- CN108923984A (publication); application CN201810777563.XA / CN201810777563A
- Authority
- CN
- China
- Prior art keywords
- space
- video
- time
- network
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/04—Network management architectures or arrangements
- H04L41/044—Network management architectures or arrangements comprising hierarchical management structures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0823—Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/149—Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/154—Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Physics & Mathematics (AREA)
- Algebra (AREA)
- Physics & Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Environmental & Geological Engineering (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Closed-Circuit Television Systems (AREA)
Abstract
The invention discloses a space-time video compressive sensing method based on a convolutional network, which mainly addresses two shortcomings of the prior art: poor space-time balance of video compression and poor real-time performance of video reconstruction. The scheme is: prepare a training data set; design the network structure of the space-time video compressive sensing method; write training and test files according to the designed network structure; train the network; test the network. The network combines an observation technique that compresses in space and time simultaneously with a reconstruction technique that uses "space-time blocks" to enhance temporal-spatial correlation. It not only achieves real-time video reconstruction, but the reconstructed result also has stronger space-time balance and high, stable reconstruction quality. The method can be used for compressed transmission of video and subsequent video reconstruction.
Description
Technical field
The invention belongs to the technical field of video processing and relates generally to video compressive sensing, specifically to a space-time video compressive sensing method based on a convolutional network, which can be used to realize real-time, high-quality video compressive sensing reconstruction.
Background technique
Compressed sensing (CS) is a signal compression-sampling theory: a signal can be sampled below the Nyquist rate and the original signal recovered by a reconstruction algorithm. The theory has been successfully applied in various fields of signal processing, such as medical imaging and radar imaging. After the appearance and popularization of hardware such as the single-pixel camera, compressed sensing was applied to still-image compression and showed excellent potential. Today compressed sensing is no longer limited to still images but has been generalized to video. Compared with still images, video compression must also consider correlation along the temporal dimension, so processing video with compressed-sensing theory, i.e. video compressive sensing (VCS), is more complex.
Methods that apply compressed-sensing theory to video fall broadly into spatial video compressive sensing and temporal video compressive sensing, whose observation processes are realized with a spatial multiplexing camera (SMC) and a temporal multiplexing camera (TMC), respectively. In spatial VCS the input video is observed and reconstructed frame by frame, while in temporal VCS multiple consecutive input frames are observed and reconstructed jointly. In recent years many methods have realized video compressive sensing and obtained good video reconstruction results. However, because the iterative-optimization algorithms they adopt have very high time complexity, they cannot reconstruct video in real time; their real-time performance is too poor to meet practical demands.
With the development of deep-learning technology, deep neural networks (DNNs) have been widely applied in image and video processing, such as CS, super-resolution, rain removal, denoising and inpainting, with remarkable results. A DNN is trained offline and tested online: once training is complete, testing only requires a forward pass, so the reconstruction time is greatly shortened compared with conventional methods.
In the method for the existing VCS based on DNN, some carries out the reconstruct of time compressed sensing using fully-connected network,
Cause time complexity high because parameter amount is excessive, cannot achieve Real-time Reconstruction;Some carries out space pressure using convolutional neural networks
Contracting sensing reconstructing enhances video inter-frame relation on the basis of just restoring video to obtain better video reconstruction effect.Due to
These methods are all only compressed in the single dimension in space (or time), also referred to as observe, the observation obtained is caused to be pressed
Resolution ratio in the dimension of contracting is very low.So that reconstruction result is between the pixel in compression dimension when carrying out video reconstruction
Correlation is insufficient, is difficult to be resumed by the information of compression dimension, to reduce the result of entire video reconstruction.
Summary of the invention
The object of the invention, in view of the above shortcomings of the prior art, is to propose a space-time video compressive sensing method based on a convolutional network with balanced spatial and temporal resolution, better reconstruction performance, and stable reconstructed video results.
The present invention is a space-time video compressive sensing method based on a convolutional network, characterized by the following steps:
1) Prepare the training data set: download videos of different resolutions and pre-process all of them. Each downloaded video is converted to grayscale video frames, each frame is cut spatially into small blocks of a fixed size, the block at each spatial position is saved as a picture in its own sub-folder, and the pictures are named in temporal order, i.e. 1.jpg, 2.jpg, and so on. All sub-folders together form one top-level folder, which serves as the training data set. The test data set is any randomly selected video, converted to grayscale and stored in a folder.
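The data-preparation step above can be sketched as follows. This is a minimal NumPy-only illustration: the block size of 96 (a multiple of 3, as the observation layer later requires) and the synthetic frame are assumptions for illustration, standing in for real decoded grayscale video frames; actual file I/O (saving each tile as n.jpg in its sub-folder) is omitted.

```python
import numpy as np

def crop_into_blocks(frame, block=96):
    """Cut one grayscale frame into non-overlapping block x block tiles.

    The (row, col) tile index would name the sub-folder; the frame's
    position in the video would give the file name (1.jpg, 2.jpg, ...).
    """
    h, w = frame.shape
    tiles = {}
    for i in range(h // block):
        for j in range(w // block):
            tiles[(i, j)] = frame[i * block:(i + 1) * block,
                                  j * block:(j + 1) * block]
    return tiles

# A synthetic 288 x 384 gray frame stands in for a decoded video frame.
frame = np.arange(288 * 384, dtype=np.uint8).reshape(288, 384)
tiles = crop_into_blocks(frame, block=96)
```

Each spatial position thus accumulates one picture per video frame, and the tiles for one position form one training sample folder.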
2) Design the network structure of the space-time video compressive sensing method: the network consists of an observation part and a reconstruction part. The observation part feeds the input video block through one three-dimensional convolution layer, whose output is the observation. The reconstruction part passes the observation through, in series, one three-dimensional deconvolution layer, several "space-time blocks", one BN layer and one three-dimensional convolution layer to obtain the reconstructed video block. Each "space-time block" is four three-dimensional convolution layers in series, with a BN layer added before each convolution layer, and a residual connection from the output of the first convolution layer to the output of the fourth.
3) Write training and test files according to the designed network structure:
3a) Establish a project folder, and create program files in it for training, testing, the network structure, network settings, functions, etc.
3b) Set and write the associated file contents: set reasonable network hyper-parameters in the network configuration file; write the functions required by the training and test code in the function file; write the network structure in the network-structure file according to the space-time video compressive sensing method.
3c) Define the training process in the training file: take several consecutive video frames in order from a sub-folder of the training data set to form a video block, which is the network input of the space-time video compressive sensing method; compute the mean-squared-error reconstruction loss between the reconstructed video block output by the network and the input video block; back-propagate the reconstruction loss to update the network parameters.
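The reconstruction loss of step 3c) can be sketched as follows — a minimal NumPy version using the mean-of-squared-differences convention, with synthetic blocks standing in for the network input and output (the actual blocks would come from the network's forward pass):

```python
import numpy as np

def mse_reconstruction_loss(x, x_hat):
    """Mean squared error between an input video block and its
    reconstruction, averaged over every pixel of every frame."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    return float(np.mean((x - x_hat) ** 2))

x = np.zeros((16, 96, 96))            # input video block (T x H x W)
x_hat = np.full((16, 96, 96), 0.5)    # stand-in reconstructed block
loss = mse_reconstruction_loss(x, x_hat)
```

A perfect reconstruction gives a loss of zero; here every pixel is off by 0.5, so the loss is 0.25.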
3d) Define the test process in the test file: convert the grayscale video of the test data set into video frames, divide the gray frames in temporal order into many groups of consecutive frames, and feed each group as one input video block into the network of the space-time video compressive sensing method to obtain the corresponding reconstructed video block, which contains the same number of frames as the input block. Arrange the reconstructed blocks in temporal order to obtain the reconstructed video, which is the test result of the network of the space-time video compressive sensing method.
4) Train the network of the space-time video compressive sensing method: load the network structure and its parameters from the network-structure file, initialize all parameters, load the hyper-parameters from the network configuration file, and, using stochastic gradient descent and the training process defined in the training file, repeatedly train on the training data set and update the parameters; training ends with the final parameter model.
5) Test the network of the space-time video compressive sensing method: load the network structure from the network-structure file, load the final parameter model as the network parameters, and run the test process defined in the test file on the test data set to obtain real-time, high-quality, highly space-time-balanced reconstructed test videos.
The observation part of the network designed by the invention uses a three-dimensional convolution layer, which yields observations with balanced spatial and temporal resolution. The reconstruction part uses a three-dimensional deconvolution layer and "space-time blocks", which reduce the number of network parameters, remove blocking artifacts, and improve the balance between the spatial and temporal dimensions of the reconstructed video, finally realizing real-time, high-quality, highly space-time-balanced video reconstruction.
Compared with the prior art, the present invention has the following advantages:
1. The observation process enhances the balance of space-time compression. In compressed sensing, video observation differs considerably from still-image observation: it must consider not only the intra-frame correlation between pixels in the spatial dimension but also the inter-frame correlation in the temporal dimension. Existing methods observe by compressing only in the temporal or only in the spatial dimension, so the space-time balance of the observation is poor, the reconstruction quality differs greatly between the temporal and spatial dimensions, and the reconstructed video quality suffers. The invention adopts a novel observation technique that compresses in space and time simultaneously, so the space-time balance of the observation is stronger and better reconstructed video quality can be obtained.
2. High-quality video reconstruction can be performed in real time. Existing video observation methods ignore space-time balance in the observation process, so reconstructed video quality is poor. In addition, traditional VCS methods all use iterative-optimization algorithms, whose computational complexity is so high that real-time video reconstruction is impossible, and some existing neural-network-based video compressive sensing methods use so many network parameters that reconstruction takes too long for real time. The fully convolutional network of the invention uses few parameters and its forward pass is extremely short, so real-time reconstruction is achievable; moreover, the "space-time blocks" in the reconstruction part of the network enhance temporal-spatial correlation, and combined with observations of strong space-time balance this greatly improves the video reconstruction result, yielding high-quality reconstructed video.
3. The stability of the reconstructed video is high. Every frame of the reconstructed video is guaranteed a similar, good reconstruction quality, avoiding the visual discomfort caused by frame-to-frame quality variation in the reconstruction result; video files can thus be compressed and transmitted more soundly, and the integrity of the video is ensured throughout compression and reconstruction.
Detailed description of the invention
Fig. 1 is the flow chart of the space-time video compressive sensing method based on a convolutional network of the invention;
Fig. 2 is the network structure of the space-time video compressive sensing method of the invention;
Fig. 3 is a schematic diagram of the "space-time block" unit in the reconstruction part of the invention;
Fig. 4 shows reconstructed video frames from the test of the invention.
Specific embodiment
The present invention is described in detail below with reference to the drawings and examples.
Embodiment 1
Today compressed sensing is no longer limited to still images but has been generalized to video. Compared with still images, video compression must also consider correlation in the temporal dimension, so processing video with compressed-sensing (VCS) theory is more complex. Video compressive sensing compression-samples the video, reducing storage space and greatly increasing transmission speed, and the video reconstructed from the transmitted data can serve more complex tasks, such as target detection and tracking. Existing methods compress (i.e. observe) only in a single dimension, space or time, so the resolution of the observation in the compressed dimension is very low; during reconstruction the correlation between pixels in the compressed dimension is insufficient, the compressed information is hard to recover, the result of the whole video reconstruction is degraded, and the reconstructed video quality is unstable. Addressing this, the invention explores a better approach and proposes a space-time video compressive sensing method based on a convolutional network, which, referring to Fig. 1, includes the following steps:
1) Prepare the training data set: download videos of different resolutions; to simplify the training process, pre-process all downloaded videos. Convert each video to grayscale video frames, cut each frame spatially into small blocks of a fixed size, save the block at each spatial position as a picture in its own sub-folder, and name the pictures in temporal order, i.e. 1.jpg, 2.jpg, and so on, where n.jpg comes from the n-th frame of the video. All sub-folders together form one top-level folder, which serves as the training data set. The test data set is any randomly selected video, likewise converted to grayscale frames and stored in a folder.
2) Design the network structure of the space-time video compressive sensing method: the network consists of an observation part and a reconstruction part. The observation part feeds the input video block through one three-dimensional convolution layer, whose output is the observation. The three-dimensional convolution layer convolves in the spatial and temporal dimensions simultaneously; by separately adjusting the compression ratios of its convolution kernel in space and in time, the observed information can be reasonably allocated between the spatial and temporal dimensions, yielding an observation with balanced spatio-temporal resolution. Existing methods often use a two-dimensional convolution layer to carry out the same spatial observation in each frame, which cannot extract the correlation along the temporal dimension and leads to a poor reconstruction in that dimension. The reconstruction part passes the observation through, in series, one three-dimensional deconvolution layer, several "space-time blocks", one BN layer and one three-dimensional convolution layer to obtain the reconstructed video block. Each "space-time block" is four three-dimensional convolution layers in series, with a BN layer added before each convolution layer, and a residual connection from the output of the first convolution layer to the output of the fourth. The three-dimensional deconvolution layer corresponds to the observation convolution: it raises the dimensionality of the observed information to obtain an initial restoration; the purpose of the dimension raising is to map the solution space of the observation onto the solution space of the video, so that the subsequent "space-time blocks" can add more detail to the initial restoration. The residual connections in the "space-time blocks" prevent gradient divergence as the network deepens while adding detail to the video, enhancing the temporal-spatial correlation of the initial restoration and safeguarding the recovery of the reconstructed video; a final three-dimensional convolution layer then plays an integrating role, further enhancing inter-frame relations.
3) Write training and test files according to the designed network structure:
3a) Establish a project folder, and create program files in it for training, testing, the network structure, network settings, functions, etc.
3b) Set and write the associated file contents: set reasonable network hyper-parameters, including batch size, number of epochs, etc., in the network configuration file; write the functions required by the training and test code in the function file; write the network structure in the network-structure file according to the space-time video compressive sensing method.
3c) Define the training process in the training file: take t video frames in order from a sub-folder of the training data set to form a video block, the network input of the space-time video compressive sensing method. The value of t is positively correlated with the temporal compression ratio of the observation part of the network, and generally should be adjusted together with the spatial compression ratio inside the observation part so that the temporal and spatial compression ratios reach a relatively balanced proportion. The input video block is forward-propagated through the network to obtain the reconstructed video block output by the network; the reconstruction loss is the mean squared error between the input block and the reconstructed block, computed from the per-pixel differences; the loss is back-propagated and the network parameters are updated with a gradient-descent algorithm. This completes one training iteration; when the training data set reaches its usage limit, training ends, otherwise training is repeated.
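The training loop of step 3c) — forward pass, loss, gradient step, repeated until the data-usage limit — can be illustrated on a toy model. This is only a sketch: a linear model with plain (full-batch) gradient descent stands in for the convolutional network and its stochastic-gradient optimizer, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training set": the linear map y = X @ true_w stands in for
# the video-block -> reconstruction relationship the network must learn.
X = rng.normal(size=(64, 8))
true_w = rng.normal(size=8)
y = X @ true_w

w = np.zeros(8)                 # initialized parameters
lr, max_epochs = 0.05, 500      # hyper-parameters (illustrative values)

for epoch in range(max_epochs): # "usage limit" of the training data
    y_hat = X @ w                             # forward propagation
    grad = 2.0 * X.T @ (y_hat - y) / len(X)   # gradient of the MSE loss
    w -= lr * grad                            # parameter update

final_loss = float(np.mean((X @ w - y) ** 2))
```

The loss decreases toward zero as the parameters converge, mirroring how the network's reconstruction loss is driven down over repeated passes through the training data.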
3d) Define the test process in the test file: to reconstruct full frames, the test video need not be cut after conversion to gray frames. All gray frames of the test data set are divided in temporal order into groups of consecutive frames: starting from the first frame, every t frames form one video block, with zero padding where frames run short. Each group of t consecutive frames is fed as one input video block into the network of the space-time video compressive sensing method, which outputs a reconstructed video block containing the same number of frames as the input block. For the last block, only the frames that are not zero padding are kept as its reconstruction result. The reconstructed blocks, arranged in temporal order, form the reconstructed video, which is the test result of the network of the space-time video compressive sensing method.
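The block-splitting and reassembly of step 3d) can be sketched as follows; NumPy only, with t = 16 as in Embodiment 2 and a small synthetic video standing in for the real test frames (the network forward pass between splitting and reassembly is omitted):

```python
import numpy as np

def split_into_blocks(frames, t=16):
    """Split a (num_frames, H, W) gray video into consecutive t-frame
    blocks, zero-padding the last block when num_frames is not a
    multiple of t. Returns the blocks and the true frame count."""
    n, h, w = frames.shape
    n_blocks = -(-n // t)                       # ceiling division
    padded = np.zeros((n_blocks * t, h, w), dtype=frames.dtype)
    padded[:n] = frames
    return padded.reshape(n_blocks, t, h, w), n

def reassemble(blocks, n_frames):
    """Concatenate reconstructed blocks in time order, dropping the
    zero-padded frames of the last block."""
    _, t, h, w = blocks.shape
    return blocks.reshape(-1, h, w)[:n_frames]

video = np.ones((37, 8, 8), dtype=np.float32)   # 37 frames -> 3 blocks
blocks, n = split_into_blocks(video, t=16)
restored = reassemble(blocks, n)
```

With 37 frames and t = 16, the third block holds 5 real frames and 11 zero-padded frames, which reassembly discards.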
4) Train the network of the space-time video compressive sensing method: load the network structure and its parameters from the network-structure file, initialize all parameters, load the hyper-parameters from the network configuration file, and, using stochastic gradient descent and the training process defined in the training file, repeatedly train on the training data set of the invention and update the parameters; training ends with the final parameter model.
5) Test the network of the space-time video compressive sensing method: load the network structure from the network-structure file, load the final parameter model as the network parameters, and run the test process defined in the test file on the test data set to obtain real-time, high-quality, highly space-time-balanced reconstructed test videos.
Existing video observation methods usually compression-sample only in the temporal or only in the spatial dimension, ignoring space-time balance, so the reconstructed video quality is poor. In addition, traditional VCS methods all use traditional iterative-optimization algorithms with high time complexity, which makes real-time video reconstruction impossible, and many existing neural-network-based video compressive sensing methods use large numbers of network parameters, so reconstruction takes too long and likewise cannot run in real time. The space-time video compressive sensing method based on a convolutional network used in the invention has few parameters and a short forward pass, so real-time video reconstruction is achievable. At the same time, the "space-time blocks" in the reconstruction part of the network enhance the temporal-spatial correlation of the reconstructed video while restoring the video information; combined with observations of strong space-time balance, this significantly improves the video reconstruction result, yielding high-quality, stable reconstructed video.
Embodiment 2
The space-time video compressive sensing method based on a convolutional network is as in Embodiment 1. The design of the network structure of the space-time video compressive sensing method described in step 2), referring to Fig. 2, includes the following steps:
2a) Setting of the three-dimensional convolution layer of the observation part of the space-time video compressive sensing method based on a convolutional network: the size of the convolution kernel of the three-dimensional convolution layer is set to T × 3 × 3, where T = 16 is the size of the kernel in the temporal dimension and 3 × 3 is its size in the spatial dimensions; no zero padding is used in the convolution and the stride is 3. The input video block has size T × H × W, where T is the number of frames in the block, H × W is the spatial size of each frame, and H and W are multiples of 3. With one convolution kernel the spatial compression ratio is 9 and the temporal compression ratio is T, so the observation rate is 1/(9T); with N kernels the observation rate is N/(9T). When N = 9 and T = 16 the observation rate is 9/(9 × 16) = 1/16. This observation mode therefore yields an observation that is compressed in both the temporal and spatial dimensions with balanced spatio-temporal resolution, which helps the reconstruction part recover the information of both dimensions well.
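The observation rate of step 2a) reduces to a one-line formula, sketched here as a small helper (the name `observation_rate` is illustrative, not from the patent):

```python
def observation_rate(n_kernels, t, spatial_ratio=9):
    """Observation (sampling) rate of the 3-D convolution observation
    layer: N measurements per 9*T input pixels, for a T x 3 x 3 kernel
    with spatial stride 3 and no zero padding."""
    return n_kernels / (spatial_ratio * t)

rate = observation_rate(9, 16)   # N = 9 kernels, T = 16 frames
```

With N = 9 and T = 16 this gives 9/(9 × 16) = 1/16, matching the rate stated above; with a single kernel it falls to 1/144.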
2b) Setting of the three-dimensional deconvolution layer of the reconstruction part, to which the observation output is first connected: the convolution kernel of the deconvolution layer must respect the symmetry between convolution and deconvolution. When the kernel size of the deconvolution process matches the kernel size of the corresponding convolution process, the deconvolution output has exactly the size of the convolution input; the kernel of the three-dimensional deconvolution layer is therefore set identically to that of the observation convolution layer, i.e. size T × 3 × 3, N kernels, no zero padding, stride 3.
2c) Design of the "space-time blocks" of the reconstruction part, several of which follow the three-dimensional transposed convolution layer: each "space-time block" consists of four three-dimensional convolution layers connected in series, with a BN layer added before each three-dimensional convolution layer. The input of the second three-dimensional convolution layer is connected by a residual connection to the output of the fourth; this residual connection makes the second, third and fourth three-dimensional convolution layers form a residual block. Each "space-time block" is thus one three-dimensional convolution layer connected in series with one residual block.
2d) Setting of the final three-dimensional convolution layer of the reconstruction part: the kernel of the last three-dimensional convolution layer of the reconstruction part has size 16 × 1 × 1, with 16 kernels, stride 1 and no zero padding. This unpadded three-dimensional convolution layer further integrates and enhances inter-frame information, producing the final reconstructed video frames.
The network structure designed by the present invention for the space-time video compressed sensing method uses a three-dimensional convolution layer in the observation part, which compressively samples the input video block in both the time and space dimensions. Compared with existing methods that sample along only a single dimension (time or space), this obtains spatio-temporally balanced samples; and compared with existing methods that use a Gaussian matrix as the observation matrix, learning the observation by convolution enables more reasonable compressed sampling. The reconstruction part of the network consists of a three-dimensional transposed convolution layer, several "space-time blocks", a BN layer and a three-dimensional convolution layer. The three-dimensional transposed convolution layer is the symmetric operation of the convolution layer: it up-projects the observations back to the same dimensions as the input video block, which facilitates further restoration by the subsequent network, and this symmetry ensures that the final reconstructed video block has the same size as the input video block. The residual connections in the "space-time blocks" add more detail to the reconstructed video, because a residual connection learns the difference between the input and output of the residual block, focusing the network on detail information in the video.
Embodiment 3
The convolutional-network-based space-time video compressed sensing method is as in Embodiments 1-2. Each "space-time block" described in step 2c), consisting of one three-dimensional convolution layer connected in series with one residual block (see Fig. 3), is constructed as follows:
2c1) Setting of the three-dimensional convolution layer in each "space-time block": the kernel size of the three-dimensional convolution layer is 16 × 1 × 1, with 16 kernels, no zero padding and stride 1. Since the spatial size of the kernel is 1 × 1, it integrates the inter-frame information at each spatial position and thus strengthens inter-frame relations.
2c2) Setting of the residual block in each "space-time block": the residual block contains three three-dimensional convolution layers, with kernel sizes 16 × 3 × 3, 64 × 1 × 1 and 32 × 3 × 3 and kernel numbers 64, 32 and 16, respectively; all three use zero padding and stride 1. Since kernels with spatial size 3 × 3 fuse spatio-temporal information and kernels with spatial size 1 × 1 integrate inter-frame information, this setting enhances spatio-temporal information. The input of the residual block is connected to its output for an element-wise sum, and a Tanh activation layer is added after the sum.
2c3) A BN layer is added before each three-dimensional convolution layer in 2c1) and 2c2) to accelerate convergence, followed by a PReLU to enhance the nonlinearity of the network.
2c4) Multiple "space-time blocks" can be cascaded, i.e. the output of one "space-time block" serves as the input of the next, expanding network capacity; each block has the identical structure built in 2c1)-2c3). Depending on the actual situation, the present invention may use one "space-time block" or multiple "space-time blocks". The number of "space-time blocks" trades off against reconstruction time; in other words, when real-time requirements are strict, a single "space-time block" can be used.
In the present invention, each "space-time block" of the reconstruction part of the network of the space-time video compressed sensing method is one three-dimensional convolution layer connected in series with one residual block. The main functional component is the residual block, which consists of three three-dimensional convolution layers in series, with the input of the first three-dimensional convolution layer connected to the output of the third: the input value of the first layer is added to the output value of the third layer, and the sum serves as the input of the next network layer. In this way the residual block only has to learn the difference between its input and output, so small variations between the two are amplified and processed, enabling the network to add detail information to the video as it learns the reconstruction.
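A toy numerical sketch (assumed layers, not the patent's exact convolutions) of the residual connection described above: the inner branch output is added to the block input, so the branch only has to model the difference between input and target.

```python
import numpy as np

# Toy sketch of the residual connection in step 2c2): the branch output is
# summed with the block input, and a Tanh activation follows the sum.
def residual_block(x, branch):
    return np.tanh(x + branch(x))

x = np.array([0.1, -0.2, 0.3])
identity_branch = lambda v: np.zeros_like(v)  # a zero branch leaves only tanh(x)
print(residual_block(x, identity_branch))
```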
Embodiment 4
The convolutional-network-based space-time video compressed sensing method is as in Embodiments 1-3. Setting reasonable network hyperparameters in the network configuration file described in step 3b) includes the following: the number of input video blocks per training step, batchsize, for the network of the space-time video compressed sensing method is set to 20-40; batchsize may be adjusted appropriately according to the depth of the network and the size of the videos in the training dataset. The number of passes over the training dataset during the whole training process, epoch, is set to 3-7 and may likewise be adjusted according to the number of videos in the training dataset.
Embodiment 5
The convolutional-network-based space-time video compressed sensing method is as in Embodiments 1-4. Defining the training process of the network in the training file described in step 3c), referring to Fig. 1, includes the following steps:
3c1) Processing of training input video blocks: starting from the first sub-folder, every T = 16 pictures, taken in increasing order of the numbers in the picture names, form one input video block, with successive input video blocks shifted by one picture; e.g. 1.jpg-16.jpg is the first input video block, 2.jpg-17.jpg the second, and so on. The network of the space-time video compressed sensing method takes batchsize consecutive video blocks as each input, i.e. input video blocks 1 to batchsize form the first input, blocks 2 to batchsize+1 the second, and so on. If each picture has size H × W, then each input contains Q = batchsize × T × H × W pixel values. For large-scale training datasets, batchsize can be increased appropriately so that more data is used per step, accelerating training.
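The sliding-window block formation of step 3c1) can be sketched as follows (the function name is illustrative, not from the patent):

```python
# Illustrative sketch of step 3c1): every T consecutive frames form one
# input video block, and successive blocks are shifted by one frame.
def make_training_blocks(frame_names, T=16):
    return [frame_names[i:i + T] for i in range(len(frame_names) - T + 1)]

frames = [f"{k}.jpg" for k in range(1, 21)]   # 20 frames: 1.jpg .. 20.jpg
blocks = make_training_blocks(frames)
print(len(blocks))                  # 5 overlapping blocks
print(blocks[0][0], blocks[0][-1])  # 1.jpg 16.jpg
print(blocks[1][0])                 # 2.jpg
```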
3c2) Training process of the network of the space-time video compressed sensing method: each pixel value X0 in the batchsize consecutive video blocks input to the network is first normalized to [-1, 1]; the normalized input X is then fed into the defined network to obtain the corresponding reconstruction result X'. The mean squared error between the reconstruction result X' and the input X is computed as the reconstruction error; the reconstruction error is back-propagated to update the network parameters, completing one training step of the network.
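A hedged sketch of step 3c2): the exact normalization formula is not reproduced in this text, so a common mapping of 8-bit pixel values to [-1, 1] is assumed here; the loss is the stated mean squared error.

```python
import numpy as np

# Sketch of step 3c2). The normalization below is an assumed form
# (0 -> -1, 255 -> 1); the patent's exact formula is not shown in this text.
def normalize(x0):
    return x0 / 127.5 - 1.0

def reconstruction_loss(x, x_rec):
    # Mean squared error over all Q pixel values in the input.
    return np.mean((x - x_rec) ** 2)

x = normalize(np.array([0.0, 127.5, 255.0]))
print(x)                          # [-1.  0.  1.]
print(reconstruction_loss(x, x))  # 0.0 for a perfect reconstruction
```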
3c3) Repeat 3c1)-3c2), executing the training step repeatedly until the data in the current sub-folder is exhausted, then continue training with the data in the next sub-folder. The network model, comprising the network structure and parameters, is saved every 500 iterations. When the data in all sub-folders has been used, one pass over the training dataset is complete.
3c4) Judge whether epoch passes over the training dataset have been completed; if so, training ends, otherwise repeat 3c1)-3c4).
The convolutional-network-based space-time video compressed sensing method uses neural-network training instead of traditional iterative optimization algorithms. Traditional iterative optimization reconstructs each video through many optimization iterations, so reconstruction is slow and real-time reconstruction is impossible. The present invention effectively shifts the computational burden to the network training process: at test time, a single forward propagation of the input video through the network yields the reconstructed video, greatly reducing computation time and achieving real-time video reconstruction.
Embodiment 6
The convolutional-network-based space-time video compressed sensing method is as in Embodiments 1-5. Writing the test code into the test file according to the designed network structure, as described in step 3d) and referring to Fig. 1, includes the following steps:
3d1) Processing of test input video blocks: the test dataset is any randomly selected video, converted to grayscale video frames and saved in a folder without cropping. For a test video containing P frames, each of size H0 × W0, every T frames form one input video block; e.g. grayscale frames 1 to T form the first input video block, frames T+1 to 2T the second, and so on. The number of complete input video blocks is then n = ⌊P/T⌋, and the number of remaining video frames is p = P - n × T.
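The block count of step 3d1) is simple integer arithmetic; a sketch (illustrative function name):

```python
# Sketch of step 3d1): n complete blocks of T frames, p leftover frames
# (the leftovers are zero-padded into one more block in step 3d3).
def split_test_frames(P, T=16):
    n = P // T
    p = P - n * T
    return n, p

print(split_test_frames(100))  # (6, 4): 6 full blocks, 4 leftover frames
print(split_test_frames(32))   # (2, 0): no leftovers
```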
3d2) For the first n × T frames: the video frames are read in sequence; when H0 or W0 is not a multiple of 3, the spatial size of the observation convolution kernel, zeros are appended in the last rows or columns so that the new spatial sizes H and W are multiples of 3. Whenever the number of accumulated frames reaches T, an input video block of size T × H × W is obtained and fed into the network of the space-time video compressed sensing method to obtain the corresponding reconstructed video block.
3d3) For the last p frames: zero padding is applied in both time and space to assemble a T × H × W video block, which is then fed into the network of the space-time video compressed sensing method to obtain a reconstructed video block; only the unpadded first p frames, cropped to the unpadded size H0 × W0, are kept as the reconstruction result for this block.
3d4) The reconstructed video blocks are arranged in temporal order to obtain the reconstructed video, which is the test result of the network of the space-time video compressed sensing method;
3d5) Each test video is processed in turn according to 3d1)-3d4).
When testing the network of the convolutional-network-based space-time video compressed sensing method, the input grayscale video does not need to be cropped as during training, because the network is a fully convolutional network: once trained, a fully convolutional network can process video of any spatial size. This broadens the applicability of the network, which can compress and reconstruct grayscale video of arbitrary size.
A more detailed example is given below to further describe the present invention.
Embodiment 7
The convolutional-network-based space-time video compressed sensing method, as in Embodiments 1-6, includes the following steps:
Step 1, training dataset is prepared.
1a) Download the required videos of different resolutions and place them in a "video" folder; create an empty folder named "frame" for saving grayscale video frames, and an empty folder named "patch" for saving the cropped grayscale video frames.
1b) Configure CUDA acceleration on the computer, install Python, and install the third-party Python libraries TensorFlow and cv2.
1c) Write code using the cv2 library to convert videos into grayscale video frames:
1c1) Enter the "video" folder, access a video in the video set, and obtain the video name.
1c2) Create a video-frame folder with the same name as the video under the "frame" folder.
1c3) For the accessed video, perform the following steps: read its video frames with the cv2.VideoCapture() video-reading function; convert the color frames to grayscale frames with the cv2.cvtColor() function; save the processed video frames into the created video-frame folder in temporal order, e.g. 1.jpg, 2.jpg, and so on.
1d) Crop the grayscale video frames:
1d1) Write a frame_to_patch() function that crops the grayscale frames in each sub-folder of the "frame" folder into patches of size 360 × 240 according to spatial position;
1d2) Save the patch at each spatial position as pictures, named in video frame order, i.e. 1.jpg, 2.jpg, and so on; store the pictures in different sub-folders under the "patch" folder, each sub-folder named "grayscale-frame folder name + spatial position number", e.g. "Horse.avi_1", "Flower_3"; use the "patch" folder as the training dataset.
1e) The test dataset is any randomly selected video, converted to grayscale video and saved in temporal order in a folder "test_frame", e.g. 1.jpg, 2.jpg, and so on.
Step 2, write the training and test code with the deep learning framework TensorFlow according to the designed network structure:
2a) Create a project folder and, inside it, create train.py for training, test.py for testing, config.py for saving parameter settings, network.py for saving the network structure, and utils.py for storing the functions used in the other files, to simplify the code.
2b) Set reasonable network hyperparameters in the config.py file, such as the number of frame groups input per iteration batchsize = 16, the number of frames per frame group N = 16, and the number of passes over the dataset during training epoch = 5; write the functions required by the training and test code, such as the function normalizing inputs to [-1, 1] and the mean-squared-error loss function, into utils.py.
2c) Write the network.py file to define the structure of the fully convolutional network used, as shown in Fig. 2:
2c1) The input first passes through a three-dimensional convolution layer that performs the observation. This convolution layer contains 9 three-dimensional convolution kernels of size 3 × 3 × N, yielding observations compressed in both spatial and temporal dimensions with an observation rate of 1/16.
2c2) The observation result serves as the input of a three-dimensional transposed convolution layer, which up-projects the low-dimensional observations; the output size of this transposed convolution layer equals the input size of the observation convolution layer, and its kernels have size 3 × 3 × N. This yields an initial restoration with the same temporal and spatial size as the input.
2c3) The initial restoration is fed into 1 or 2 cascaded "space-time blocks". The structure of a space-time block is shown in Fig. 3, which illustrates the composition of one "space-time block"; the "space-time block" of the present invention enhances the correlation of the preliminarily reconstructed frames in the time and space dimensions, yielding the final reconstruction result.
Briefly, the input video passes through the three-dimensional convolution layer to obtain the observation result, which is fed into the three-dimensional transposed convolution layer for initial restoration and then into the "space-time blocks" for spatio-temporal enhancement, producing the final reconstructed video.
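A shape walk-through of this observe/restore pipeline (a sketch under the sizes stated above, with illustrative function names): the stride-3 unpadded observation convolution compresses the block, and the symmetric transposed convolution restores the input size, which the "space-time blocks" then preserve.

```python
# Shape sketch of step 2c): observation conv (stride 3, no padding)
# compresses, the symmetric transposed conv restores the input shape.
def conv_shape(T, H, W, N):
    # T x H x W input, N kernels -> N x (H/3) x (W/3) observation.
    return (N, H // 3, W // 3)

def deconv_shape(obs, T):
    n, h, w = obs
    # Symmetric transposed convolution recovers T x H x W.
    return (T, h * 3, w * 3)

obs = conv_shape(16, 360, 240, 9)
print(obs)                    # (9, 120, 80)
print(deconv_shape(obs, 16))  # (16, 360, 240)
```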
2d) Write the train.py file to implement the training process of the network:
2d1) Group the acquired frames into video blocks of N frames each, take batchsize video blocks as the network input, and forward-propagate them through the network defined in 2c) to obtain the reconstruction result.
2d2) Taking the network input as reference, compute the reconstruction error between the obtained reconstruction result and the network input, using the mean squared error; back-propagate this reconstruction error and update the network parameters, completing one training step of the network.
2d3) Judge whether epoch passes over the training dataset have been completed; if so, training ends, otherwise repeat 2d1)-2d2).
2d4) Save the final parameter model for use in testing.
2e) Write test.py to implement the test process:
2e1) Load the saved network model and process each test video according to 2e2)-2e4) in turn.
2e2) Read video frames sequentially from the test video frame folder until the accumulated frame count reaches N, i.e. one video block, and feed it into the network defined in 2c) to obtain a reconstructed video block. For the final frames numbering fewer than N, apply zero padding to assemble a video block and feed it into the network defined in 2c) to obtain the reconstruction result. Record the time taken by each reconstruction.
2e3) Whenever a reconstructed video block is obtained in 2e2), immediately write it to a video file with the same name as the test video, realizing real-time reconstruction; for the reconstruction result of the last, zero-padded video block, write only the unpadded leading frames to the video file.
2e4) Save each frame of the reconstructed video blocks obtained in 2e2) as pictures in temporal order, e.g. 1.jpg, 2.jpg, and so on. For each frame of the reconstructed frame group, compute and save the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) against the corresponding input frame.
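A minimal PSNR sketch for step 2e4), assuming 8-bit frames (MAX = 255); SSIM is omitted here for brevity:

```python
import numpy as np

# Sketch of the PSNR computation in step 2e4) for 8-bit grayscale frames.
def psnr(ref, rec):
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
rec = ref.copy()
rec[0, 0] = 110          # one pixel off by 10 grey levels
print(round(psnr(ref, rec), 2))
```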
Step 3, train the network:
3a) Replace the training dataset path in the training code with the path of the "patch" grayscale video block folder.
3b) Load the network structure file network.py and the network configuration file config.py.
3c) Execute train.py in the project folder until training ends, obtaining the final parameter model.
Step 4, test the network:
4a) Replace the input video path in the test code with the path of the "test_frame" test video folder.
4b) Load the network structure file network.py and the final parameter model.
4c) Execute test.py in the project folder to obtain the reconstructed video.
The video compressed sensing method of the present invention, based on a fully convolutional neural network, uses few parameters and has an extremely short forward propagation time, enabling real-time reconstruction; moreover, the "space-time blocks" in the reconstruction part of the network enhance temporal correlation, and combined with the acquired spatio-temporally balanced observations this greatly improves the video reconstruction effect, yielding high-quality reconstructed video.
The technical effect of the present invention is further explained below through simulation and its data.
Embodiment 8
The convolutional-network-based space-time video compressed sensing method is as in Embodiments 1-7.
Test conditions:
A network is built whose reconstruction part contains one "space-time block", with the remaining network parameters as described above.
A second network is built whose reconstruction part contains two "space-time blocks", with the remaining network parameters as described above.
The two networks are trained separately to obtain trained networks.
Test experiment contents:
Test experiment 1: "walk" and "foliage" from Vidset4 are chosen as the test dataset (see Fig. 4). The test dataset is fed through the two trained networks of the present invention to obtain reconstructed videos; the reconstructed frames of one arbitrary frame from each of the two test videos are shown in Fig. 4. In Fig. 4, the first row shows the original frames of the "walk" and "foliage" input videos; the second row shows the reconstructed frames produced by the network of the space-time video compressed sensing method with one "space-time block"; the third row shows the reconstructed frames produced by the network with two "space-time blocks".
As seen from Fig. 4, the grayscale videos reconstructed by the network of the space-time video compressed sensing method have good visual quality. Comparing the second and third rows of Fig. 4, the network with one "space-time block" already recovers the main content of the original video frames, while the network with two "space-time blocks" yields clearer reconstructions, further proving that the "space-time block" of the present invention improves reconstruction quality.
Test experiment 2: the grayscale videos of "walk" and "foliage" are each fed into the networks of the space-time video compressed sensing method with one "space-time block" and with two "space-time blocks" to obtain reconstructed videos; the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the first 32 frames of the reconstructed videos are computed, with results in Table 1:
Table 1: PSNR/SSIM of reconstructed video frames
One block | Walk | Foliage | Two blocks | Walk | Foliage |
1st frame | 25.39/0.809 | 23.05/0.664 | 1st frame | 26.07/0.817 | 23.81/0.716 |
2nd frame | 25.87/0.841 | 24.39/0.742 | 2nd frame | 26.34/0.839 | 24.90/0.769 |
3rd frame | 25.96/0.849 | 24.87/0.766 | 3rd frame | 26.64/0.854 | 25.02/0.779 |
… | … | … | … | … | … |
Average | 25.25/0.839 | 23.77/0.690 | Average | 25.80/0.844 | 24.10/0.722 |
Comparing the reconstructed frames in Fig. 4 by human subjective vision shows that the "space-time block" enhances spatio-temporal correlation and gives good video reconstruction. The image-metric results in Table 1 illustrate the reconstruction quality of the present invention from a quantitative angle. As seen from Table 1, every frame of the grayscale videos reconstructed by the network of the space-time video compressed sensing method has good PSNR and SSIM, and the values are close in magnitude, proving that the reconstructed videos have good reconstruction quality; this shows that the reconstruction results have good spatio-temporal balance and that the quality of each reconstructed frame is stable.
Test experiment 3: the average per-frame reconstruction times over the first 32 frames of the reconstructed "walk" and "foliage" videos are computed separately for the networks of the space-time video compressed sensing method with one "space-time block" and with two "space-time blocks":
the average per-frame reconstruction time of the network with one "space-time block" is 0.03-0.04 s;
the average per-frame reconstruction time of the network with two "space-time blocks" is 0.05-0.06 s.
The network of the space-time video compressed sensing method of the present invention therefore reconstructs video quickly and in real time; since adding "space-time blocks" increases the reconstruction time, as many "space-time blocks" as possible can be used while real-time performance is maintained.
In brief, the convolutional-network-based space-time video compressed sensing method disclosed by the present invention mainly solves the problems of poor spatio-temporal balance of video compression and poor real-time performance of video reconstruction in the prior art. Its scheme is: 1) prepare the training dataset; 2) design the network structure of the space-time video compressed sensing method; 3) write the training and test files according to the designed network structure; 4) train the network of the space-time video compressed sensing method; 5) test the network of the space-time video compressed sensing method. The network of the space-time video compressed sensing method of the present invention uses an observation technique that compresses time and space simultaneously and a reconstruction technique that enhances spatio-temporal correlation with "space-time blocks"; it not only achieves real-time video reconstruction but also produces results with strong spatio-temporal balance and high, stable reconstruction quality, and can be used for the compressed transmission and subsequent reconstruction of video.
Claims (3)
1. A convolutional-network-based space-time video compressed sensing method, characterized by including the following steps:
1) Prepare the training dataset: download the required videos of different resolutions and pre-process all downloaded videos, converting each in turn into grayscale video frames and cropping them spatially into patches of a certain size; save the patch at each spatial position as pictures stored in different sub-folders and named in video frame order, i.e. 1.jpg, 2.jpg, and so on; all sub-folders together constitute one overall folder, which serves as the training dataset. The test dataset is any randomly selected video, converted to grayscale video and saved in a folder;
2) Design the network structure of the space-time video compressed sensing method:
The network structure comprises an observation part and a reconstruction part. In the observation part, the input video block is fed through one three-dimensional convolution layer whose output is the observation; in the reconstruction part, the observation output is connected in series to a three-dimensional transposed convolution layer, several "space-time blocks", a BN layer and a three-dimensional convolution layer, after which the reconstructed video block is obtained. Each "space-time block" consists of four three-dimensional convolution layers connected in series, with a BN layer added before each three-dimensional convolution layer, and a residual connection between the output of the first three-dimensional convolution layer and the output of the fourth;
3) Write the training and test files according to the designed network structure:
3a) Create a project folder and, within it, create program files for training, testing, the network structure, the network configuration and the functions;
3b) Set up and write the associated file contents: set reasonable network hyperparameters in the network configuration file, write the functions required by the training and test code into the function file, and write the network structure of the space-time video compressed sensing method into the network structure file;
3c) Define the training process of the network in the training file: from the sub-folders of the training dataset, take several video frames in order to form video blocks as the network input of the space-time video compressed sensing method; compute the mean-squared-error reconstruction loss between the reconstructed video block output by the network of the space-time video compressed sensing method and the input video block, and back-propagate the reconstruction loss to update the network parameters;
3d) Define the test process of the network in the test file: convert the grayscale videos of the test dataset into video frames, divide the grayscale frames in temporal order into many groups of consecutive frames, each group forming one input video block, and feed each into the network of the space-time video compressed sensing method to obtain the corresponding reconstructed video block, which contains the same number of frames as the input video block; arrange the reconstructed video blocks in temporal order to obtain the reconstructed video, which is the network test result of the space-time video compressed sensing method;
4) Train the network of the space-time video compressed sensing method: load the network structure from the network structure file and initialize all the parameters involved, using the stochastic gradient descent algorithm; load the hyperparameters from the network configuration file; and, following the training process defined in the training file, repeatedly train the network of the space-time video compressed sensing method on the training dataset and update the parameters; when training ends, the final parameter model is obtained;
5) Test the network of the space-time video compressed sensing method: load the network structure from the network structure file and load the final parameter model as the network parameters; following the test process defined in the test file, test the network of the space-time video compressed sensing method with the test dataset to obtain reconstructed test videos of real-time, high quality and high spatio-temporal balance, the space-time video compressed sensing result.
2. The convolutional-network-based space-time video compressed sensing method according to claim 1, characterized in that designing the network structure of the space-time video compressed sensing method described in step 2) includes the following steps:
2a) Setting of the three-dimensional convolution layer of the observation part: the kernel of the three-dimensional convolution layer is set to size T × 3 × 3, where T = 16 is the kernel size along the time dimension and 3 × 3 is the size in the spatial dimensions; no zero padding is used during convolution, and the stride is 3. The input video block size is T × H × W, where T is the number of frames in the input video block, H × W is the spatial size of each frame, and H and W are multiples of 3. With one convolution kernel, the spatial compression ratio is 9 and the temporal compression ratio is T, giving an observation rate of 1/(9T); with N convolution kernels, the observation rate is N/(9T); when N = 9 and T = 16, the observation rate is 1/16;
2b) setting of the three-dimensional deconvolution layer of the reconstruction part: the convolution kernel of the three-dimensional deconvolution layer is set according to the symmetry between convolution and deconvolution: when the kernel size of the deconvolution process equals the kernel size of the corresponding convolution process, the output of the deconvolution has exactly the same size as the input of the convolution; the three-dimensional deconvolution layer is therefore configured identically to the three-dimensional convolution layer of the observation part, i.e. kernel size T × 3 × 3, N kernels, no zero padding, and a stride of 3;
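By the symmetry argument of step 2b), a transposed 3D convolution with the same kernel size, kernel count, and stride as the observation convolution restores the original block size. A minimal PyTorch sketch (names are illustrative):

```python
import torch
import torch.nn as nn

T, N = 16, 9
# Deconvolution layer: same kernel (T x 3 x 3), count (N) and stride (3)
# as the observation convolution, so its output recovers T x H x W.
deconv = nn.ConvTranspose3d(in_channels=N, out_channels=1,
                            kernel_size=(T, 3, 3), stride=(3, 3, 3),
                            padding=0, bias=False)

meas = torch.randn(1, N, 1, 32, 32)   # measurements from the observation layer
recon = deconv(meas)                  # -> (1, 1, 16, 96, 96): block size restored
```

The transposed convolution's output size is (in − 1) · stride + kernel per dimension, which exactly inverts the size arithmetic of the observation layer when kernel and stride match.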
2c) design of the "space-time blocks" of the reconstruction part: each "space-time block" consists of four three-dimensional convolution layers connected in series, with a BN layer added before each three-dimensional convolution layer of the "space-time block"; a residual connection links the input of the second three-dimensional convolution layer to the output of the fourth, so that the second, third, and fourth three-dimensional convolution layers form a residual block; each "space-time block" is thus one three-dimensional convolution layer followed in series by one residual block;
2d) setting of the three-dimensional convolution layer of the reconstruction part: the last three-dimensional convolution layer of the reconstruction part has a kernel size of 16 × 1 × 1, 16 kernels, a stride of 1, and no zero padding.
3. The space-time video compressed sensing method based on a convolutional network according to claim 2, characterized in that each "space-time block" in step 2c) being one three-dimensional convolution layer followed in series by one residual block comprises the following steps:
2c1) setting of the three-dimensional convolution layer in each "space-time block": the convolution kernel of the three-dimensional convolution layer has size 16 × 1 × 1, with 16 kernels, no zero padding, and a stride of 1;
2c2) setting of the residual block in each "space-time block": the residual block contains three three-dimensional convolution layers whose kernel sizes are 16 × 3 × 3, 64 × 1 × 1, and 32 × 3 × 3 and whose kernel numbers are 64, 32, and 16 respectively, all with no zero padding and a stride of 1; the input of the residual block is connected to its output and the two are summed, and a Tanh activation layer is added after the sum;
2c3) before each three-dimensional convolution layer in 2c1) and 2c2), a BN layer is added to accelerate convergence, and a PReLU is added to enhance the nonlinearity of the network;
2c4) multiple "space-time blocks" can be cascaded, i.e. the output of one "space-time block" serves as the input of the next "space-time block" so as to expand the network capacity, the structure of each block being identical.
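The "space-time block" of steps 2c1)–2c4) can be sketched as follows. This is a minimal PyTorch sketch under two stated assumptions: the kernel notation c × k × k is read as (input channels) × height × width over a 16-channel feature tensor, so 2D convolutions are used; and the 3 × 3 convolutions use padding = 1 so that the residual sum's shapes match (the claim itself specifies no zero padding). Class and variable names are illustrative:

```python
import torch
import torch.nn as nn

def bn_conv(cin, cout, k, pad):
    """BN before each conv layer (2c3), PReLU after it."""
    return nn.Sequential(nn.BatchNorm2d(cin),
                         nn.Conv2d(cin, cout, k, stride=1, padding=pad),
                         nn.PReLU())

class SpaceTimeBlock(nn.Module):
    """One 'space-time block': a 16->16 1x1 conv layer (2c1) followed by a
    residual block of three convs 16->64->32->16 (2c2); the skip sum is
    passed through Tanh."""
    def __init__(self):
        super().__init__()
        self.head = bn_conv(16, 16, 1, 0)                # 16 kernels, 16 x 1 x 1
        self.res = nn.Sequential(bn_conv(16, 64, 3, 1),  # 64 kernels, 16 x 3 x 3
                                 bn_conv(64, 32, 1, 0),  # 32 kernels, 64 x 1 x 1
                                 bn_conv(32, 16, 3, 1))  # 16 kernels, 32 x 3 x 3
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.head(x)
        return self.act(x + self.res(x))                 # residual sum, then Tanh

# 2c4): cascade several identical blocks to expand network capacity.
blocks = nn.Sequential(*[SpaceTimeBlock() for _ in range(3)])
```

Since every block maps a 16-channel tensor to a 16-channel tensor of the same spatial size, any number of blocks can be chained, as step 2c4) requires.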
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810777563.XA CN108923984B (en) | 2018-07-16 | 2018-07-16 | Space-time video compressed sensing method based on convolutional network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108923984A (en) | 2018-11-30 |
CN108923984B (en) | 2021-01-12 |
Family
ID=64411851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810777563.XA Active CN108923984B (en) | 2018-07-16 | 2018-07-16 | Space-time video compressed sensing method based on convolutional network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108923984B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1938590A2 (en) * | 2005-10-17 | 2008-07-02 | QUALCOMM Incorporated | Method and apparatus for spatio-temporal deinterlacing aided by motion compensation for field-based video |
CN106778854A (en) * | 2016-12-07 | 2017-05-31 | 西安电子科技大学 | Activity recognition method based on trajectory and convolutional neural network feature extraction |
CN106911930A (en) * | 2017-03-03 | 2017-06-30 | 深圳市唯特视科技有限公司 | Compressed sensing video reconstruction method based on a recurrent convolutional neural network |
Non-Patent Citations (2)
Title |
---|
AARON CHADHA et al.: "Compressed-domain video classification with deep neural networks: 'There's way too much information to decode the matrix'", 2017 IEEE International Conference on Image Processing (ICIP) * |
HOU JINGXUAN: "Compressed video post-processing based on convolutional neural networks", China Master's Theses Full-text Database * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113196718B (en) * | 2018-12-11 | 2023-07-14 | 瑞典爱立信有限公司 | Techniques for user plane quality of service analysis |
CN113196718A (en) * | 2018-12-11 | 2021-07-30 | 瑞典爱立信有限公司 | Techniques for user plane quality of service analysis |
CN109859120A (en) * | 2019-01-08 | 2019-06-07 | 北京交通大学 | Image defogging method based on multi-scale residual network |
CN109819256A (en) * | 2019-03-06 | 2019-05-28 | 西安电子科技大学 | Video compressed sensing method based on feature perception |
CN109819256B (en) * | 2019-03-06 | 2022-07-26 | 西安电子科技大学 | Video compression sensing method based on feature sensing |
CN110059823A (en) * | 2019-04-28 | 2019-07-26 | 中国科学技术大学 | Deep neural network model compression method and device |
CN110166779B (en) * | 2019-05-23 | 2021-06-08 | 西安电子科技大学 | Video compression method based on super-resolution reconstruction |
CN110166779A (en) * | 2019-05-23 | 2019-08-23 | 西安电子科技大学 | Video compression method based on super-resolution reconstruction |
CN110503609A (en) * | 2019-07-15 | 2019-11-26 | 电子科技大学 | Image rain removal method based on a hybrid sensing model |
CN112866763A (en) * | 2020-12-28 | 2021-05-28 | 网宿科技股份有限公司 | Sequence number generation method for HLS multi-bitrate stream slices, server and storage medium |
CN112866763B (en) * | 2020-12-28 | 2023-05-26 | 网宿科技股份有限公司 | Sequence number generation method for HLS multi-bitrate stream slices, server and storage medium |
CN117292209A (en) * | 2023-11-27 | 2023-12-26 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
CN117292209B (en) * | 2023-11-27 | 2024-04-05 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
Also Published As
Publication number | Publication date |
---|---|
CN108923984B (en) | 2021-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108923984A (en) | Space-time video compressed sensing method based on convolutional network | |
CN108550115B (en) | Image super-resolution reconstruction method | |
CN107133919A (en) | Time dimension video super-resolution method based on deep learning | |
CN110310227A (en) | Image super-resolution reconstruction method based on high- and low-frequency information decomposition | |
CN104574336B (en) | Super-resolution image reconstruction system based on adaptive submodular dictionary selection | |
Zhu et al. | Efficient single image super-resolution via hybrid residual feature learning with compact back-projection network | |
CN109697697B (en) | Reconstruction method of spectral imaging system based on optimization heuristic neural network | |
CN109360152A (en) | 3D medical image super-resolution reconstruction method based on dense convolutional neural networks | |
US20200134887A1 (en) | Reinforcement Learning for Online Sampling Trajectory Optimization for Magnetic Resonance Imaging | |
CN113205595B (en) | Construction method and application of 3D human body posture estimation model | |
CN107123094B (en) | Video denoising method mixing Poisson, Gaussian and impulse noise | |
CN106097253B (en) | Single-image super-resolution reconstruction method based on block rotation and clarity | |
KR20200140713A (en) | Method and apparatus for training neural network model for enhancing image detail | |
CN109741407A (en) | High-quality reconstruction method for a spectral imaging system based on convolutional neural networks | |
CN109472743A (en) | Super-resolution reconstruction method for remote sensing images | |
Sankisa et al. | Video error concealment using deep neural networks | |
CN114841856A (en) | Image super-pixel reconstruction method of dense connection network based on depth residual channel space attention | |
CN110378975A (en) | A kind of compressed encoding aperture imaging method and system based on deep neural network | |
CN109949217A (en) | Video super-resolution reconstruction method based on residual learning and implicit motion compensation | |
CN108289175A (en) | Low-latency virtual reality display method and display system | |
CN111510739A (en) | Video transmission method and device | |
CN111951203A (en) | Viewpoint synthesis method, apparatus, device and computer readable storage medium | |
CN109819256B (en) | Video compression sensing method based on feature sensing | |
CN111882512B (en) | Image fusion method, device and equipment based on deep learning and storage medium | |
CN109272450A (en) | Image super-resolution method based on convolutional neural networks | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||