CN106911930A - A method for compressed sensing video reconstruction based on a recurrent convolutional neural network - Google Patents


Info

Publication number
CN106911930A
CN106911930A (application CN201710124135.2A)
Authority
CN
China
Prior art keywords
cnn
network
lstm
video
frame
Prior art date
Legal status
Withdrawn
Application number
CN201710124135.2A
Other languages
Chinese (zh)
Inventor
夏春秋
Current Assignee
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd
Priority to CN201710124135.2A
Publication of CN106911930A
Legal status: Withdrawn

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00: Image coding
    • G06T 9/002: Image coding using neural networks

Abstract

The present invention proposes a method for compressed sensing video reconstruction based on a recurrent convolutional neural network. Its main components are: the compressed sensing network (CSNet); the CSNet algorithm structure; convolutional neural networks (CNN); the long short-term memory (LSTM) network; CSNet network training; and compressed sensing video reconstruction. The process is as follows: an RNN extracts motion features and a CNN extracts visual features, and the two kinds of extracted information are fused; an LSTM network aggregates all extracted features and combines them with the motion inferred from its hidden state to form the reconstruction. The invention overcomes the difficulty existing methods have in guaranteeing video reconstruction quality at high compression ratios. It provides an end-to-end trainable, non-iterative model that raises the compression ratio (CR) of CS cameras, improves video reconstruction quality, and reduces the bandwidth of data transmission, so that high-frame-rate video applications can be supported.

Description

A method for compressed sensing video reconstruction based on a recurrent convolutional neural network
Technical field
The present invention relates to the field of video compression and reconstruction, and more particularly to a method for compressed sensing video reconstruction based on a recurrent convolutional neural network.
Background technology
Video compression and reconstruction are widely used in physics and bioscience research, video surveillance, remote sensing, social networks, and other fields. In physics and bioscience research, high-speed cameras record event features at rates beyond what conventional cameras can capture; they can record high-resolution still images of high-speed events, for example tracking exploding bubbles "with negligible motion blur and image distortion artifacts". In video surveillance, regions of interest in a surveillance video can be reconstructed, and images of particular persons or license plates can be enhanced to improve recognition. However, a camera shooting 1080p HD video at 10 kfps produces roughly 500 GB of data per second, which poses an enormous challenge to existing transmission and storage technology; how to transmit and store such high-volume video efficiently is a current research focus.
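The bandwidth challenge above can be checked with back-of-envelope arithmetic. The sketch below assumes 1920x1080 RGB at 8 bits per channel (an assumption, not stated in the patent; the patent's ~500 GB/s figure presumably assumes a larger sensor format or bit depth).

```python
# Raw data rate of an uncompressed high-speed camera stream.
# Assumptions (not from the patent): 1920x1080 RGB, 8 bits per channel.
width, height = 1920, 1080
bytes_per_pixel = 3          # 8-bit R, G, B
fps = 10_000                 # 10 kfps high-speed capture

bytes_per_frame = width * height * bytes_per_pixel
bytes_per_second = bytes_per_frame * fps

print(f"{bytes_per_frame / 1e6:.1f} MB/frame")
print(f"{bytes_per_second / 1e9:.1f} GB/s")   # ~62 GB/s under these assumptions
```

Even under these conservative assumptions, the raw stream is tens of gigabytes per second, far beyond practical transmission and storage rates, which motivates compressing at acquisition time.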
The present invention proposes a method for compressed sensing video reconstruction based on a recurrent convolutional neural network. It uses a convolutional neural network (CNN) and a recurrent neural network (RNN) to extract spatiotemporal features, including background, object detail, and motion information, achieving better reconstruction quality. Specifically, random encoders run in parallel, encoding the first frame of the video with more measurements while encoding the residual frames with fewer measurements. For each compressed measurement, a dedicated CNN extracts spatial features; a long short-term memory (LSTM) network aggregates all features extracted by the CNNs and, together with the motion inferred from its hidden state, forms the reconstruction. The invention breaks through the limitation of conventional approaches that treat video as a sequence of independent images: the RNN applies temporal information to the reconstruction process, producing a more accurate model. In addition, while preserving the original video's visual details, the method raises the compression ratio, reduces the bandwidth of data transmission, improves video reconstruction quality, and supports high-frame-rate video applications.
The content of the invention
Given that existing methods struggle to guarantee video reconstruction quality at high compression ratios, the object of the present invention is to provide a method for compressed sensing video reconstruction based on a recurrent convolutional neural network that surpasses the limitations of conventional methods, raises the compression ratio (CR) of CS cameras, improves video reconstruction quality, and reduces the bandwidth of data transmission, so that high-frame-rate video applications can be supported.
To solve the above problems, the present invention provides a method for compressed sensing video reconstruction based on a recurrent convolutional neural network, whose main components include:
(1) compressed sensing network (CSNet);
(2) CSNet algorithm structures;
(3) convolutional neural networks (CNN);
(4) long short-term memory (LSTM) network;
(5) CSNet network trainings;
(6) compressed sensing video reconstruction.
The compressed sensing network (CSNet) is a deep neural network that can learn visual representations from random measurements. It is an end-to-end trainable, non-iterative model for compressed sensing video reconstruction that combines a convolutional neural network (CNN) and a recurrent neural network (RNN) so as to perform video reconstruction using spatiotemporal features. The network structure can accept random measurements at multiple compression ratios (CRs) and provides background information and object detail separately, achieving better reconstruction quality.
The CSNet algorithm structure comprises three modules: a random encoder for measurement, a CNN cluster for visual feature extraction, and an LSTM for temporal reconstruction. The random encoders run in parallel, encoding the first frame of the video with more measurements while encoding the residual frames with fewer measurements, and can accept measurements at multiple compression ratios (CRs). Under this algorithm, key frames and non-key frames (the remaining frames, which mainly contribute motion information) are compressed separately; the recurrent neural network (RNN) infers motion information and combines it with the visual features extracted by the convolutional neural network (CNN) to synthesize high-quality frames. Efficient information fusion achieves an optimal balance between fidelity and compression ratio (CR) for compressed sensing (CS) video applications.
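The measurement stage of this three-module structure can be sketched at the shape level as follows. All sizes, compression ratios, and the use of Gaussian matrices as the parallel random encoders are illustrative stand-ins, not values taken from the patent:

```python
import numpy as np

# Shape-level sketch of the CSNet measurement stage: each video patch has
# T frames of N pixels; the first frame is a key frame encoded with more
# measurements (low CR), the rest are non-key frames encoded with fewer
# measurements (high CR). Sizes and CRs are illustrative.
rng = np.random.default_rng(0)
T, N = 10, 32 * 32              # frames per patch, pixels per frame
m_key, m_nonkey = 256, 64       # CR = 4:1 for key, 16:1 for non-key

phi_key = rng.standard_normal((m_key, N))       # random encoder, key frame
phi_nonkey = rng.standard_normal((m_nonkey, N)) # random encoder, non-key

patch = rng.standard_normal((T, N))             # stand-in video patch
y_key = phi_key @ patch[0]                      # first frame: key frame
y_nonkey = [phi_nonkey @ f for f in patch[1:]]  # remaining frames

print(y_key.shape, y_nonkey[0].shape, len(y_nonkey))  # (256,) (64,) 9
```

The key-frame measurement carries most of the background and detail information, while the shorter non-key measurements mainly contribute motion, matching the division of labour described above.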
The convolutional neural network (CNN) performs compressed measurement and reconstruction of images, with temporal compression and spatial compression combined for maximum compression ratio. A larger CNN is designed to process key frames, because key frames carry high entropy information; meanwhile, a smaller CNN is designed to process non-key frames. To reduce system latency and simplify the network structure, image blocks are used as input; all feature maps generated by the CNN then have the same size as the image block, and the number of feature maps decreases monotonically. The network's input is the m-dimensional vector formed by the compressed measurement; a fully connected layer before the CNN uses these measurements to generate a two-dimensional feature map.
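The fully connected layer that lifts an m-dimensional measurement to a block-sized two-dimensional feature map can be sketched as below. Block size, measurement count, and the random weights are illustrative assumptions:

```python
import numpy as np

# Sketch of the first (fully connected) stage of the reconstruction CNN:
# it maps an m-dimensional compressed measurement to a 2-D feature map
# the same size as the image block. Sizes and weights are illustrative.
rng = np.random.default_rng(1)
block = 32                     # image-block side length
m = 256                        # measurements per block (CR = 4:1)

W_fc = rng.standard_normal((block * block, m)) * 0.01
b_fc = np.zeros(block * block)

y = rng.standard_normal(m)                      # one compressed measurement
feat = (W_fc @ y + b_fc).reshape(block, block)  # 2-D feature map

print(feat.shape)  # (32, 32)
```

Subsequent convolutional layers would keep this spatial size while the number of feature maps decreases monotonically toward the single-channel output block, as the description states.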
Further, for the temporal compression: to obtain a higher compression ratio (CR), each video patch containing T frames is divided into K key frames and (T-K) non-key frames. Key frames are compressed at a low compression ratio (CR) and non-key frames at a high compression ratio (CR), so that the measurement information of the key frames can be reused to reconstruct the non-key frames; this can be regarded as temporal compression.
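The effective compression ratio of this key/non-key split follows from counting measurements against pixels. The numbers below are illustrative, not from the patent:

```python
# Effective compression ratio of the key/non-key split: a T-frame patch of
# N-pixel frames is measured with m_key values for each of K key frames
# and m_nonkey values for each of the T-K non-key frames.
def effective_cr(T, K, N, m_key, m_nonkey):
    pixels = T * N
    measurements = K * m_key + (T - K) * m_nonkey
    return pixels / measurements

# 10 frames of 32x32 blocks, 1 key frame at CR 4:1, 9 non-key at CR 16:1
cr = effective_cr(T=10, K=1, N=1024, m_key=256, m_nonkey=64)
print(round(cr, 2))  # 12.31
```

The overall CR lands between the per-frame CRs, weighted heavily toward the non-key ratio, which is why spending extra measurements on a few key frames is affordable.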
The long short-term memory (LSTM) network is used for temporal reconstruction. To obtain an end-to-end trained and computationally efficient model, no preprocessing is applied to the raw input; an LSTM network extracts the motion features essential to reconstruction, thereby estimating the video's optical flow. The synthesis LSTM network performs motion extrapolation and the aggregation of spatial visual features and motion, to achieve video reconstruction.
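The aggregation mechanism can be illustrated with a minimal LSTM cell that consumes one per-frame feature vector per step while its hidden state carries information forward. Weights, sizes, and inputs are random stand-ins; this sketches the mechanism, not the trained synthesis network:

```python
import numpy as np

# Minimal LSTM cell in NumPy: one per-frame visual feature vector enters
# per time step; the cell state carries long-range (motion) information.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)                 # input/forget/output/candidate
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g                       # cell state: long-term memory
    h_new = o * np.tanh(c_new)                  # hidden state: per-step output
    return h_new, c_new

rng = np.random.default_rng(2)
d_feat, d_hid, T = 16, 8, 10                    # feature dim, hidden dim, frames
Wx = rng.standard_normal((4 * d_hid, d_feat)) * 0.1
Wh = rng.standard_normal((4 * d_hid, d_hid)) * 0.1
b = np.zeros(4 * d_hid)

h = np.zeros(d_hid)
c = np.zeros(d_hid)
for t in range(T):                              # one CNN feature vector per frame
    x_t = rng.standard_normal(d_feat)
    h, c = lstm_step(x_t, h, c, Wx, Wh, b)

print(h.shape)  # (8,)
```

After T steps the hidden state summarizes the whole patch; in CSNet this summary is what gets combined with the spatial features to form the reconstructed frames.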
Further, in the LSTM network's training process, the first M inputs of the LSTM are the data extracted by the CNN that processes key frames, and the remaining (T-M) inputs are the outputs of the CNN that processes non-key frames. Each LSTM unit receives the visual features of the key frames; these visual features are used to reconstruct the background, recover the current frame of the object, and estimate the last several frames.
The CSNet network training is divided into two stages. In the first stage, the background CNN is pretrained, and visual features are extracted from the K key frames. In the second stage, to give the model more basic blocks extracted from the source for building objects, the (T-M) smaller CNNs are trained from scratch; these object CNNs and the pretrained background CNN are combined by a synthesis LSTM, and the three networks are trained together. To reduce the number of parameters required for training, only the last several layers of the key-frame CNN are combined, so the input of these layers is a feature map rather than a measurement. The average Euclidean loss is used as the loss function, i.e.
L(W, b) = (1/(2N)) Σ_{i=1}^{T} ||f(y_i, W, b) - x_i||₂²
Here, W and b are the network weights and biases, and x_i and y_i are each image block and its CS measurement; a random Gaussian matrix is used for CS encoding.
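The Gaussian CS encoding and the average Euclidean loss can be sketched together on toy data. The linear "network" f, the sizes, and the data are illustrative stand-ins for the actual CNN:

```python
import numpy as np

# Average Euclidean loss for a toy reconstruction f(y) = A @ y (a linear
# stand-in for the network). y_i is the CS measurement of block x_i under
# a random Gaussian matrix, as in the description. Sizes are illustrative.
rng = np.random.default_rng(3)
N_blocks, n_pix, m = 100, 64, 16

phi = rng.standard_normal((m, n_pix))         # random Gaussian CS encoder
X = rng.standard_normal((N_blocks, n_pix))    # image blocks x_i
Y = X @ phi.T                                 # measurements y_i = phi @ x_i

A = rng.standard_normal((n_pix, m)) * 0.01    # stand-in "network" weights
X_hat = Y @ A.T                               # reconstructions f(y_i)

loss = np.sum((X_hat - X) ** 2) / (2 * N_blocks)
print(f"average Euclidean loss: {loss:.4f}")
```

Training would adjust the network parameters (here A) by gradient descent to drive this loss down; the formula above is exactly the per-batch quantity being minimized.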
For the compressed sensing video reconstruction, the information-based current frame is built: a recurrent neural network (RNN) extracts motion features, a convolutional neural network (CNN) extracts visual features, and the two kinds of extracted information are fused; an LSTM network aggregates all extracted features and combines them with the motion inferred from its hidden state to form the reconstruction.
Brief description of the drawings
Fig. 1 is a system flow chart of the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention.
Fig. 2 shows the overall framework structure of the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention.
Fig. 3 is a schematic diagram of CSNet network training for the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention.
Fig. 4 is a flow chart of compressed sensing video reconstruction for the method based on a recurrent convolutional neural network according to the present invention.
Specific embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system flow chart of the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention. It mainly comprises the compressed sensing network (CSNet), the CSNet algorithm structure, convolutional neural networks (CNN), the long short-term memory (LSTM) network, CSNet network training, and compressed sensing video reconstruction.
Fig. 2 shows the overall framework structure of the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention. Compressed video frames are obtained by compressed sensing. Reconstruction is performed by CSNet, which consists of a background CNN, object CNNs, and a synthesis LSTM. In every T frames, the first M frames and the remaining (T-M) frames are compressed at low CR and high CR respectively. The background CNN is pretrained first; then the remaining background CNN layers and the rest of the model are trained together.
Fig. 3 is a schematic diagram of CSNet network training for the method for compressed sensing video reconstruction based on a recurrent convolutional neural network according to the present invention. The training process is divided into two stages: figure (a) shows the pretraining of the background CNN, and figure (b) the joint training of the CNNs and the synthesis LSTM. In the first stage, the background CNN is pretrained and visual features are extracted from the K key frames, as shown in figure (a). In the second stage, to give the model more basic blocks extracted from the source for building objects, the (T-M) small CNNs are trained from scratch; these object CNNs and the pretrained background CNN are combined by a synthesis LSTM, and the three networks are trained together, as shown in figure (b). To reduce the number of parameters required for training, only the last several layers of the key-frame CNN are combined, so the input of these layers is a feature map rather than a measurement.
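The two-stage schedule can be mimicked on a toy problem. Here the "networks" are plain linear decoders standing in for the background and object CNNs; the encoders, sizes, learning rate, and data are all illustrative assumptions:

```python
import numpy as np

# Toy sketch of two-stage training: stage 1 pretrains a "background"
# decoder on key-frame (low-CR) measurements; stage 2 trains an "object"
# decoder from scratch on non-key (high-CR) measurements while the
# pretrained decoder continues to be fine-tuned jointly.
rng = np.random.default_rng(4)
n_pix, m_key, m_non = 64, 32, 8                 # block pixels, measurements
X = rng.standard_normal((256, n_pix))           # training image blocks
phi_key = rng.standard_normal((m_key, n_pix))   # low-CR Gaussian encoder
phi_non = rng.standard_normal((m_non, n_pix))   # high-CR Gaussian encoder
Y_key, Y_non = X @ phi_key.T, X @ phi_non.T
lr, steps = 2e-3, 200

def grad_and_loss(W, Y, X):
    # residual, a scaled gradient of the squared error for f(y) = W @ y,
    # and the mean squared reconstruction error
    R = Y @ W.T - X
    return R.T @ Y / len(X), np.mean(R ** 2)

# Stage 1: pretrain the background decoder on key-frame measurements.
W_bg = np.zeros((n_pix, m_key))
for _ in range(steps):
    g, l_key = grad_and_loss(W_bg, Y_key, X)
    W_bg -= lr * g

# Stage 2: train the object decoder from scratch, fine-tuning W_bg jointly.
W_obj = np.zeros((n_pix, m_non))
for _ in range(steps):
    g_bg, l_key = grad_and_loss(W_bg, Y_key, X)
    g_obj, l_non = grad_and_loss(W_obj, Y_non, X)
    W_bg -= lr * g_bg
    W_obj -= lr * g_obj

print(round(l_key, 3), round(l_non, 3))
```

As expected, the key-frame decoder ends with a lower residual than the non-key decoder, since low-CR measurements retain more of each block; this mirrors why CSNet leans on key frames for background and detail.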
Fig. 4 is a flow chart of compressed sensing video reconstruction for the method based on a recurrent convolutional neural network according to the present invention. The information-based current frame is built: a recurrent neural network (RNN) extracts motion features, a convolutional neural network (CNN) extracts visual features, the two kinds of extracted information are fused, an LSTM network aggregates all extracted features, and these are combined with the motion inferred from the hidden state to form the reconstruction.
For those skilled in the art, the present invention is not limited to the details of the above exemplary embodiments, and the present invention can be realized in other specific forms without departing from its spirit or scope. Moreover, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Claims (10)

1. A method for compressed sensing video reconstruction based on a recurrent convolutional neural network, characterised in that it mainly comprises: a compressed sensing network (CSNet) (1); a CSNet algorithm structure (2); convolutional neural networks (CNN) (3); a long short-term memory (LSTM) network (4); CSNet network training (5); and compressed sensing video reconstruction (6).
2. The compressed sensing network (CSNet) (1) according to claim 1, characterised in that the compressed sensing network (CSNet) is a deep neural network that can learn visual representations from random measurements; for compressed sensing video reconstruction it is an end-to-end trainable, non-iterative model that combines a convolutional neural network (CNN) and a recurrent neural network (RNN) so as to perform video reconstruction using spatiotemporal features; the network structure can accept random measurements at multiple compression ratios (CRs) and provides background information and object detail separately, achieving better reconstruction quality.
3. The recurrent neural network (RNN) according to claim 2, characterised in that, for video reconstruction applications, modelling the temporal process is very important; by building the information-based current frame, this information contains the extrapolated time-dependent relationship between the current frame and the patch; the recurrent neural network (RNN) applies temporal information to the reconstruction process and can be used to generate a more accurate model.
4. The CSNet algorithm structure (2) according to claim 1, characterised in that the structure comprises three modules: a random encoder for measurement, a CNN cluster for visual feature extraction, and an LSTM for temporal reconstruction; the random encoders run in parallel, encoding the first frame of the video with more measurements while encoding the residual frames with fewer measurements, and can accept measurements at multiple compression ratios (CRs); under this algorithm, key frames and non-key frames (the remaining frames, which mainly contribute motion information) are compressed separately; the recurrent neural network (RNN) infers motion information and combines it with the visual features extracted by the convolutional neural network (CNN) to synthesize high-quality frames; efficient information fusion achieves an optimal balance between fidelity and compression ratio (CR) for compressed sensing (CS) video applications.
5. The convolutional neural network (CNN) (3) according to claim 1, characterised in that the network performs compressed measurement and reconstruction of images, with temporal compression and spatial compression combined for maximum compression ratio; a larger CNN is designed to process key frames, because key frames carry high entropy information, while a smaller CNN is designed to process non-key frames; to reduce system latency and simplify the network structure, image blocks are used as input, so that all feature maps generated by the CNN have the same size as the image block and the number of feature maps decreases monotonically; the network's input is the m-dimensional vector formed by the compressed measurement, and a fully connected layer before the CNN uses these measurements to generate a two-dimensional feature map.
6. The temporal compression according to claim 5, characterised in that, to obtain a higher compression ratio (CR), each video patch containing T frames is divided into K key frames and (T-K) non-key frames; key frames are compressed at a low compression ratio (CR) and non-key frames at a high compression ratio (CR), so that the measurement information of the key frames can be reused to reconstruct the non-key frames; this can be regarded as temporal compression.
7. The long short-term memory (LSTM) network (4) according to claim 1, characterised in that it is used for temporal reconstruction; to obtain an end-to-end trained and computationally efficient model, no preprocessing is applied to the raw input, and an LSTM network extracts the motion features essential to reconstruction, thereby estimating the video's optical flow; the synthesis LSTM network performs motion extrapolation and the aggregation of spatial visual features and motion, to achieve video reconstruction.
8. The LSTM network training process according to claim 7, characterised in that, in the training process of the LSTM network, the first M inputs of the LSTM are the data extracted by the CNN that processes key frames, and the remaining (T-M) inputs are the outputs of the CNN that processes non-key frames; each LSTM unit receives the visual features of the key frames, and these visual features are used to reconstruct the background, recover the current frame of the object, and estimate the last several frames.
9. The CSNet network training (5) according to claim 1, characterised in that it is divided into two stages; in the first stage, the background CNN is pretrained and visual features are extracted from the K key frames; in the second stage, to give the model more basic blocks extracted from the source for building objects, the (T-M) smaller CNNs are trained from scratch; these object CNNs and the pretrained background CNN are combined by a synthesis LSTM, and the three networks are trained together; to reduce the number of parameters required for training, only the last several layers of the key-frame CNN are combined, so the input of these layers is a feature map rather than a measurement; the average Euclidean loss is used as the loss function, i.e.
L(W, b) = (1/(2N)) Σ_{i=1}^{T} ||f(y_i, W, b) - x_i||₂²
Here, W and b are the network weights and biases, and x_i and y_i are each image block and its CS measurement; a random Gaussian matrix is used for CS encoding.
10. The compressed sensing video reconstruction (6) according to claim 1, characterised in that the information-based current frame is built; a recurrent neural network (RNN) extracts motion features, a convolutional neural network (CNN) extracts visual features, and the two kinds of extracted information are fused; an LSTM network aggregates all extracted features and combines them with the motion inferred from the hidden state to form the reconstruction.
CN201710124135.2A 2017-03-03 2017-03-03 A method for compressed sensing video reconstruction based on a recurrent convolutional neural network Withdrawn CN106911930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710124135.2A CN106911930A (en) 2017-03-03 2017-03-03 A method for compressed sensing video reconstruction based on a recurrent convolutional neural network


Publications (1)

Publication Number Publication Date
CN106911930A (en) 2017-06-30

Family

ID=59186285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710124135.2A Withdrawn CN106911930A (en) 2017-03-03 2017-03-03 A method for compressed sensing video reconstruction based on a recurrent convolutional neural network

Country Status (1)

Country Link
CN (1) CN106911930A (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107154150A (en) * 2017-07-25 2017-09-12 北京航空航天大学 A kind of traffic flow forecasting method clustered based on road with double-layer double-direction LSTM
CN107392189A (en) * 2017-09-05 2017-11-24 百度在线网络技术(北京)有限公司 For the method and apparatus for the driving behavior for determining unmanned vehicle
CN107808122A (en) * 2017-09-30 2018-03-16 中国科学院长春光学精密机械与物理研究所 Method for tracking target and device
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108197566A (en) * 2017-12-29 2018-06-22 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108810651A (en) * 2018-05-09 2018-11-13 太原科技大学 Wireless video method of multicasting based on depth-compression sensing network
CN108923984A (en) * 2018-07-16 2018-11-30 西安电子科技大学 Space-time video compress cognitive method based on convolutional network
CN109003614A (en) * 2018-07-31 2018-12-14 上海爱优威软件开发有限公司 A kind of voice transmission method, voice-transmission system and terminal
CN109360436A (en) * 2018-11-02 2019-02-19 Oppo广东移动通信有限公司 A kind of video generation method, terminal and storage medium
CN109376856A (en) * 2017-08-09 2019-02-22 上海寒武纪信息科技有限公司 Data processing method and processing unit
CN109614896A (en) * 2018-10-29 2019-04-12 山东大学 A method of the video content semantic understanding based on recursive convolution neural network
CN109743571A (en) * 2018-12-26 2019-05-10 西安交通大学 A kind of image encoding method based on parallelly compressed perception multilayer residual error coefficient
CN109819256A (en) * 2019-03-06 2019-05-28 西安电子科技大学 Video compress cognitive method based on characteristic perception
CN110007366A (en) * 2019-03-04 2019-07-12 中国科学院深圳先进技术研究院 A kind of life searching method and system based on Multi-sensor Fusion
CN110046537A (en) * 2017-12-08 2019-07-23 辉达公司 The system and method for carrying out dynamic face analysis using recurrent neural network
CN110087092A (en) * 2019-03-11 2019-08-02 西安电子科技大学 Low bit-rate video decoding method based on image reconstruction convolutional neural networks
CN110516736A (en) * 2019-06-04 2019-11-29 沈阳瑞初科技有限公司 The visual multi-source heterogeneous data multilayer DRNN depth integration method of multidimensional
CN110738108A (en) * 2019-09-09 2020-01-31 北京地平线信息技术有限公司 Target object detection method, target object detection device, storage medium and electronic equipment
CN110784228A (en) * 2019-10-23 2020-02-11 武汉理工大学 Compression method of subway structure vibration signal based on LSTM model
CN110933429A (en) * 2019-11-13 2020-03-27 南京邮电大学 Video compression sensing and reconstruction method and device based on deep neural network
WO2020107877A1 (en) * 2018-11-29 2020-06-04 北京市商汤科技开发有限公司 Video compression processing method and apparatus, electronic device, and storage medium
CN111383245A (en) * 2018-12-29 2020-07-07 北京地平线机器人技术研发有限公司 Video detection method, video detection device and electronic equipment
CN111428751A (en) * 2020-02-24 2020-07-17 清华大学 Object detection method based on compressed sensing and convolutional network
CN111479982A (en) * 2017-11-15 2020-07-31 吉奥奎斯特系统公司 In-situ operating system with filter
CN112866697A (en) * 2020-12-31 2021-05-28 杭州海康威视数字技术股份有限公司 Video image coding and decoding method and device, electronic equipment and storage medium
CN113766313A (en) * 2019-02-26 2021-12-07 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN114339221A (en) * 2020-09-30 2022-04-12 脸萌有限公司 Convolutional neural network based filter for video coding and decoding
WO2022213992A1 (en) * 2021-04-09 2022-10-13 华为技术有限公司 Data processing method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104822063A (en) * 2015-04-16 2015-08-05 长沙理工大学 Compressed sensing video reconstruction method based on dictionary learning residual reconstruction
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN106157319A (en) * 2016-07-28 2016-11-23 哈尔滨工业大学 Saliency detection method based on region-level and pixel-level fusion with convolutional neural networks
CN106331433A (en) * 2016-08-25 2017-01-11 上海交通大学 Video denoising method based on deep recursive neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAI XU et al.: "CSVideoNet: A Recurrent Convolutional Neural Network for Compressive Sensing Video Reconstruction", published online at: https://arxiv.org/abs/1612.05203v1 *

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107154150A (en) * 2017-07-25 2017-09-12 北京航空航天大学 A traffic flow forecasting method based on road clustering and two-layer bidirectional LSTM
CN107154150B (en) * 2017-07-25 2019-07-02 北京航空航天大学 A traffic flow forecasting method based on road clustering and two-layer bidirectional LSTM
CN109376856A (en) * 2017-08-09 2019-02-22 上海寒武纪信息科技有限公司 Data processing method and processing device
CN107392189A (en) * 2017-09-05 2017-11-24 百度在线网络技术(北京)有限公司 Method and apparatus for determining driving behavior of an unmanned vehicle
CN107808122A (en) * 2017-09-30 2018-03-16 中国科学院长春光学精密机械与物理研究所 Target tracking method and device
CN111479982A (en) * 2017-11-15 2020-07-31 吉奥奎斯特系统公司 In-situ operating system with filter
CN110046537A (en) * 2017-12-08 2019-07-23 辉达公司 System and method for dynamic face analysis using a recurrent neural network
CN110046537B (en) * 2017-12-08 2023-12-29 辉达公司 System and method for dynamic facial analysis using recurrent neural networks
CN108197566B (en) * 2017-12-29 2022-03-25 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108197566A (en) * 2017-12-29 2018-06-22 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108322685B (en) * 2018-01-12 2020-09-25 广州华多网络科技有限公司 Video frame insertion method, storage medium and terminal
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A video content detection method based on optical flow and neural network
CN108810651B (en) * 2018-05-09 2020-11-03 太原科技大学 Wireless video multicast method based on deep compression sensing network
CN108810651A (en) * 2018-05-09 2018-11-13 太原科技大学 Wireless video multicast method based on deep compressed sensing network
CN108923984A (en) * 2018-07-16 2018-11-30 西安电子科技大学 Spatiotemporal video compressed sensing method based on convolutional network
CN108923984B (en) * 2018-07-16 2021-01-12 西安电子科技大学 Space-time video compressed sensing method based on convolutional network
CN109003614A (en) * 2018-07-31 2018-12-14 上海爱优威软件开发有限公司 A voice transmission method, voice transmission system and terminal
CN109614896A (en) * 2018-10-29 2019-04-12 山东大学 A video content semantic understanding method based on recurrent convolutional neural network
CN109360436A (en) * 2018-11-02 2019-02-19 Oppo广东移动通信有限公司 A video generation method, terminal and storage medium
US11290723B2 (en) * 2018-11-29 2022-03-29 Beijing Sensetime Technology Development Co., Ltd. Method for video compression processing, electronic device and storage medium
WO2020107877A1 (en) * 2018-11-29 2020-06-04 北京市商汤科技开发有限公司 Video compression processing method and apparatus, electronic device, and storage medium
CN109743571B (en) * 2018-12-26 2020-04-28 西安交通大学 Image coding method based on parallel compressed sensing multilayer residual error coefficients
CN109743571A (en) * 2018-12-26 2019-05-10 西安交通大学 An image coding method based on parallel compressed sensing multilayer residual coefficients
CN111383245A (en) * 2018-12-29 2020-07-07 北京地平线机器人技术研发有限公司 Video detection method, video detection device and electronic equipment
CN111383245B (en) * 2018-12-29 2023-09-22 北京地平线机器人技术研发有限公司 Video detection method, video detection device and electronic equipment
CN113766313A (en) * 2019-02-26 2021-12-07 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN113766313B (en) * 2019-02-26 2024-03-05 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN110007366A (en) * 2019-03-04 2019-07-12 中国科学院深圳先进技术研究院 A life search method and system based on multi-sensor fusion
CN109819256B (en) * 2019-03-06 2022-07-26 西安电子科技大学 Video compression sensing method based on feature sensing
CN109819256A (en) * 2019-03-06 2019-05-28 西安电子科技大学 Video compressed sensing method based on feature perception
CN110087092A (en) * 2019-03-11 2019-08-02 西安电子科技大学 Low bit-rate video decoding method based on image reconstruction convolutional neural networks
CN110516736A (en) * 2019-06-04 2019-11-29 沈阳瑞初科技有限公司 Multi-dimensional visual multi-source heterogeneous data multi-layer DRNN deep fusion method
CN110516736B (en) * 2019-06-04 2022-02-22 沈阳瑞初科技有限公司 Multi-dimensional visual multi-source heterogeneous data multi-layer DRNN depth fusion method
CN110738108A (en) * 2019-09-09 2020-01-31 北京地平线信息技术有限公司 Target object detection method, target object detection device, storage medium and electronic equipment
CN110784228B (en) * 2019-10-23 2023-07-25 武汉理工大学 Compression method of subway structure vibration signal based on LSTM model
CN110784228A (en) * 2019-10-23 2020-02-11 武汉理工大学 Compression method of subway structure vibration signal based on LSTM model
CN110933429B (en) * 2019-11-13 2021-11-12 南京邮电大学 Video compression sensing and reconstruction method and device based on deep neural network
CN110933429A (en) * 2019-11-13 2020-03-27 南京邮电大学 Video compression sensing and reconstruction method and device based on deep neural network
CN111428751B (en) * 2020-02-24 2022-12-23 清华大学 Object detection method based on compressed sensing and convolutional network
CN111428751A (en) * 2020-02-24 2020-07-17 清华大学 Object detection method based on compressed sensing and convolutional network
CN114339221A (en) * 2020-09-30 2022-04-12 脸萌有限公司 Convolutional neural network based filter for video coding and decoding
CN112866697A (en) * 2020-12-31 2021-05-28 杭州海康威视数字技术股份有限公司 Video image coding and decoding method and device, electronic equipment and storage medium
CN112866697B (en) * 2020-12-31 2022-04-05 杭州海康威视数字技术股份有限公司 Video image coding and decoding method and device, electronic equipment and storage medium
WO2022213992A1 (en) * 2021-04-09 2022-10-13 华为技术有限公司 Data processing method and apparatus

Similar Documents

Publication Publication Date Title
CN106911930A (en) A method for compressive sensing video reconstruction based on a recurrent convolutional neural network
CN107483920B (en) A panoramic video quality assessment method and system based on multi-layer quality factors
CN105069825B (en) Image super-resolution reconstruction method based on deep belief network
CN110634105B (en) Video high-space-time resolution signal processing method combining optical flow method and depth network
CN107197260A (en) Video coding post-filter method based on convolutional neural networks
CN110751597B (en) Video super-resolution method based on coding damage repair
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN112040222B (en) Visual saliency prediction method and equipment
CN112381866B (en) Attention mechanism-based video bit enhancement method
CN109214985A (en) Recursive dense residual network for image super-resolution reconstruction
CN109948721A (en) A video scene classification method based on video description
CN110490804A (en) A method for generating super-resolution images based on generative adversarial networks
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
CN108111860A (en) Lost-frame prediction and restoration method for video sequences based on deep residual networks
CN111462208A (en) Unsupervised depth prediction method based on binocular disparity and epipolar constraints
CN109949217A (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
Löhdefink et al. GAN-vs. JPEG2000 image compression for distributed automotive perception: Higher peak SNR does not mean better semantic segmentation
Löhdefink et al. On low-bitrate image compression for distributed automotive perception: Higher peak snr does not mean better semantic segmentation
CN113132727B (en) Scalable machine vision coding method and training method of motion-guided image generation network
CN109583334A (en) An action recognition method and system based on spatiotemporal correlation neural networks
CN104408697A (en) Image super-resolution reconstruction method based on genetic algorithm and regularized prior model
CN113689382A (en) Postoperative tumor survival prediction method and system based on medical images and pathological images
CN115239857B (en) Image generation method and electronic device
Kumawat et al. Action recognition from a single coded image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170630