CN107197260A - Video coding post-filter method based on convolutional neural networks - Google Patents
- Publication number
- CN107197260A CN107197260A CN201710439132.8A CN201710439132A CN107197260A CN 107197260 A CN107197260 A CN 107197260A CN 201710439132 A CN201710439132 A CN 201710439132A CN 107197260 A CN107197260 A CN 107197260A
- Authority
- CN
- China
- Prior art keywords
- frame
- video
- neural networks
- convolutional neural
- post
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/117—Filters, e.g. for pre-processing or post-processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
Abstract
A video coding post-filtering method based on convolutional neural networks comprises a convolutional-neural-network model training step and a filtering step. The training step includes: setting the quantization parameter of video compression to values from 20 to 51 and encoding and compressing original videos to obtain compressed videos; extracting frames from all videos to obtain multiple frame pairs, each consisting of a compressed video frame and the corresponding original video frame; dividing the extracted frame pairs into multiple groups according to frame type and quantization parameter; and building a convolutional neural network architecture, initializing the network parameters, and training a network on each group, yielding multiple neural network models corresponding to the different quantization parameters and frame types. The filtering step includes: embedding the obtained neural network models in the post-filtering stage of a video encoder; performing the foregoing encoding compression and frame extraction on a pending original video to obtain pending frame pairs; and selecting, according to the quantization parameter and frame type of each pending frame pair, the corresponding neural network model to perform the filtering.
Description
Technical field
The present invention relates to the fields of computer vision and video coding, and in particular to a video coding post-filtering method based on convolutional neural networks.
Background art
With the development of science and technology and the proliferation of video display devices, video has become an indispensable part of people's lives and plays a highly important role in many fields. The past decades have witnessed great advances in video resolution and display screens, but ultra-high-resolution video produces a huge volume of data, which places a heavy burden on the network bandwidth. Efficient video coding and transmission technology is therefore needed to guarantee the user's viewing experience while reducing the data volume of the video as much as possible and lightening the load on the network. In view of this, researchers have studied efficient video coding methods continuously for decades. Video coding technology reduces the data volume of video mainly by removing redundancy, so that massive video data can be stored and transmitted effectively; the aim is to compress video at a lower bit rate while preserving the original video quality as far as possible.
However, current video coding standards are mainly based on block-wise hybrid video coding technology. In such encoders, block-based intra/inter prediction, transform and coarse quantization all cause a decline in video quality, especially at low bit rates. Reducing the distortion introduced by video coding has therefore become one of the research hotspots in the video coding field. Although current video coding standards adopt some algorithms to reduce blocking artifacts and improve subjective quality, their effect is far from ideal: the processed video still exhibits obvious blocking artifacts and edge blur, and the loss of detail remains serious.
Summary of the invention
The main object of the present invention is to exploit the strong ability of convolutional neural networks to fit nonlinear transformations and to propose a video coding post-filtering method based on convolutional neural networks. The method establishes a mapping from lossy video frames to lossless video frames, thereby approximating the inverse of the distortion process in video coding and achieving the goal of reducing distortion.
The technical scheme provided by the present invention to achieve the above object is as follows:
A video coding post-filtering method based on convolutional neural networks includes a convolutional-neural-network model training step and a post-filtering processing step, wherein:
The training step includes S1 to S4:
S1: set the quantization parameter of video compression to values from 20 to 51, encode and compress the original video, and obtain the compressed video;
S2: extract frames from the compressed video and the original video to obtain multiple frame pairs, each frame pair comprising one compressed video frame and one original video frame;
S3: divide the frame pairs extracted in step S2 into multiple groups according to frame type and quantization parameter;
S4: build the convolutional neural network architecture, initialize the network parameters, and train a neural network on each group divided in step S3, obtaining multiple neural network models corresponding to the different quantization parameters and frame types.
The post-filtering processing step includes S5 and S6:
S5: embed the multiple neural network models obtained in step S4 in the post-filtering stage of the video encoder;
S6: perform steps S1 and S2 on the pending original video to obtain pending frame pairs, and select, according to the quantization parameter and frame type of each pending frame pair, the corresponding neural network model to perform the filtering.
Differences in quantization parameter and frame type lead to different distortion characteristics. In both training and actual processing, the present invention extracts frame pairs composed of original-video and compressed-video frames and thereby establishes the mapping between lossy and lossless video frames. The neural network models trained for the different quantization parameters and frame types are embedded in the post-filtering stage of the encoder, fully taking into account the influence of quantization parameter and frame type on the degree of distortion. The method of the present invention first determines the quantization parameter and frame type of a frame and then selects the neural network model corresponding to both for filtering, thereby effectively suppressing distortion.
Brief description of the drawings
Fig. 1 is a flowchart of the convolutional-neural-network-based video coding post-filtering method provided by the present invention;
Fig. 2 is a diagram of the convolutional neural network architecture;
Fig. 3 is a schematic diagram of the training process of the convolutional neural network.
Embodiment
The invention is further described below in conjunction with the accompanying drawings and specific embodiments.
The present invention exploits the powerful nonlinear fitting ability of convolutional neural networks, combines it with the characteristics of video coding, and fully analyzes the causes of distortion in the video coding and compression process. It proposes a video coding post-filtering method that is based on convolutional neural networks and conditioned on the quantization parameter and frame type: a mapping between lossy and lossless video frames is established, and convolutional neural network models are trained for the different quantization parameters (QP for short) and frame types. The post-filtering stage of the encoder then filters each frame with the model matching its quantization parameter and frame type, which effectively suppresses distortion.
In the hybrid video coding framework, the main frame types are I frames, P frames and B frames, wherein:
I frames use intra prediction and are also called intra-coded frames (intra pictures). An I frame is typically the start frame of a group of pictures (GOP); its information does not depend on preceding or following frames, so it can serve as a random access point during playback. With intra prediction, the decoder predicts the current block from the adjacent decoded blocks to its left and above, and then reconstructs the original image information by adding the residual data, i.e., the difference between the actual and predicted values. The encoder must use the same intra prediction. Because the predicted values are similar to the original image data, the energy of the residual data to be transmitted is very low, which achieves compression of the bit stream.
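The decoder-side reconstruction described above reduces to adding the transmitted residual back onto the intra prediction. A minimal sketch (the block values are hypothetical, not from the patent):

```python
def reconstruct_block(predicted, residual):
    """Reconstruct a decoded block: reconstructed value = predicted value + residual."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, residual)]

predicted = [[100, 102], [101, 103]]   # intra prediction from neighboring blocks
residual  = [[-2, 1], [0, -1]]         # small residual -> few bits to transmit
print(reconstruct_block(predicted, residual))  # -> [[98, 103], [101, 102]]
```

Because the prediction is close to the original, the residual is small and cheap to code, which is the compression gain described above.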
P frames and B frames use inter prediction. A P frame is a forward-predicted frame (predictive frame); it removes temporal redundancy with reference to a preceding I frame or P frame, achieving efficient compression. A B frame is a bi-directionally predicted interpolation frame (bi-directional interpolated prediction frame); it references both preceding and following video frames and removes redundancy by exploiting the similarity between those frames and the B frame. Inter prediction relies mainly on the high similarity between adjacent frames: the current frame is reconstructed from the information of known frames, commonly through motion estimation and motion compensation. Motion estimation searches for the best-matching macroblock: for the current macroblock, the most similar block (the matching block) is first found in the previous frame, and their difference is computed; since this difference contains many zero values, encoding it with a flow similar to intra coding saves a considerable number of bits.
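The block-matching motion estimation described above can be sketched as an exhaustive search minimizing the sum of absolute differences (SAD). The frames and block size below are illustrative assumptions; a real encoder would restrict the search to a window around the block rather than scan the whole reference frame.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def best_match(ref_frame, cur_block, block_size):
    """Exhaustively search ref_frame for the block most similar to cur_block."""
    h, w = len(ref_frame), len(ref_frame[0])
    best = None
    for y in range(0, h - block_size + 1):
        for x in range(0, w - block_size + 1):
            cand = [row[x:x + block_size] for row in ref_frame[y:y + block_size]]
            cost = sad(cand, cur_block)
            if best is None or cost < best[0]:
                best = (cost, (y, x))
    return best  # (minimum SAD, offset of the matching block)

ref = [[0, 0, 0, 0],
       [0, 5, 6, 0],
       [0, 7, 8, 0],
       [0, 0, 0, 0]]
cur = [[5, 6], [7, 8]]
print(best_match(ref, cur, 2))  # -> (0, (1, 1)): perfect match at offset (1, 1)
```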
In the hybrid video coding framework, video frames of different types are coded differently, and precisely for this reason their distortion characteristics also differ. Because I frames are relatively independent, their distortion stems mainly from the quantization process and the DCT (discrete cosine transform) process. Quantization essentially maps many values to a few values; conversely, de-quantization recovers many values from a few, and this recovery inevitably introduces error, so the video frame exhibits high-frequency noise caused by quantization loss. P frames and B frames are produced mainly by inter prediction, and the source of this prediction traces back to I frames, so the distortion inherent in I frames also propagates to P and B frames. In addition, motion estimation and compensation introduce their own sources of distortion. For example, in the motion-compensation stage of inter prediction, two adjacent prediction blocks of the current frame may not be predicted from two adjacent blocks of the same reference frame, but possibly from blocks in different frames. In that case the edges of the two referenced blocks have no inherent continuity, which also leads to a certain blocking artifact and hence to distortion.
The basic principle of video coding is to exploit the correlation within a video sequence to compress away redundancy and thus realize efficient transmission. This correlation divides broadly into spatial and temporal correlation: spatial correlation is mainly reflected in the correlation between neighboring pixels within the same video frame, while temporal correlation is mainly reflected in the high similarity between temporally adjacent frames. Therefore, in the hybrid video coding framework, every video frame is generated either by intra prediction or by inter prediction. Since the generation mechanism differs across frame types, the frame type must also be taken into account in a convolutional-neural-network-based video quality enhancement algorithm in order to obtain a better effect.
Based on the foregoing principles, the specific embodiment of the invention proposes a video coding post-filtering method based on convolutional neural networks. The method includes a training step and a post-filtering processing step for the convolutional-neural-network models, as shown in Fig. 1:
The training step includes S1 to S4:
S1: set the quantization parameter of video compression to values from 20 to 51, encode and compress the original video, and obtain the compressed video;
S2: extract frames from the compressed video and the original video to obtain multiple frame pairs, each frame pair comprising one compressed video frame and one original video frame;
S3: divide the frame pairs extracted in step S2 into multiple groups according to frame type and quantization parameter;
S4: build the convolutional neural network architecture, initialize the network parameters, and train a neural network on each group divided in step S3, obtaining multiple neural network models corresponding to the different quantization parameters and frame types.
The post-filtering processing step includes S5 and S6:
S5: embed the multiple neural network models obtained in step S4 in the post-filtering stage of the video encoder;
S6: perform steps S1 and S2 on the pending original video to obtain pending frame pairs, and select, according to the quantization parameter and frame type of each pending frame pair, the corresponding neural network model to perform the filtering.
The method of the present invention first trains the neural networks and then performs the filtering processing with the trained models. The convolutional-neural-network-based video coding post-filtering method provided by the present invention is illustrated below with a specific embodiment.
Original videos are chosen for training and compressed by encoding, with the quantization parameter QP set to 20 through 51 (i.e., the consecutive integers 20, 21, 22, 23, 24, ..., 50, 51), yielding compressed videos under the different QPs. Preferably, the standard JM10 and/or HM12 video coding reference software can be used to encode and compress the original videos. Frames are extracted from each original video and its corresponding compressed video, yielding many frame pairs; a frame pair can be expressed as "original video frame - compressed video frame", i.e., each frame pair contains an original video frame and the corresponding compressed video frame. All frame pairs are then divided into multiple groups according to frame type and quantization parameter QP. A specific partitioning is, for example: first divide all frame pairs by QP (32 values), then divide the frame pairs under each QP into I frames, P frames and B frames by frame type, yielding the multiple groups (in this example, 32 x 3 = 96 groups); the frame pairs within each group share the same frame type and quantization parameter QP.
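The grouping step above can be sketched as a dictionary keyed by (QP, frame type), one bucket per future network model. The file names and records below are hypothetical placeholders:

```python
from collections import defaultdict

# Hypothetical frame-pair records: each carries its QP, frame type,
# and the (original frame, compressed frame) pair.
frame_pairs = [
    {"qp": 20, "frame_type": "I", "pair": ("orig_0.png", "comp_0.png")},
    {"qp": 20, "frame_type": "P", "pair": ("orig_1.png", "comp_1.png")},
    {"qp": 21, "frame_type": "I", "pair": ("orig_2.png", "comp_2.png")},
]

# Bucket frame pairs so that every group shares one QP and one frame type.
groups = defaultdict(list)
for fp in frame_pairs:
    groups[(fp["qp"], fp["frame_type"])].append(fp["pair"])

print(sorted(groups))  # -> [(20, 'I'), (20, 'P'), (21, 'I')]
```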
The architecture of the neural network to be trained includes convolutional layers and ReLU layers. In a specific embodiment, the constructed architecture is shown in Fig. 2 and includes 3 convolutional layers (Convolution layer1, Convolution layer2, Convolution layer3 in the figure) and 2 ReLU layers (ReLU layer1 and ReLU layer2 in the figure). Convolutional and ReLU layers are connected alternately: the input layer feeds convolutional layer 1 (Convolution layer1), whose output feeds ReLU layer 1 (ReLU layer1); the output of ReLU layer 1 feeds convolutional layer 2 (Convolution layer2), whose output feeds ReLU layer 2 (ReLU layer2); the output of ReLU layer 2 feeds convolutional layer 3 (Convolution layer3), whose output feeds the output layer. It should be appreciated that the convolutional neural network shown in Fig. 2 is merely exemplary and does not limit the invention. For the network of Fig. 2, the filter kernel sizes (the filter kernels are convolution kernels) of convolutional layers 1, 2 and 3 are set to 9, 1 and 5 respectively, and their numbers of neurons to 64, 32 and 1 respectively; the number of neurons of each ReLU layer matches that of the preceding convolutional layer to which it is connected.
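Under the stated layer settings (kernel sizes 9, 1 and 5; 64, 32 and 1 feature maps; no padding, as stated later in the text), a forward pass of the Fig. 2 network can be sketched in plain NumPy. This is an illustrative reimplementation, not the patent's code, and the weights are random stand-ins:

```python
import numpy as np

def conv2d(x, w, b):
    """'Valid' (unpadded) convolution: x (C_in,H,W), w (C_out,C_in,k,k), b (C_out,)."""
    c_out, c_in, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.empty((c_out, H, W))
    for o in range(c_out):
        acc = np.zeros((H, W))
        for i in range(c_in):           # sum over input channels
            for dy in range(k):
                for dx in range(k):     # slide the k x k kernel
                    acc += w[o, i, dy, dx] * x[i, dy:dy + H, dx:dx + W]
        y[o] = acc + b[o]
    return y

def relu(x):
    return np.maximum(x, 0)             # the M = max(N, 0) operation

rng = np.random.default_rng(0)
def layer(c_out, c_in, k):
    return rng.normal(0, 0.01, (c_out, c_in, k, k)), np.zeros(c_out)

w1, b1 = layer(64, 1, 9)   # conv1: 9x9 kernels, 64 feature maps
w2, b2 = layer(32, 64, 1)  # conv2: 1x1 kernels, 32 feature maps
w3, b3 = layer(1, 32, 5)   # conv3: 5x5 kernels, 1 output map

patch = rng.random((1, 33, 33))         # a 33x33 compressed-frame patch (hypothetical)
out = conv2d(relu(conv2d(relu(conv2d(patch, w1, b1)), w2, b2)), w3, b3)
print(out.shape)  # -> (1, 21, 21): 33 - (9-1) - (1-1) - (5-1) = 21
```

The shrinking output (33 to 21 pixels) is exactly why the original block must later be cropped to the output size before the loss is computed.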
Then the groups of frame pairs obtained above are used to separately train networks of the architecture built above, e.g., that shown in Fig. 2. First, the frame pairs in each group are divided into a training set, a validation set and a test set according to resolution and scene; for example, all frame pairs with QP 20 and frame type I are divided into a training set, validation set and test set, and likewise all frame pairs with QP 20 and frame type P, where the frame pairs in the test set cover a variety of resolutions and scenes. Second, image blocks are cut from all frame pairs at a certain pixel stride, yielding image-block pairs of the form "original-frame block - compressed-frame block" that serve as the network input during training.
The training process is illustrated with the group of frame pairs whose QP is 20 and whose type is I. For example, image blocks are cut from the frames of all frame pairs in the group at a stride of 28 pixels, each block of size 33 x 33. During training, none of the convolutional layers uses padding, to avoid boundary effects (padding extrapolates the pixels at the image border and thereby enlarges the image, so the information near the boundary is inaccurate). In order to compute the loss between the output image block and the original image block, the present invention also crops the original image block to the same size as the output image block. Then all of the cut image blocks are randomly divided into multiple batches, and the blocks of one batch are fed into the neural network in turn; completing one batch is one iteration, and as many iterations are performed as there are batches. In each iteration, back-propagation with stochastic gradient descent is used to update the network parameters and continually minimize the loss function. After every predetermined number of iterations, the validation set is used to check the trained parameters in order to prevent over-fitting.
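The patch-cutting step above (aligned 33 x 33 blocks at a 28-pixel stride) can be sketched as follows; the frame dimensions are hypothetical:

```python
import numpy as np

def extract_pairs(orig, comp, size=33, stride=28):
    """Cut aligned (original block, compressed block) pairs from a frame pair."""
    pairs = []
    h, w = orig.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            pairs.append((orig[y:y + size, x:x + size],
                          comp[y:y + size, x:x + size]))
    return pairs

orig = np.zeros((120, 176))   # a small hypothetical original frame
comp = np.zeros((120, 176))   # the corresponding compressed frame
pairs = extract_pairs(orig, comp)
print(len(pairs))  # -> 24 block pairs (4 rows x 6 columns of blocks)
```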
The specific training process is shown in Fig. 3. The operation of a neuron of a convolutional layer on the input image blocks is expressed as F = sum_{i=1}^{n} (w_i * x_i) + b, where n denotes the number of input image blocks, w_i denotes the i-th convolution kernel, x_i denotes the i-th input image block, b denotes the bias coefficient, * denotes the convolution operation, and an image block is represented by its pixel matrix. The operation of a neuron of a ReLU layer on an image block is expressed as M = max(N, 0), where N and M denote the pixel values of the input and output image blocks respectively.
When the parameters converge, training is complete, and the neural network model corresponding to the group of frame pairs with QP 20 and type I is obtained; it is embedded in the encoder's post-filtering stage for filtering. Likewise, the group of frame pairs with QP 20 and type P is used to train a neural network by the foregoing method, yielding the neural network model corresponding to that group, and the frame pairs under the other QP values are used to train neural networks by the same method.
Suppose the compressed videos used in the training process were produced by both the JM10 and HM12 software; then the compressed videos from the two pieces of software should be separated and trained independently. The above training method then yields the following neural network models: 32 different I-frame models corresponding to QP 20 to 51 under JM10 encoding, 32 different P-frame models corresponding to QP 20 to 51 under JM10 encoding, 32 different B-frame models corresponding to QP 20 to 51 under JM10 encoding, 32 different I-frame models corresponding to QP 20 to 51 under HM12 encoding, 32 different P-frame models corresponding to QP 20 to 51 under HM12 encoding, and 32 different B-frame models corresponding to QP 20 to 51 under HM12 encoding.
In actual use, after a pending original video has been encoded, compressed and frame-extracted by the encoder, the quantization parameter and frame type of each obtained frame pair are determined. For example, if the currently pending frame pair is found to have QP 23 and type B, the neural network model trained on image blocks with QP 23 and type B is selected for the post-filtering processing. In a preferred embodiment, the encoding software can also be taken into account, i.e., training is additionally separated by encoding software (as described in the preceding paragraph), so that in actual use it is first determined which encoding software produced the compressed encoding, and then the quantization parameter and frame type are considered in order to select the corresponding neural network model for the filtering processing.
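The model-selection logic at filtering time can be sketched as a lookup table keyed by (encoding software, QP, frame type). The model names below are placeholders, and the 192-model count follows from the enumeration above (2 codecs x 32 QPs x 3 frame types):

```python
# Build one entry per trained model; names are illustrative stand-ins.
models = {(sw, qp, t): f"model_{sw}_{t}_{qp}"
          for sw in ("JM10", "HM12")
          for qp in range(20, 52)        # QP 20 .. 51 inclusive
          for t in ("I", "P", "B")}

def select_model(software, qp, frame_type):
    """Pick the post-filter network for a frame's codec, QP and frame type."""
    return models[(software, qp, frame_type)]

print(len(models))                    # -> 192
print(select_model("HM12", 23, "B"))  # -> model_HM12_B_23
```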
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the specific implementation of the present invention is not to be regarded as confined to these descriptions. For those skilled in the art, several equivalent substitutions or obvious modifications with identical performance or use can be made without departing from the concept of the present invention, and all of these should be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A video coding post-filtering method based on convolutional neural networks, comprising a convolutional-neural-network model training step and a post-filtering processing step, wherein:
the training step comprises S1 to S4:
S1: setting the quantization parameter of video compression to values from 20 to 51, encoding and compressing an original video, and obtaining a compressed video;
S2: extracting frames from the compressed video and the original video to obtain multiple frame pairs, each frame pair comprising one compressed video frame and one original video frame;
S3: dividing the frame pairs extracted in step S2 into multiple groups according to frame type and quantization parameter;
S4: building a convolutional neural network architecture, initializing network parameters, and training a neural network on each group divided in step S3, obtaining multiple neural network models corresponding to the different quantization parameters and frame types;
the post-filtering processing step comprises S5 and S6:
S5: embedding the multiple neural network models obtained in step S4 in a post-filtering stage of a video encoder;
S6: performing steps S1 and S2 on a pending original video to obtain pending frame pairs, and selecting, according to the quantization parameter and frame type of each pending frame pair, the corresponding neural network model to perform the filtering.
2. The video coding post-filtering method based on convolutional neural networks as claimed in claim 1, wherein in step S3 the frame pairs are divided into I frames, P frames and B frames according to frame type.
3. The video coding post-filtering method based on convolutional neural networks as claimed in claim 1, wherein step S3 further comprises dividing the frame pairs in each group into a training set, a validation set and a test set according to resolution and scene; wherein the validation set is used during training to check the trained parameters after every predetermined number of iterations, in order to prevent over-fitting.
4. The video coding post-filtering method based on convolutional neural networks as claimed in claim 3, wherein the test set includes frame pairs of a variety of resolutions and scenes.
5. The video coding post-filtering method based on convolutional neural networks as claimed in claim 1, wherein when step S4 performs the neural network training, image blocks are cut from the frame pairs at a predetermined pixel stride, and the resulting image-block pairs serve as the input of the neural network during training; wherein each image-block pair comprises one original-frame image block and one compressed-frame image block.
6. The video coding post-filtering method based on convolutional neural networks as claimed in claim 5, wherein the architecture of the convolutional neural network built in step S4 comprises convolutional layers and ReLU layers, wherein:
the operation of a neuron of a convolutional layer on the input image blocks is expressed as F = sum_{i=1}^{n} (w_i * x_i) + b, where n denotes the number of input image blocks, w_i denotes the i-th convolution kernel, x_i denotes the i-th input image block, b denotes the bias coefficient, * denotes the convolution operation, and an image block is represented by its pixel matrix;
the operation of a neuron of a ReLU layer on an image block is expressed as M = max(N, 0), where N and M denote the pixel values of the input and output image blocks respectively.
7. The video coding post-filtering method based on convolutional neural networks as claimed in claim 6, wherein the convolutional layers and ReLU layers are connected alternately.
8. The video coding post-filtering method based on convolutional neural networks as claimed in claim 5, wherein during the neural network training the input original image block is cropped to the same size as the output image block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710439132.8A CN107197260B (en) | 2017-06-12 | 2017-06-12 | Video coding post-filter method based on convolutional neural networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107197260A true CN107197260A (en) | 2017-09-22 |
CN107197260B CN107197260B (en) | 2019-09-13 |
Family
ID=59876636
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710439132.8A Active CN107197260B (en) | 2017-06-12 | 2017-06-12 | Video coding post-filter method based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107197260B (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107743235A (en) * | 2017-10-27 | 2018-02-27 | 厦门美图之家科技有限公司 | Image processing method, device and electronic equipment |
CN108134932A (en) * | 2018-01-11 | 2018-06-08 | 上海交通大学 | Filter achieving method and system in coding and decoding video loop based on convolutional neural networks |
CN108174225A (en) * | 2018-01-11 | 2018-06-15 | 上海交通大学 | Filter achieving method and system in coding and decoding video loop based on confrontation generation network |
CN108289224A (en) * | 2017-12-12 | 2018-07-17 | 北京大学 | A kind of video frame prediction technique, device and neural network is compensated automatically |
CN108307193A (en) * | 2018-02-08 | 2018-07-20 | 北京航空航天大学 | A kind of the multiframe quality enhancement method and device of lossy compression video |
CN109451308A (en) * | 2018-11-29 | 2019-03-08 | 北京市商汤科技开发有限公司 | Video compression method and device, electronic equipment and storage medium |
WO2019141193A1 (en) * | 2018-01-19 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Method and apparatus for processing video frame data |
WO2019141255A1 (en) * | 2018-01-18 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Image filtering method and device |
WO2019141258A1 (en) * | 2018-01-18 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Video encoding method, video decoding method, device, and system |
CN110351568A (en) * | 2019-06-13 | 2019-10-18 | 天津大学 | A kind of filtering video loop device based on depth convolutional network |
WO2019205871A1 (en) * | 2018-04-25 | 2019-10-31 | 杭州海康威视数字技术股份有限公司 | Image decoding and encoding methods and apparatuses, and device thereof |
CN110740319A (en) * | 2019-10-30 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Video encoding and decoding method and device, electronic equipment and storage medium |
CN110971915A (en) * | 2018-09-28 | 2020-04-07 | 杭州海康威视数字技术股份有限公司 | Filtering method and device |
CN111083482A (en) * | 2019-12-31 | 2020-04-28 | 合肥图鸭信息科技有限公司 | Video compression network training method and device and terminal equipment |
CN111107357A (en) * | 2018-10-25 | 2020-05-05 | 杭州海康威视数字技术股份有限公司 | Image processing method, device and system |
CN111212288A (en) * | 2020-01-09 | 2020-05-29 | 广州虎牙科技有限公司 | Video data encoding and decoding method and device, computer equipment and storage medium |
CN111556337A (en) * | 2020-05-15 | 2020-08-18 | 腾讯科技(深圳)有限公司 | Media content implantation method, model training method and related device |
CN111699686A (en) * | 2018-01-26 | 2020-09-22 | 联发科技股份有限公司 | Method and apparatus of grouped neural networks for video coding |
CN111726613A (en) * | 2020-06-30 | 2020-09-29 | 福州大学 | Video coding optimization method based on just noticeable difference |
CN111742553A (en) * | 2017-12-14 | 2020-10-02 | 交互数字Vc控股公司 | Deep learning based image partitioning for video compression |
CN111937392A (en) * | 2018-04-17 | 2020-11-13 | 联发科技股份有限公司 | Neural network method and device for video coding and decoding |
CN112019843A (en) * | 2019-05-30 | 2020-12-01 | 富士通株式会社 | Encoding and decoding program, encoding and decoding device, encoding and decoding method |
CN112470472A (en) * | 2018-06-11 | 2021-03-09 | 无锡安科迪智能技术有限公司 | Blind compression sampling method and device and imaging system |
WO2021064291A1 (en) * | 2019-10-02 | 2021-04-08 | Nokia Technologies Oy | Guiding decoder-side optimization of neural network filter |
WO2021109846A1 (en) * | 2019-12-06 | 2021-06-10 | 华为技术有限公司 | Bit stream data processing method and apparatus |
CN113259671A (en) * | 2020-02-10 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Loop filtering method, device and equipment in video coding and decoding and storage medium |
CN113542867A (en) * | 2020-04-20 | 2021-10-22 | 声音猎手公司 | Content filtering in a media playback device |
CN114125449A (en) * | 2021-10-26 | 2022-03-01 | 阿里巴巴新加坡控股有限公司 | Video processing method, system and computer readable medium based on neural network |
CN115460463A (en) * | 2018-01-04 | 2022-12-09 | 三星电子株式会社 | Video playing device and control method thereof |
WO2023116173A1 (en) * | 2021-12-21 | 2023-06-29 | 腾讯科技(深圳)有限公司 | Data processing method and apparatus, and computer device and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114501031B (en) * | 2020-11-13 | 2023-06-02 | 华为技术有限公司 | Compression coding and decompression method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6075884A (en) * | 1996-03-29 | 2000-06-13 | Sarnoff Corporation | Method and apparatus for training a neural network to learn and use fidelity metric as a control mechanism |
US7822698B1 (en) * | 2007-03-23 | 2010-10-26 | Hrl Laboratories, Llc | Spike domain and pulse domain non-linear processors |
CN105160678A (en) * | 2015-09-02 | 2015-12-16 | 山东大学 | Convolutional-neural-network-based reference-free three-dimensional image quality evaluation method |
CN105894046A (en) * | 2016-06-16 | 2016-08-24 | 北京市商汤科技开发有限公司 | Convolutional neural network training and image processing method and system and computer equipment |
CN106485661A (en) * | 2016-11-15 | 2017-03-08 | 杭州当虹科技有限公司 | A kind of high-quality image magnification method |
CN106682697A (en) * | 2016-12-29 | 2017-05-17 | 华中科技大学 | End-to-end object detection method based on convolutional neural network |
- 2017-06-12: Chinese application CN201710439132.8A filed; granted as patent CN107197260B (legal status: Active)
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107743235A (en) * | 2017-10-27 | 2018-02-27 | 厦门美图之家科技有限公司 | Image processing method, device and electronic equipment |
CN108289224B (en) * | 2017-12-12 | 2019-10-29 | 北京大学 | Video frame prediction method, device, and automatic compensation neural network |
CN108289224A (en) * | 2017-12-12 | 2018-07-17 | 北京大学 | Video frame prediction method, device, and automatic compensation neural network |
CN111742553A (en) * | 2017-12-14 | 2020-10-02 | 交互数字Vc控股公司 | Deep learning based image partitioning for video compression |
CN115460463A (en) * | 2018-01-04 | 2022-12-09 | 三星电子株式会社 | Video playing device and control method thereof |
CN108174225B (en) * | 2018-01-11 | 2021-03-26 | 上海交通大学 | Video encoding and decoding in-loop filtering implementation method and system based on generative adversarial network |
CN108134932B (en) * | 2018-01-11 | 2021-03-30 | 上海交通大学 | Video encoding and decoding in-loop filtering implementation method and system based on convolutional neural network |
CN108174225A (en) * | 2018-01-11 | 2018-06-15 | 上海交通大学 | Video encoding and decoding in-loop filtering implementation method and system based on generative adversarial network |
CN108134932A (en) * | 2018-01-11 | 2018-06-08 | 上海交通大学 | Video encoding and decoding in-loop filtering implementation method and system based on convolutional neural network |
WO2019141255A1 (en) * | 2018-01-18 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Image filtering method and device |
WO2019141258A1 (en) * | 2018-01-18 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Video encoding method, video decoding method, device, and system |
WO2019141193A1 (en) * | 2018-01-19 | 2019-07-25 | 杭州海康威视数字技术股份有限公司 | Method and apparatus for processing video frame data |
CN110062246A (en) * | 2018-01-19 | 2019-07-26 | 杭州海康威视数字技术股份有限公司 | Method and apparatus for processing video frame data |
CN111699686A (en) * | 2018-01-26 | 2020-09-22 | 联发科技股份有限公司 | Method and apparatus of grouped neural networks for video coding |
WO2019154152A1 (en) * | 2018-02-08 | 2019-08-15 | 北京航空航天大学 | Multi-frame quality enhancement method and device for lossy compressed video |
CN108307193A (en) * | 2018-02-08 | 2018-07-20 | 北京航空航天大学 | Multi-frame quality enhancement method and device for lossy compressed video |
CN108307193B (en) * | 2018-02-08 | 2018-12-18 | 北京航空航天大学 | Multi-frame quality enhancement method and device for lossy compressed video |
US10965959B2 (en) | 2018-02-08 | 2021-03-30 | Beihang University | Multi-frame quality enhancement for compressed video |
CN111937392A (en) * | 2018-04-17 | 2020-11-13 | 联发科技股份有限公司 | Neural network method and device for video coding and decoding |
CN110401836B (en) * | 2018-04-25 | 2022-04-26 | 杭州海康威视数字技术股份有限公司 | Image decoding and encoding method, device and equipment |
WO2019205871A1 (en) * | 2018-04-25 | 2019-10-31 | 杭州海康威视数字技术股份有限公司 | Image decoding and encoding methods and apparatuses, and device thereof |
CN110401836A (en) * | 2018-04-25 | 2019-11-01 | 杭州海康威视数字技术股份有限公司 | Image decoding and encoding method, device and equipment |
CN112470472A (en) * | 2018-06-11 | 2021-03-09 | 无锡安科迪智能技术有限公司 | Blind compression sampling method and device and imaging system |
CN110971915A (en) * | 2018-09-28 | 2020-04-07 | 杭州海康威视数字技术股份有限公司 | Filtering method and device |
CN111107357B (en) * | 2018-10-25 | 2022-05-31 | 杭州海康威视数字技术股份有限公司 | Image processing method, device, system and storage medium |
CN111107357A (en) * | 2018-10-25 | 2020-05-05 | 杭州海康威视数字技术股份有限公司 | Image processing method, device and system |
CN109451308A (en) * | 2018-11-29 | 2019-03-08 | 北京市商汤科技开发有限公司 | Video compression method and device, electronic equipment and storage medium |
US11290723B2 (en) | 2018-11-29 | 2022-03-29 | Beijing Sensetime Technology Development Co., Ltd. | Method for video compression processing, electronic device and storage medium |
CN109451308B (en) * | 2018-11-29 | 2021-03-09 | 北京市商汤科技开发有限公司 | Video compression processing method and device, electronic equipment and storage medium |
CN112019843A (en) * | 2019-05-30 | 2020-12-01 | 富士通株式会社 | Encoding and decoding program, encoding and decoding device, encoding and decoding method |
CN112019843B (en) * | 2019-05-30 | 2022-07-15 | 富士通株式会社 | Encoding and decoding method, apparatus, readable storage medium |
CN110351568A (en) * | 2019-06-13 | 2019-10-18 | 天津大学 | Video in-loop filter based on deep convolutional network |
WO2021064291A1 (en) * | 2019-10-02 | 2021-04-08 | Nokia Technologies Oy | Guiding decoder-side optimization of neural network filter |
US11341688B2 (en) | 2019-10-02 | 2022-05-24 | Nokia Technologies Oy | Guiding decoder-side optimization of neural network filter |
CN110740319B (en) * | 2019-10-30 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Video encoding and decoding method and device, electronic equipment and storage medium |
CN110740319A (en) * | 2019-10-30 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Video encoding and decoding method and device, electronic equipment and storage medium |
WO2021109846A1 (en) * | 2019-12-06 | 2021-06-10 | 华为技术有限公司 | Bit stream data processing method and apparatus |
CN111083482A (en) * | 2019-12-31 | 2020-04-28 | 合肥图鸭信息科技有限公司 | Video compression network training method and device and terminal equipment |
CN111212288B (en) * | 2020-01-09 | 2022-10-04 | 广州虎牙科技有限公司 | Video data encoding and decoding method and device, computer equipment and storage medium |
CN111212288A (en) * | 2020-01-09 | 2020-05-29 | 广州虎牙科技有限公司 | Video data encoding and decoding method and device, computer equipment and storage medium |
CN113259671A (en) * | 2020-02-10 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Loop filtering method, device and equipment in video coding and decoding and storage medium |
CN113542867A (en) * | 2020-04-20 | 2021-10-22 | 声音猎手公司 | Content filtering in a media playback device |
CN111556337A (en) * | 2020-05-15 | 2020-08-18 | 腾讯科技(深圳)有限公司 | Media content implantation method, model training method and related device |
CN111726613A (en) * | 2020-06-30 | 2020-09-29 | 福州大学 | Video coding optimization method based on just noticeable difference |
CN111726613B (en) * | 2020-06-30 | 2021-07-27 | 福州大学 | Video coding optimization method based on just noticeable difference |
CN114125449A (en) * | 2021-10-26 | 2022-03-01 | 阿里巴巴新加坡控股有限公司 | Video processing method, system and computer readable medium based on neural network |
WO2023116173A1 (en) * | 2021-12-21 | 2023-06-29 | 腾讯科技(深圳)有限公司 | Data processing method and apparatus, and computer device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107197260B (en) | 2019-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107197260B (en) | Video coding post-filter method based on convolutional neural networks | |
CN106960416A (en) | Content-complexity-adaptive super-resolution method for compressed video satellite images | |
CN106716997A (en) | Video encoding method using in-loop filter parameter prediction and apparatus therefor, and video decoding method and apparatus therefor | |
CN110351568A (en) | Video in-loop filter based on deep convolutional network | |
CN104093021B (en) | Monitoring video compression method | |
CN103782598A (en) | Fast encoding method for lossless coding | |
CN108184118A (en) | Cloud desktop contents encode and coding/decoding method and device, system | |
CN108900848A (en) | Video quality enhancement method based on adaptive separable convolution | |
CN113766249B (en) | Loop filtering method, device, equipment and storage medium in video coding and decoding | |
CN111031315B (en) | Compressed video quality enhancement method based on attention mechanism and time dependence | |
CN101272489B (en) | Encoding and decoding device and method for video image quality enhancement | |
CN105791877A (en) | Adaptive loop filter method in video coding and decoding | |
CN111247797A (en) | Method and apparatus for image encoding and decoding | |
CN103347185A (en) | Synthesis compression coding method for unmanned aerial vehicle reconnaissance images based on selective block transform | |
CN109903351A (en) | Image compression method combining convolutional neural networks and traditional coding | |
Li et al. | Multi-scale grouped dense network for VVC intra coding | |
CN112188217B (en) | JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning | |
CN113055674B (en) | Compressed video quality enhancement method based on two-stage multi-frame cooperation | |
KR100697516B1 (en) | Moving picture coding method based on 3D wavelet transformation | |
CN110677644A (en) | Video coding and decoding method and video coding intra-frame predictor | |
CN103747257B (en) | Method for efficient coding of video data | |
CN112001854A (en) | Method for repairing coded image and related system and device | |
KR101998036B1 (en) | Artifact reduction method and apparatus | |
CN103313064B (en) | Based on the time domain error hidden method of inter-frame mode and movement repair | |
CN115002482B (en) | End-to-end video compression method and system using structural preserving motion estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||