CN110324664A - Neural-network-based video frame interpolation method and training method for its model - Google Patents
- Publication number
- CN110324664A (application CN201910612434.XA)
- Authority
- CN
- China
- Prior art keywords
- frame
- training
- video
- network
- reference frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234345—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
Abstract
The present invention provides a neural-network-based video frame interpolation method and a training method for its model. After a current group of training reference frames is determined from a preset training set, the training reference frames are input into a preset initial model; a feature extraction network generates initial feature maps of the training reference frames at a preset number of levels; a feature fusion network fuses the initial feature maps of the preset number of levels into a fused feature map; the fused feature map is then input into an output network, which outputs a training supplementary video frame lying between the first training frame and the second training frame; a loss value of the training supplementary video frame is determined by a preset prediction loss function; the next group of training reference frames is then input to continue training the initial model until its parameters converge, at which point training ends and the video frame interpolation model is obtained. Through the feature extraction and feature fusion processes, the present invention obtains comprehensive feature information of the reference frames, yielding supplementary video frames with a better interpolation effect and thereby improving the viewing experience.
Description
Technical field
The present invention relates to the technical field of image processing, and in particular to a neural-network-based video frame interpolation method and a training method for its model.
Background art
In the related art, frame interpolation for video is generally performed with a motion compensation method or an optical-flow-based method. When interpolation is performed by motion compensation, the reference frame image is divided into a static part and a moving part, and the motion vector of an object is estimated from the moving part so as to determine the image data of the frame to be interpolated; however, when an object moves quickly between two video frames, the interpolation result is poor. When interpolation is performed by an optical-flow-based method, brightness is assumed to be constant between adjacent frames, and the correspondence between the previous frame and the current frame is found from the temporal variation of pixels in the image sequence and the correlation between adjacent frames, so as to determine the frame to be interpolated; when an abrupt brightness change occurs between two video frames, the interpolation result is poor. Because the above approaches consider only partial information of the reference frames during interpolation, the interpolation effect is poor and the viewing experience suffers.
Summary of the invention
In view of this, the object of the present invention is to provide a neural-network-based video frame interpolation method and a training method for its model, so as to improve the interpolation effect.
In a first aspect, an embodiment of the present invention provides a training method for a neural-network-based video frame interpolation model, comprising: determining a current group of training reference frames from a preset training set, the training reference frames comprising a first training frame and a second training frame; inputting the training reference frames into a preset initial model, the initial model comprising a feature extraction network, a feature fusion network and an output network; generating, through the feature extraction network, initial feature maps of the training reference frames at a preset number of levels; fusing, through the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map; inputting the fused feature map into the output network and outputting a training supplementary video frame between the first training frame and the second training frame; determining a loss value of the training supplementary video frame by a preset prediction loss function; and training the initial model according to the loss value until the parameters of the initial model converge, so as to obtain the video frame interpolation model.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein the feature extraction network comprises multiple groups of first convolutional networks connected in sequence, each group of first convolutional networks comprising a convolutional layer and an average pooling layer connected to each other.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein the initial feature maps span multiple levels whose scales differ; the step of fusing the initial feature maps of the preset number of levels into a fused feature map through the feature fusion network comprises: arranging the multi-level initial feature maps in order of scale, the initial feature map of the top level having the smallest scale and the initial feature map of the bottom level the largest; taking the initial feature map of the top level as the fused feature map of the top level; for each level other than the top level, merging the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level; and taking the fused feature map of the lowest level as the final fused feature map.
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein the feature fusion network comprises multiple groups of second convolutional networks connected in sequence, each group of second convolutional networks comprising a bilinear interpolation layer and a convolutional layer connected to each other; the step of merging the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level comprises: interpolating, through the bilinear interpolation layer, the fused feature map of the level above the current level to obtain a fused feature map whose size matches the initial feature map of the current level; and convolving, through the convolutional layer, the initial feature map of the current level with the interpolated fused feature map of the level above to obtain the fused feature map of the current level.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the output network comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a feature synthesis layer; the first, second, third and fourth convolutional layers are each connected to the feature fusion network, and each is also connected to the feature synthesis layer; the step of inputting the fused feature map into the output network and outputting a training supplementary video frame between the first training frame and the second training frame comprises: performing a first convolution operation on the feature data corresponding to the first training frame in the fused feature map through the first convolutional layer to output a first vertical feature map; performing a second convolution operation on the feature data corresponding to the first training frame in the fused feature map through the second convolutional layer to output a first horizontal feature map; performing a third convolution operation on the feature data corresponding to the second training frame in the fused feature map through the third convolutional layer to output a second vertical feature map; performing a fourth convolution operation on the feature data corresponding to the second training frame in the fused feature map through the fourth convolutional layer to output a second horizontal feature map; and superposing, through the feature synthesis layer, the first vertical feature map, the first horizontal feature map, the second vertical feature map and the second horizontal feature map to obtain the training supplementary video frame.
In a second aspect, an embodiment of the present invention further provides a neural-network-based video frame interpolation method, comprising: obtaining a first reference frame and a second reference frame of a video to be interpolated; inputting the first reference frame and the second reference frame into a pre-established video frame interpolation model to generate a supplementary video frame, the model having been trained by the above training method for a neural-network-based video frame interpolation model; and inserting the supplementary video frame between the first reference frame and the second reference frame.
In a third aspect, an embodiment of the present invention further provides a training apparatus for a neural-network-based video frame interpolation model, comprising: a training reference frame determining module, configured to determine a current group of training reference frames from a preset training set, the training reference frames comprising a first training frame and a second training frame; a training reference frame input module, configured to input the training reference frames into a preset initial model comprising a feature extraction network, a feature fusion network and an output network; a feature extraction module, configured to generate, through the feature extraction network, initial feature maps of the training reference frames at a preset number of levels; a feature fusion module, configured to fuse, through the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map; a supplementary frame determining module, configured to input the fused feature map into the output network and output a training supplementary video frame between the first training frame and the second training frame; a loss value obtaining module, configured to determine the loss value of the training supplementary video frame by a preset prediction loss function; and a training module, configured to train the initial model according to the loss value until the parameters of the initial model converge, so as to obtain the video frame interpolation model.
In a fourth aspect, an embodiment of the present invention further provides a neural-network-based video frame interpolation apparatus, comprising: a reference frame obtaining module, configured to obtain a first reference frame and a second reference frame of a video to be interpolated; a supplementary frame generating module, configured to input the first reference frame and the second reference frame into a pre-established video frame interpolation model to generate a supplementary video frame, the model having been trained by the above training method for a neural-network-based video frame interpolation model; and a supplementary frame inserting module, configured to insert the supplementary video frame between the first reference frame and the second reference frame.
In a fifth aspect, an embodiment of the present invention further provides a server comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor; the processor executes the machine-executable instructions to implement the steps of the above training method for a neural-network-based video frame interpolation model or of the above neural-network-based video frame interpolation method.
In a sixth aspect, an embodiment of the present invention further provides a machine-readable storage medium storing machine-executable instructions which, when called and executed by a processor, cause the processor to implement the steps of the above training method for a neural-network-based video frame interpolation model or of the above neural-network-based video frame interpolation method.
Embodiments of the present invention bring the following beneficial effects.
Embodiments of the present invention provide a neural-network-based video frame interpolation method, a training method for its model, an apparatus and a server. After a current group of training reference frames is determined from a preset training set, the training reference frames are input into a preset initial model; the feature extraction network generates initial feature maps of the training reference frames at a preset number of levels; the feature fusion network fuses these initial feature maps into a fused feature map; the fused feature map is then input into the output network, which outputs a training supplementary video frame between the first training frame and the second training frame; a loss value of the training supplementary video frame is determined by a preset prediction loss function; the next group of training reference frames is then input to continue training the initial model until its parameters converge, at which point training ends and the video frame interpolation model is obtained. In this approach, the feature extraction network and the feature fusion network capture relatively rich and comprehensive feature information of the reference frames, so that the trained video frame interpolation model can produce supplementary video frames with a better interpolation effect, thereby improving the viewing experience.
Other features and advantages of the present invention will be set forth in the following description, or may be deduced from the description or determined unambiguously therefrom, or may be learnt by practising the above techniques of the invention. To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
In order to illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention, and that a person of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a training method for a neural-network-based video frame interpolation model provided by an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of the initial model in a training method for a neural-network-based video frame interpolation model provided by an embodiment of the present invention;
Fig. 3 is a flowchart of another training method for a neural-network-based video frame interpolation model provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a neural-network-based video frame interpolation method provided by an embodiment of the present invention;
Fig. 5 is a schematic data-flow diagram of the neural network framework in an adaptive deep-learning-based video frame interpolation method provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a training apparatus for a neural-network-based video frame interpolation model provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a neural-network-based video frame interpolation apparatus provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a server provided by an embodiment of the present invention.
Specific embodiments
The technical solutions of the present invention are described clearly and completely below in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In the prior art, frame interpolation is generally performed with a motion compensation method or an optical-flow-based method.
The basic idea of the motion compensation method is as follows: divide the image into a static part and a moving part, estimate the motion vector of the object, obtain the image data of the previous frame according to the estimated motion vector, and then obtain the predicted pixels of the previous frame image data by means of a prediction filter. However, when an object moves quickly, the supplementary frame obtained in this way is prone to blurring or even serious distortion; when the motion vector estimated for an occluded region is inaccurate, the smoothness of the motion vector field is hard to guarantee; and when interpolating across a scene transition, the supplementary frame exhibits serious warping.
The basic idea of the optical-flow-based interpolation method is as follows: use the temporal variation of pixels in the image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby compute the motion information of objects between adjacent frames. However, since the basic assumption of the optical-flow method is constant brightness between adjacent frames, an abrupt brightness change violates this assumption and leads to visual artifacts in the interpolation result. In addition, the optical-flow method requires the capture times of adjacent video frames to be continuous, or the motion of objects between adjacent frames to be comparatively "small", and is therefore unsuitable for interpolating between widely spaced images.
On this basis, embodiments of the present invention provide a neural-network-based video frame interpolation method, a training method for its model, an apparatus and a server, which can be applied to frame interpolation for video, such as 2D or 3D video, or to related image processing.
To facilitate understanding of this embodiment, a training method for a neural-network-based video frame interpolation model disclosed in an embodiment of the present invention is first described in detail.
Referring to the flowchart of a training method for a neural-network-based video frame interpolation model shown in Fig. 1, the method comprises the following steps:
Step S100: determine a current group of training reference frames from a preset training set; the training reference frames comprise a first training frame and a second training frame.
The preset training set contains multiple groups of video frames. Since this method is mainly used for training the video frame interpolation model, and the two reference frames of a frame to be interpolated usually have a certain similarity, the similarity scale of two reference frames can be divided according to the intended scope of application of the video frame interpolation model, so as to determine the current group of training reference frames. For example, if two reference frames belong to different scenes, the similarity between them is low, and under normal circumstances no interpolation is needed between them; therefore, two frames whose similarity exceeds some threshold can be set as training reference frames, so as to satisfy the requirement that the first reference frame and the second reference frame belong to the same scene.
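A minimal sketch of such a similarity threshold, assuming a crude metric: the patent only requires "similarity above some threshold", so both the inverse-mean-absolute-difference metric and the 0.8 threshold below are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def frames_similar(frame_a, frame_b, threshold=0.8):
    """Crude same-scene check via inverse mean absolute pixel
    difference on [0, 1]-scaled images. Metric and threshold are
    illustrative assumptions."""
    a = frame_a.astype(np.float64) / 255.0
    b = frame_b.astype(np.float64) / 255.0
    return 1.0 - np.abs(a - b).mean() >= threshold

# Nearby frames of one scene pass; a frame from a much brighter scene fails.
f1 = np.full((4, 4, 3), 100, dtype=np.uint8)
f2 = np.full((4, 4, 3), 110, dtype=np.uint8)
f3 = np.full((4, 4, 3), 250, dtype=np.uint8)
```

In practice a more robust metric (histogram correlation, SSIM, or a learned embedding) would likely be used, but the thresholding logic is the same.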
Step S102: input the training reference frames into the preset initial model.
Usually, the two reference frames of the same video have the same size; if they differ, the image sizes of the two reference frames can be adjusted before they are input into the preset initial network. In a specific implementation, the first training frame and the second training frame can be spliced into one image and input into the preset initial model for processing. The initial model may comprise a feature extraction network, a feature fusion network and an output network, which respectively perform feature extraction, feature fusion and the final output of the supplementary video frame. In addition, when the training reference frames are colour images, the initial model usually processes them in three channels.
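The splicing step can be realised in several ways; one common choice, assumed here since the patent does not fix the layout, is to stack the two RGB frames along the channel axis into a single 6-channel input tensor:

```python
import numpy as np

# Hypothetical splicing of two same-sized RGB reference frames into
# one model input; channel concatenation is an assumption, the
# patent only says the frames are spliced into an image.
frame1 = np.zeros((64, 64, 3), dtype=np.float32)
frame2 = np.ones((64, 64, 3), dtype=np.float32)
model_input = np.concatenate([frame1, frame2], axis=-1)
```

Spatial side-by-side splicing would work equally well as long as training and inference use the same layout.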
Step S104: generate, through the feature extraction network, initial feature maps of the training reference frames at a preset number of levels.
The feature extraction network may be a neural network of various forms, such as a fully convolutional network or a fully connected network. After the training reference frames are input into the feature extraction network, initial feature maps of a preset number of levels are obtained; this preset number of levels is related to the number of convolutional layers in the feature extraction network and can be set as required. In a specific implementation, the initial feature map output by the previous convolutional layer can be used as the input of the current convolutional layer, which performs a convolution operation on it and outputs the initial feature map of the current layer; at this point, the scale of the initial feature map of the current layer is smaller than that of the initial feature map of the layer below.
Step S106: fuse, through the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map.
Since different initial feature maps are obtained by convolution with different convolution kernels, different initial feature maps contain features of the training reference frames of different kinds or dimensions. Fusing these features through the feature fusion network for the subsequent output of the supplementary frame allows the supplementary frame to restore the corresponding details better. The fusion process can also be realised by convolution; therefore the feature fusion network may likewise be a neural network of various forms, such as a fully convolutional network or a fully connected network. When the scales of the initial feature maps differ, a sampling layer can also be added to transform the scale of the initial feature maps or of the feature maps in the fusion process.
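The top-down fusion described in the second implementation of the first aspect can be sketched as follows. Nearest-neighbour upsampling stands in for the bilinear interpolation layer and averaging stands in for the learned convolution, so this shows only the data flow, not the trained operation:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling; stands in for the bilinear
    interpolation layer of the fusion network."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_top_down(initial_maps):
    """initial_maps is ordered from the largest (bottom-level) to the
    smallest (top-level) scale. The top-level map seeds the fusion;
    each fused map is upsampled to the next finer scale and merged
    with that level's initial map (averaging replaces the patent's
    convolution). The bottom-level result is the final fused map."""
    fused = initial_maps[-1]
    for feat in reversed(initial_maps[:-1]):
        fused = (feat + upsample2x(fused)) / 2.0
    return fused

fused = fuse_top_down([np.ones((8, 8, 4)), np.ones((4, 4, 4)), np.ones((2, 2, 4))])
```

The coarse-to-fine pass mirrors feature-pyramid-style fusion: semantic information from the smallest map is progressively combined with the finer-scale detail maps.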
Step S108: input the fused feature map into the output network and output the training supplementary video frame between the first training frame and the second training frame.
The fused feature map contains the features of the first training frame and the second training frame, and the features of the supplementary video frame bear a certain relationship to them. The output network may also include a convolutional neural network structure that extracts, from the features of the first training frame and of the second training frame respectively, the features belonging to the supplementary video frame, thereby synthesising the training supplementary video frame corresponding to the current group of training reference frames.
Step S110: the loss value of the training supplementary video frame is determined by a preset prediction loss function.

The above prediction loss function may include a perceptual loss function, an SSIM (structural similarity index) loss function, and the like; the loss function may be selected as required or according to historical experience.
Step S112: the initial model is trained according to the loss value until the parameters in the initial model converge, yielding the video frame-supplementing model.

The above loss value reflects how well the training supplementary video frame matches the ideal supplementary video frame. A target loss value may be set in advance; during model training, the parameters of the model are adjusted in the direction that approaches this loss value, and once it is reached the parameters of the initial model have converged, giving a mature video frame-supplementing model. This process requires a large amount of sample data; in practice, the training reference frames used while training the initial model may be non-repeating sample groups, or there may be sample groups that repeat one another.
The embodiment of the invention provides a training method for a neural-network-based video frame-supplementing model. After the current training reference frames are determined from a preset training set, they are input to a preset initial model; the feature extraction network generates initial feature maps of the training reference frames at a preset number of levels; the feature fusion network fuses these initial feature maps into a fusion feature map; the fusion feature map is then input to the output network, which outputs the training supplementary video frame between the first and second training frames; the loss value of the training supplementary video frame is determined by a preset prediction loss function; the next group of training reference frames is then input and training continues until the parameters of the initial model converge, at which point training ends and the video frame-supplementing model is obtained. In this manner, richer and more comprehensive feature information of the reference frames is captured by the feature extraction network and the feature fusion network, so the trained video frame-supplementing model can produce supplementary frames with a better supplementing effect, thereby improving the user's viewing experience.
The embodiment of the invention also provides another training method for a neural-network-based video frame-supplementing model. This method focuses on the fusion of the initial feature maps by the feature fusion network and on the output of the training supplementary frame by the output network.

The method is based on the initial model shown in Fig. 2. The initial model includes a feature extraction network, a feature fusion network and an output network. The feature extraction network includes multiple groups of sequentially connected first convolutional networks, each group comprising an interconnected convolutional layer and average pooling layer; Fig. 2 takes five layers of first convolutional networks as an example. The feature fusion network includes sequentially connected second convolutional networks, each group comprising an interconnected bilinear interpolation layer and convolutional layer; Fig. 2 takes five layers of second convolutional networks as an example. The output network includes a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a feature synthesis layer; the first, second, third and fourth convolutional layers are each connected to the feature fusion network, and each is also connected to the feature synthesis layer.
The flow chart of this method is shown in Fig. 3 and includes the following steps:

Step S300: the current training reference frames are determined from a preset training set; the training reference frames include a first training frame and a second training frame.

Step S302: the training reference frames are input to a preset initial model.

Step S304: initial feature maps of the training reference frames at a preset number of levels are generated by the feature extraction network. With the structure of the initial model shown in Fig. 2, the preset number of levels is 5, and each layer of the first convolutional network outputs one initial feature map. After the training reference frames are input to the initial network, they are processed by the convolutional layer and average pooling layer of the first-level first convolutional network, which outputs the first-level initial feature map; the first-level initial feature map is then processed by the convolutional layer and average pooling layer of the second-level first convolutional network, which outputs the second-level initial feature map; and so on, until the initial feature maps of all 5 levels are obtained.
Step S306: the multiple layers of initial feature maps are arranged in sequence according to the scale of each layer's initial feature map, with the initial feature map of the top level having the smallest scale and that of the bottom level the largest.

Step S308: the initial feature map of the top level is taken as the fusion feature map of the top level.

Step S310: for each level other than the top level, the initial feature map of the current level is fused with the fusion feature map of the level above it to obtain the fusion feature map of the current level.

Step S312: the fusion feature map of the lowest level is taken as the final fusion feature map.
Based on the structure of the feature fusion network shown in Fig. 2, step S310 can be implemented as follows:

(1) The fusion feature map of the level above the current level is interpolated by the bilinear interpolation layer to obtain a fusion feature map matching the size of the current level's initial feature map.

(2) The current level's initial feature map and the interpolated fusion feature map of the level above are convolved by the convolutional layer to obtain the fusion feature map of the current level. In practice, before the convolutional layer processes the initial feature map and the fusion feature map, the corresponding parts of the two maps may be superimposed.

Usually, in order to fuse the initial feature maps of all levels, the number of levels of the feature fusion network equals that of the feature extraction network, as shown in Fig. 2; in practice a different number of levels may also be used as required. Since the feature fusion network needs to process the initial feature maps, the first convolutional networks of the feature extraction network are also correspondingly connected to the second convolutional networks of the feature fusion network, as shown in Fig. 2.
Step S314: a first convolution operation is performed by the first convolutional layer on the feature data in the fusion feature map corresponding to the first training frame, outputting a first vertical feature map.

Step S316: a second convolution operation is performed by the second convolutional layer on the feature data in the fusion feature map corresponding to the first training frame, outputting a first horizontal feature map.

Step S318: a third convolution operation is performed by the third convolutional layer on the feature data in the fusion feature map corresponding to the second training frame, outputting a second vertical feature map.

Step S320: a fourth convolution operation is performed by the fourth convolutional layer on the feature data in the fusion feature map corresponding to the second training frame, outputting a second horizontal feature map.
The convolution kernels of the first, second, third and fourth convolutional layers are one-dimensional kernels; compared with two-dimensional kernels, the computation load is smaller and the computation time shorter. The one-dimensional kernels respectively extract the vertical and horizontal features of the first and second training frames from the fusion feature map. With the structure shown in Fig. 2, the above four steps can be carried out in parallel, reducing computation time.

Step S322: the feature synthesis layer performs feature superposition processing on the first vertical feature map, first horizontal feature map, second vertical feature map and second horizontal feature map to obtain the training supplementary video frame.
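For a single output pixel, the synthesis of step S322 can be sketched as follows (an interpretive stand-in, assuming the vertical/horizontal pairs act as separable kernels applied to per-frame patches; the uniform kernel values and patch size n=5 are hypothetical):

```python
import numpy as np

def synthesize_pixel(p1, p2, k1v, k1h, k2v, k2h):
    """Feature superposition for one output pixel: each (vertical, horizontal)
    pair forms a 2D kernel via outer product, is convolved with the patch from
    its frame, and the two contributions are summed."""
    return (np.outer(k1v, k1h) * p1).sum() + (np.outer(k2v, k2h) * p2).sum()

n = 5
rng = np.random.default_rng(0)
p1 = rng.standard_normal((n, n))   # patch from the first training frame
p2 = rng.standard_normal((n, n))   # patch from the second training frame
# hypothetical kernel estimates: uniform weights, each 1D kernel summing to 1
k1v, k1h, k2v, k2h = (np.full(n, 1.0 / n) for _ in range(4))
out = synthesize_pixel(p1, p2, k1v, k1h, k2v, k2h)
print(out)
```

With uniform kernels this reduces to averaging both patches; in the trained model the four sub-networks would instead predict content-dependent kernels per pixel.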
Step S324: the loss value of the training supplementary video frame is determined by a preset prediction loss function.

Step S326: the initial model is trained according to the loss value until the parameters in the initial model converge, yielding the video frame-supplementing model.
In the above method, multiple first convolutional networks composed of a convolutional layer and an average pooling layer are used when generating the initial feature maps, and multiple second convolutional networks composed of a convolutional layer and a bilinear interpolation layer are used when fusing the initial feature maps into the fusion feature map, so richer and more comprehensive features of the training reference frames are obtained. The output network extracts the vertical and horizontal features of the first and second training frames from the fusion feature map in parallel through four convolutional layers and finally synthesizes the supplementary video frame. This approach yields a better frame-supplementing effect while reducing the computation load and computation time.
Based on the above embodiments of the training method for the frame-supplementing model, the embodiment of the invention also provides a neural-network-based video frame-supplementing method, whose flow chart is shown in Fig. 4. The method includes the following steps:

Step S400: the first reference frame and the second reference frame of the video to be frame-supplemented are obtained.

The above first and second reference frames may be two adjacent frames in the video frame sequence of the video to be supplemented, or other video frames may lie between them. The selection of the first and second reference frames may also follow the requirements imposed on the training reference frames during the training of the frame-supplementing model, for example that the two video frames belong to the same scene.
Step S402: the first reference frame and the second reference frame are input to the pre-established video frame-supplementing model to generate a supplementary video frame; the model is obtained by the training method of the neural-network-based video frame-supplementing model described above.

The scales of the first and second reference frames input to the model should be identical; if they differ, they must be adjusted to the same scale before being input to the pre-established video frame-supplementing model.

Step S404: the supplementary video frame is inserted between the first reference frame and the second reference frame.

The entire processing flow of this method is end-to-end: no subsequent processing of the video frames is needed, the frame-rate conversion effect is good, and compared with conventional methods it provides higher-quality video frame interpolation.
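The inference flow of steps S400 to S404 can be sketched end to end (the blend model below is a hypothetical stand-in for the trained frame-supplementing model, and nearest-neighbour resizing stands in for whatever scale adjustment is used):

```python
import numpy as np

def resize_nearest(frame, h, w):
    """Nearest-neighbour resize so both reference frames share one scale."""
    ys = np.arange(h) * frame.shape[0] // h
    xs = np.arange(w) * frame.shape[1] // w
    return frame[np.ix_(ys, xs)]

def interpolate(ref1, ref2, model):
    if ref1.shape != ref2.shape:            # scales must match before input
        ref2 = resize_nearest(ref2, *ref1.shape)
    return model(ref1, ref2)

# Stand-in for the trained frame-supplementing model (a plain average blend).
blend_model = lambda a, b: 0.5 * (a + b)

video = [np.zeros((4, 4)), np.ones((8, 8))]   # reference frames of differing scale
mid = interpolate(video[0], video[1], blend_model)
video.insert(1, mid)                          # step S404: insert between the refs
print(len(video), mid.shape, float(mid.mean()))  # 3 (4, 4) 0.5
```

The sequence grows by one frame, with the generated frame placed between the two reference frames.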
Based on the above embodiments, the invention further provides an adaptive deep-learning-based video frame-supplementing method, which includes the following steps:

Step (1): design a fully convolutional neural network architecture.

Specifically, a fully convolutional neural network is used. The network includes a contracting component for feature extraction (equivalent to the feature extraction network above) and an expanding component containing up-sampling layers to perform prediction (equivalent to the feature fusion network above); skip connections are further used so that the expanding component can obtain features from the contracting part of the network. The information flow is directed to the last expanding layer, which splits into four sub-networks (equivalent to the four convolutional layers in the output network above), each sub-network computing one of the kernels. The data-flow diagram is shown in Fig. 5: the structure formed by the contracting and expanding components is equivalent to an encoder-decoder network, the extracted features are fed to the four sub-networks, and the estimated pixel-dependent kernels are convolved with the input frames to generate the interpolated frame I. Each sub-network estimates, in a dense per-pixel manner, one of the four 1D kernels for every output pixel (equivalent to the training process). Besides convolutional layers, each sub-network also contains a bilinear interpolation layer, whose role is to enlarge the extracted features to match the input frames. In Fig. 5, I1' denotes the features extracted from reference frame I1 that belong to the supplementary frame I', and I2' denotes the features extracted from reference frame I2 that belong to the supplementary frame I'.
Specifically: video frame interpolation aims at obtaining the intermediate frame from the two input frames I1 and I2. Traditional video frame interpolation involves two steps, motion estimation and pixel synthesis, usually realized through optical flow and pixel interpolation. When optical flow becomes unreliable due to occlusion, motion blur and similar problems, the interpolation results obtained in this way may be inaccurate.
In this method, for each output pixel, a pair of two-dimensional convolution kernels K1(x, y) and K2(x, y) is estimated using a convolution-based approach, and the color of the output pixel is computed by convolving them with I1 and I2. The mathematical description for each output pixel is:

    I'(x, y) = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)

where P1(x, y) and P2(x, y) are the patches of I1 and I2 centered at (x, y), equivalent to the feature matrices obtained after I1 and I2 are processed by the above contracting and expanding components (equivalent to the above fusion feature map).
The consumption problem brought by computing large kernels is solved by estimating a pair of one-dimensional kernels to approximate each two-dimensional kernel. For K1 and K2, the estimated pairs <k1,v, k1,h> and <k2,v, k2,h> give the approximations K1 ≈ k1,v * k1,h and K2 ≈ k2,v * k2,h, where k1,v and k1,h are respectively the vertical and horizontal vectors of K1, and k2,v and k2,h are respectively the vertical and horizontal vectors of K2. This reduces the number of parameters of each kernel from the original n*n to 2n.
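The parameter saving and the outer-product approximation can be checked directly; for an exactly separable (rank-1) kernel such as a 2D Gaussian, the pair of 1D vectors reconstructs the full kernel with no error (the kernel size n=51 and the Gaussian width are illustrative choices only):

```python
import numpy as np

n = 51
# A separable 2D kernel: outer product of a vertical and a horizontal 1D Gaussian.
t = np.arange(n) - n // 2
g = np.exp(-t**2 / (2 * 8.0**2))
k_v = g / g.sum()                 # vertical 1D kernel
k_h = g / g.sum()                 # horizontal 1D kernel
K = np.outer(k_v, k_h)            # the full 2D kernel it represents

# Parameter count: 2n values per kernel instead of n*n.
print(K.size, k_v.size + k_h.size)  # 2601 102
```

A 51x51 kernel thus needs only 102 estimated values instead of 2601, which is the memory-consumption saving the method relies on.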
To estimate the four groups of one-dimensional kernels, the information flow is directed to the last expanding layer, which splits into four sub-networks, each computing one of the kernels. The combination of the four kernels could also be modeled as a unified expression, but convergence during training is faster when four sub-networks are used.

Meanwhile, to solve the artifact problem observed in experiments, these artifacts are handled with bilinear interpolation, performing the up-sampling in the decoder of the network.
Step (2): construct the loss function, so that the effect of the VGG-19 network based on feature reconstruction loss is better.

The loss is defined using a perceptual loss function, whose mathematical description is:

    L = || φ(Î) − φ(Igt) ||2^2

where φ denotes the features extracted from an image, Î denotes the predicted value, and Igt denotes the ground truth.
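A minimal sketch of this feature reconstruction loss, with a single fixed convolution standing in for φ (in the patent φ is feature extraction from a VGG-19 network; the stand-in kernel and image sizes here are assumptions):

```python
import numpy as np

def phi(img, k):
    """Stand-in feature extractor: one fixed 3x3 'same' convolution.
    A real implementation would use VGG-19 feature maps instead."""
    p = np.pad(img, 1)
    h, w = img.shape
    return np.array([[(p[i:i + 3, j:j + 3] * k).sum() for j in range(w)]
                     for i in range(h)])

def perceptual_loss(pred, gt, k):
    """|| phi(pred) - phi(gt) ||_2^2 over the feature maps."""
    d = phi(pred, k) - phi(gt, k)
    return float((d ** 2).sum())

rng = np.random.default_rng(0)
k = rng.standard_normal((3, 3))
gt = rng.standard_normal((16, 16))
print(perceptual_loss(gt, gt, k))              # 0.0 for a perfect prediction
print(perceptual_loss(gt + 0.1, gt, k) > 0.0)  # True
```

Because the distance is taken in feature space rather than pixel space, the loss rewards reconstructions whose structure matches the ground truth, which is the motivation for using it here.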
Step (3): initialize the neural network parameters with the convolution-aware initialization method and train with AdaMax, using image regions of 128*128 pixels and avoiding image regions that contain no useful information, to improve the training effect.

The neural network parameters are initialized with the convolution-aware initialization method and trained with AdaMax, where β1 = 0.9 is the exponential decay rate of the first-order moment estimate, β2 = 0.999 is the exponential decay rate of the second-order moment estimate, the learning rate is 0.001, and the mini-batch size is 16. Image regions of 128 × 128 pixels are used rather than entire video frames; avoiding image regions that contain no useful information improves the training effect.
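The AdaMax update with the stated hyperparameters (β1 = 0.9, β2 = 0.999, learning rate 0.001) can be sketched on a toy quadratic objective; this is the standard infinity-norm variant of Adam, not the patent's training code, and the objective is purely illustrative:

```python
import numpy as np

def adamax_minimize(grad_fn, x0, lr=0.001, beta1=0.9, beta2=0.999,
                    eps=1e-8, steps=5000):
    """AdaMax: first-moment estimate m (decay beta1) and infinity-norm
    second-moment estimate u (decay beta2)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    u = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-order moment estimate
        u = np.maximum(beta2 * u, np.abs(g))     # exponentially weighted inf-norm
        x = x - (lr / (1 - beta1 ** t)) * m / (u + eps)
    return x

# Toy objective: f(x) = ||x - 3||^2, gradient 2(x - 3); minimum at x = 3.
x_star = adamax_minimize(lambda x: 2 * (x - 3.0), np.zeros(4))
print(np.round(x_star, 2))
```

In the patent this optimizer would be applied to the network weights with mini-batches of 16 cropped 128 × 128 regions.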
The training set is generated as follows. All video frames are divided into groups of three frames, a frame is chosen at random within each group, and the three-frame group centered on that frame is extracted from the video. Since video resolution has a considerable influence on the model, higher-resolution videos are chosen and scaled to a resolution of 1280*720 to reduce the influence of video compression. To avoid selecting three-frame groups containing frames with little or no motion, the optical flow between the first and last frames of each group is computed first, together with the average flow magnitude. Then 500,000 three-frame groups are selected as candidates, where groups with larger motion are more likely to be chosen; in this way a training set with larger motion is obtained.

At the same time, since some videos are composed of many shots, the color difference between different frames is computed to detect shot switches, and groups spanning different shots are deleted. Finally, the entropy of the optical flow in each sample is computed, and the 250,000 three-frame groups with the largest entropy are selected to form the training dataset. In this training dataset, the flow magnitude of about 70% of the pixels is at least 20 pixels; the average is 25 pixels and the maximum is 38 pixels.
The training data are augmented during training. Each sample in the training set is 150 × 150 pixels, and patches of 128 × 128 pixels are used for training, so data augmentation can be performed by randomly cropping the training data, preventing the network from learning spatial priors present in the training set. The amount of motion of each sample is enhanced by shifting the crop windows in the first and last frames while keeping the crop window of the intermediate frame fixed; by performing this operation consistently and shifting the crop windows of the first and last frames in opposite directions, the intermediate frame remains valid. Experiments found that a shift of about 6 pixels works well, increasing the flow magnitude by about 8.5 pixels. The cropped patches are also randomly flipped vertically or horizontally and their temporal order is randomly swapped, so that the motion within the training dataset is symmetric and the network is prevented from becoming biased.
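The augmentation steps above can be sketched as one function (a minimal sketch; the exact shift sampling and flip probabilities are assumptions consistent with the 150 → 128 crop and the roughly 6-pixel shift described):

```python
import numpy as np

def augment(first, mid, last, rng, patch=128, shift=6):
    """Random crop with opposite-direction shift of the first/last windows,
    random horizontal/vertical flip, random temporal order swap."""
    h, w = mid.shape
    # base crop window, leaving room for +/- shift in the outer frames
    y = rng.integers(shift, h - patch - shift + 1)
    x = rng.integers(shift, w - patch - shift + 1)
    dy, dx = rng.integers(-shift, shift + 1, size=2)
    f = first[y + dy:y + dy + patch, x + dx:x + dx + patch]
    m = mid[y:y + patch, x:x + patch]                # middle window stays fixed
    l = last[y - dy:y - dy + patch, x - dx:x - dx + patch]  # opposite shift
    for ax in (0, 1):                                # random flips
        if rng.random() < 0.5:
            f, m, l = np.flip(f, ax), np.flip(m, ax), np.flip(l, ax)
    if rng.random() < 0.5:                           # random temporal swap
        f, l = l, f
    return f, m, l

rng = np.random.default_rng(0)
frames = [rng.standard_normal((150, 150)) for _ in range(3)]
f, m, l = augment(*frames, rng)
print(f.shape, m.shape, l.shape)  # (128, 128) (128, 128) (128, 128)
```

Shifting the outer windows in opposite directions adds apparent motion while leaving the intermediate frame's crop the correct midpoint.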
After the video frame-supplementing model is trained and before the model performs video frame interpolation, it can be determined whether the two reference frames belong to the same scene. In this judgment, the pixel values at corresponding pixel positions of the first and second reference frames are subtracted to obtain the pixel difference at each position; the total pixel difference is computed from the per-position differences; and it is judged whether the total pixel difference is greater than or equal to a preset difference threshold. If not, the first and second reference frames are confirmed to belong to the same scene; if so, the first and second reference frames are confirmed to belong to different scenes. When the two reference frames are not in the same scene, frame supplementing is usually unnecessary.
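This same-scene check can be sketched directly (the threshold value and frame contents below are hypothetical; the patent only specifies a preset difference threshold):

```python
import numpy as np

def same_scene(ref1, ref2, diff_threshold):
    """Subtract corresponding pixel positions, sum to a total pixel difference;
    below the preset threshold means the two reference frames share a scene."""
    total_diff = np.abs(ref1.astype(float) - ref2.astype(float)).sum()
    return bool(total_diff < diff_threshold)

frame_a = np.full((8, 8), 100.0)
frame_b = frame_a + 1.0            # near-identical: same scene
frame_c = np.full((8, 8), 250.0)   # very different: scene cut
threshold = 1000.0                 # hypothetical preset difference threshold
print(same_scene(frame_a, frame_b, threshold))  # True  -> frame supplementing proceeds
print(same_scene(frame_a, frame_c, threshold))  # False -> skip frame supplementing
```

Interpolating across a scene cut would blend unrelated content, so the check gates the whole pipeline.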
The above adaptive deep-learning-based frame-supplementing method comprises two steps: jointly computing the features between adjacent frames, and generating the intermediate frame from the result. The input frames are convolved with spatially adaptive convolution kernels; video frame interpolation is formulated as convolving local patches of the input frames with pairs of one-dimensional kernels, which solves the problem of growing memory occupation brought by computing larger kernels. With the optimized deep fully convolutional neural network, kernel estimation and fusion into the entire intermediate frame can be computed in one pass, and the network can be trained with a perceptual loss to generate high-quality intermediate frames.

For cases of blur in frame supplementing, since the method uses adaptive convolution, it preserves the sharpness of the original images to the greatest extent and produces no ghosting artifacts. For occlusion regions and sudden brightness changes, where optical-flow methods are unreliable, this method estimates the convolution kernels through deep learning and synthesizes pixels automatically, so the frame-supplementing effect is stable. In addition, edge-aware pixel interpolation can be achieved: in the above video frame-supplementing model, only a few entries of each kernel have non-zero values, and along image edges the kernels are anisotropic, with orientations well aligned to the edge direction.
Corresponding to the above embodiments of the training method for the frame-supplementing model, the embodiment of the invention also provides a training device for a neural-network-based video frame-supplementing model, whose structural diagram is shown in Fig. 6, comprising:

a training reference frame determining module 600, configured to determine the current training reference frames from a preset training set, the training reference frames including a first training frame and a second training frame;

a training reference frame input module 602, configured to input the training reference frames to a preset initial model, the initial model including a feature extraction network, a feature fusion network and an output network;

a feature extraction module 604, configured to generate, by the feature extraction network, initial feature maps of the training reference frames at a preset number of levels;

a feature fusion module 606, configured to fuse, by the feature fusion network, the initial feature maps of the preset number of levels into a fusion feature map;

a supplementary frame determining module 608, configured to input the fusion feature map to the output network and output the training supplementary video frame between the first and second training frames;

a loss value obtaining module 610, configured to determine the loss value of the training supplementary video frame by a preset prediction loss function; and

a training module 612, configured to train the initial model according to the loss value until the parameters in the initial model converge, obtaining the video frame-supplementing model.

The device provided by the embodiment of the invention has the same technical effect and realization principle as the foregoing method embodiments; for brevity, where the device embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiments.
Corresponding to the above video frame-supplementing method embodiment, the embodiment of the invention also provides a neural-network-based video frame-supplementing device, whose structural diagram is shown in Fig. 7, comprising: a reference frame obtaining module 700, configured to obtain the first and second reference frames of the video to be frame-supplemented; a supplementary frame generating module 702, configured to input the first and second reference frames to the pre-established video frame-supplementing model to generate a supplementary video frame, the model being obtained by the training method of the neural-network-based video frame-supplementing model described above; and a supplementary frame inserting module 703, configured to insert the supplementary video frame between the first and second reference frames.

The device provided by the embodiment of the invention has the same technical effect and realization principle as the foregoing method embodiments; for brevity, where the device embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiments.
As shown in Fig. 8, the embodiment of the invention also provides a server, which includes a processor 130 and a memory 131; the memory 131 stores machine-executable instructions executable by the processor 130, and the processor 130 executes the machine-executable instructions to realize the above training method of the neural-network-based video frame-supplementing model or the neural-network-based video frame-supplementing method.

Further, the server shown in Fig. 8 also includes a bus 132 and a communication interface 133; the processor 130, the communication interface 133 and the memory 131 are connected by the bus 132.

The memory 131 may include high-speed random access memory (RAM) and may further include non-volatile memory, for example at least one disk memory. The communication connection between this system network element and at least one other network element is realized through at least one communication interface 133 (which may be wired or wireless), and the Internet, a wide area network, a local network, a metropolitan area network, etc. may be used. The bus 132 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc.; for ease of representation it is indicated by only one double-headed arrow in Fig. 8, but this does not mean that there is only one bus or only one type of bus.
The processor 130 may be an integrated circuit chip with signal processing capability. During realization, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 130 or by instructions in the form of software. The above processor 130 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can realize or execute the methods, steps and logic diagrams disclosed in the embodiments of the invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps of the methods disclosed in the embodiments of the invention may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in this field, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium is located in the memory 131, and the processor 130 reads the information in the memory 131 and completes the steps of the methods of the foregoing embodiments in combination with its hardware.
The embodiment of the invention also provides a machine-readable storage medium storing machine-executable instructions; when called and executed by a processor, the machine-executable instructions cause the processor to realize the above training method of the neural-network-based video frame-supplementing model or the neural-network-based video frame-supplementing method; for the specific realization see the method embodiments, which will not be repeated here.
The computer program product of the neural-network-based video frame-supplementing method, the training method and device of its model, and the server provided by the embodiments of the invention includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the methods described in the foregoing method embodiments, and for the specific realization see the method embodiments, which will not be repeated here.
If the functions are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that the embodiments described above are only specific embodiments of the invention, used to illustrate its technical solution rather than to limit it, and the protection scope of the invention is not limited thereto. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field can still, within the technical scope disclosed by the invention, modify the technical solutions recorded in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of the technical features; these modifications, variations or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention and should all be covered within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.
Claims (10)
1. A training method for a neural-network-based video frame supplementation model, comprising:
determining a current training reference frame based on a preset training set, the training reference frame comprising a first training frame and a second training frame;
inputting the training reference frame into a preset initial model, the initial model comprising a feature extraction network, a feature fusion network, and an output network;
generating, by the feature extraction network, initial feature maps of the training reference frame at a preset number of levels;
fusing, by the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map;
inputting the fused feature map into the output network, and outputting a training supplementary video frame between the first training frame and the second training frame;
determining a loss value of the training supplementary video frame by a preset prediction loss function; and
training the initial model according to the loss value until the parameters of the initial model converge, thereby obtaining the video frame supplementation model.
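For illustration only (not part of the claims): a minimal NumPy sketch of the training flow in claim 1. The three networks are replaced by trivial stand-ins, the only "parameter" is a single blend weight, and the loss is assumed to be L1 since the claim does not specify a loss function; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extraction(frames):
    # Stand-in: stack the two reference frames along the channel axis.
    return np.concatenate(frames, axis=-1)

def feature_fusion(feats):
    # Stand-in: identity fusion.
    return feats

def output_network(fused, w):
    # Stand-in: per-pixel blend of the two frames' channels by weight w.
    c = fused.shape[-1] // 2
    return w * fused[..., :c] + (1.0 - w) * fused[..., c:]

def l1_loss(pred, target):
    # The claim leaves the prediction loss unspecified; L1 is assumed.
    return np.abs(pred - target).mean()

# Toy training triplet: (first frame, ground-truth middle frame, second frame).
f1 = rng.random((8, 8, 3)); gt = rng.random((8, 8, 3)); f2 = rng.random((8, 8, 3))

w, lr = 0.9, 0.5   # single trainable parameter, for illustration only
losses = []
for _ in range(50):
    fused = feature_fusion(feature_extraction([f1, f2]))
    pred = output_network(fused, w)
    losses.append(l1_loss(pred, gt))
    # Numeric gradient of the loss w.r.t. the blend weight.
    grad = (l1_loss(output_network(fused, w + 1e-4), gt)
            - l1_loss(output_network(fused, w - 1e-4), gt)) / 2e-4
    w -= lr * grad   # "training according to the loss value"
```

In the actual model the parameters are the convolutional weights of the three networks and the update would use backpropagation; the loop above only mirrors the claim's forward–loss–update structure.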
2. The method according to claim 1, wherein the feature extraction network comprises a plurality of sequentially connected groups of first convolutional networks, each group of the first convolutional networks comprising a convolutional layer and an average pooling layer connected to each other.
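For illustration only: claim 2's extraction network stacks groups of (convolution, average pooling), so each group halves the spatial scale and yields one pyramid level. A NumPy sketch with a placeholder 3×3 kernel (the real kernels are learned):

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same'-padded 2D convolution for a single-channel image."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

def avg_pool2(img):
    """2x2 average pooling: halves each spatial dimension."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
frame = rng.random((32, 32))
kernel = np.full((3, 3), 1.0 / 9.0)  # placeholder for learned weights

# Three (conv -> average pool) groups produce a 3-level feature pyramid.
levels, x = [], frame
for _ in range(3):
    x = avg_pool2(conv2d_same(x, kernel))
    levels.append(x)
sizes = [lv.shape for lv in levels]
```

The group count (three here) stands in for the claim's "preset number of levels".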
3. The method according to claim 2, wherein the initial feature maps have multiple levels and the scales of the initial feature maps differ between levels; and the step of fusing, by the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map comprises:
arranging the multiple levels of initial feature maps in order according to their scales, wherein the initial feature map of the top level has the smallest scale and the initial feature map of the bottom level has the largest scale;
taking the initial feature map of the top level as the fused feature map of the top level;
for each level other than the top level, fusing the initial feature map of the current level with the fused feature map of the level above the current level to obtain the fused feature map of the current level; and
taking the fused feature map of the lowest level as the final fused feature map.
4. The method according to claim 3, wherein the feature fusion network comprises a plurality of sequentially connected groups of second convolutional networks, each group of the second convolutional networks comprising a bilinear interpolation layer and a convolutional layer connected to each other; and the step of fusing the initial feature map of the current level with the fused feature map of the level above the current level to obtain the fused feature map of the current level comprises:
performing, by the bilinear interpolation layer, interpolation on the fused feature map of the level above the current level to obtain a fused feature map matching the size of the initial feature map of the current level; and
performing, by the convolutional layer, a convolution calculation on the initial feature map of the current level and the interpolated fused feature map of the level above the current level to obtain the fused feature map of the current level.
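For illustration only: a NumPy sketch of the coarse-to-fine fusion of claims 3 and 4. The top (smallest) level's map is taken as-is, each lower level bilinearly upsamples the fused map from above to its own size and merges; the learned fusion convolution is replaced by a simple average, which is an assumption.

```python
import numpy as np

def bilinear_upsample2(img):
    """Bilinear 2x upsampling of a single-channel map (align-centers convention)."""
    h, w = img.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1); y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1); x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    top = (1 - wx) * img[y0][:, x0] + wx * img[y0][:, x1]
    bot = (1 - wx) * img[y1][:, x0] + wx * img[y1][:, x1]
    return (1 - wy) * top + wy * bot

rng = np.random.default_rng(0)
# Initial feature pyramid, ordered bottom (largest scale) to top (smallest).
pyramid = [rng.random((16, 16)), rng.random((8, 8)), rng.random((4, 4))]

fused = pyramid[-1]                         # top level: fused map = initial map
for level in reversed(pyramid[:-1]):        # proceed downward through the levels
    up = bilinear_upsample2(fused)          # interpolate to the current level's size
    fused = 0.5 * (level + up)              # stand-in for the fusion convolution
final_shape = fused.shape                   # lowest level: final fused feature map
```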
5. The method according to claim 1, wherein the output network comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, and a feature synthesis layer; the first, second, third, and fourth convolutional layers are each connected to the feature fusion network and each connected to the feature synthesis layer; and the step of inputting the fused feature map into the output network and outputting a training supplementary video frame between the first training frame and the second training frame comprises:
performing, by the first convolutional layer, a first convolution operation on the feature data corresponding to the first training frame in the fused feature map to output a first vertical feature map;
performing, by the second convolutional layer, a second convolution operation on the feature data corresponding to the first training frame in the fused feature map to output a first horizontal feature map;
performing, by the third convolutional layer, a third convolution operation on the feature data corresponding to the second training frame in the fused feature map to output a second vertical feature map;
performing, by the fourth convolutional layer, a fourth convolution operation on the feature data corresponding to the second training frame in the fused feature map to output a second horizontal feature map; and
performing, by the feature synthesis layer, feature superposition on the first vertical feature map, the first horizontal feature map, the second vertical feature map, and the second horizontal feature map to obtain the training supplementary video frame.
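For illustration only: the vertical/horizontal pairs of claim 5 suggest separable filtering, as in separable-convolution frame interpolation. The sketch below assumes one global (vertical, horizontal) 1D kernel pair per reference frame rather than the per-pixel kernels a real model would predict, and synthesizes the middle frame by superposing the two filtered frames; the uniform kernel values are placeholders.

```python
import numpy as np

def sep_conv(frame, kv, kh_):
    """Filter a single-channel frame with the separable kernel kv (outer) kh_,
    using 'same' zero padding."""
    n = kv.size
    p = n // 2
    padded = np.pad(frame, p)
    out = np.zeros_like(frame, dtype=float)
    for i in range(frame.shape[0]):
        for j in range(frame.shape[1]):
            patch = padded[i:i + n, j:j + n]
            out[i, j] = kv @ patch @ kh_   # vertical then horizontal weighting
    return out

rng = np.random.default_rng(0)
frame1 = rng.random((8, 8))
frame2 = rng.random((8, 8))

# Stand-ins for the four branches' outputs: a vertical and a horizontal kernel
# per reference frame. Each separable kernel sums to 0.5, so the superposition
# below is an average of the two filtered frames.
k1v = k2v = np.full(5, 0.2)   # vertical kernels (sum to 1)
k1h = k2h = np.full(5, 0.1)   # horizontal kernels (sum to 0.5)

# Feature synthesis layer: superpose the two filtered reference frames.
interpolated = sep_conv(frame1, k1v, k1h) + sep_conv(frame2, k2v, k2h)
```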
6. A neural-network-based video frame supplementation method, comprising:
obtaining a first reference frame and a second reference frame of a video to be frame-supplemented;
inputting the first reference frame and the second reference frame into a pre-established video frame supplementation model to generate a supplementary video frame, wherein the video frame supplementation model is obtained by training according to the training method of any one of claims 1-5; and
inserting the supplementary video frame between the first reference frame and the second reference frame.
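For illustration only: applying claim 6 over a whole clip, each adjacent pair of frames serves as the reference frames and the generated frame is inserted between them, doubling the effective frame rate. The trained model is replaced here by a plain average, which is an assumption.

```python
import numpy as np

def frame_supplement_model(ref1, ref2):
    # Stand-in for the trained model of claims 1-5: a plain average.
    return 0.5 * (ref1 + ref2)

# Toy clip whose frames are constant images valued at their timestamps 0, 1, 2.
video = [np.full((4, 4), float(t)) for t in range(3)]

out = []
for a, b in zip(video, video[1:]):
    out.extend([a, frame_supplement_model(a, b)])  # insert between each pair
out.append(video[-1])

timestamps = [f[0, 0] for f in out]
```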
7. A training apparatus for a neural-network-based video frame supplementation model, comprising:
a training reference frame determining module, configured to determine a current training reference frame based on a preset training set, the training reference frame comprising a first training frame and a second training frame;
a training reference frame input module, configured to input the training reference frame into a preset initial model, the initial model comprising a feature extraction network, a feature fusion network, and an output network;
a feature extraction module, configured to generate, by the feature extraction network, initial feature maps of the training reference frame at a preset number of levels;
a feature fusion module, configured to fuse, by the feature fusion network, the initial feature maps of the preset number of levels into a fused feature map;
a supplementary frame determining module, configured to input the fused feature map into the output network and output a training supplementary video frame between the first training frame and the second training frame;
a loss value obtaining module, configured to determine a loss value of the training supplementary video frame by a preset prediction loss function; and
a training module, configured to train the initial model according to the loss value until the parameters of the initial model converge, thereby obtaining the video frame supplementation model.
8. A neural-network-based video frame supplementation apparatus, comprising:
a reference frame obtaining module, configured to obtain a first reference frame and a second reference frame of a video to be frame-supplemented;
a supplementary frame generation module, configured to input the first reference frame and the second reference frame into a pre-established video frame supplementation model to generate a supplementary video frame, wherein the video frame supplementation model is obtained by training according to the training method for a neural-network-based video frame supplementation model of any one of claims 1-5; and
a supplementary frame insertion module, configured to insert the supplementary video frame between the first reference frame and the second reference frame.
9. A server, comprising a processor and a memory, wherein the memory stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the steps of the training method for a neural-network-based video frame supplementation model according to any one of claims 1 to 5 or of the neural-network-based video frame supplementation method according to claim 6.
10. A machine-readable storage medium storing machine-executable instructions, wherein, when called and executed by a processor, the machine-executable instructions cause the processor to implement the steps of the training method for a video frame supplementation model according to any one of claims 1 to 5 or of the neural-network-based video frame supplementation method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910612434.XA CN110324664B (en) | 2019-07-11 | 2019-07-11 | Video frame supplementing method based on neural network and training method of model thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110324664A true CN110324664A (en) | 2019-10-11 |
CN110324664B CN110324664B (en) | 2021-06-04 |
Family
ID=68123055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910612434.XA Active CN110324664B (en) | 2019-07-11 | 2019-07-11 | Video frame supplementing method based on neural network and training method of model thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110324664B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090244389A1 (en) * | 2008-03-27 | 2009-10-01 | Nao Mishima | Apparatus, Method, and Computer Program Product for Generating Interpolated Images |
CN102811333A (en) * | 2011-05-30 | 2012-12-05 | Jvc建伍株式会社 | Image processing apparatus and interpolation frame generating method |
CN104463172A (en) * | 2014-12-09 | 2015-03-25 | 中国科学院重庆绿色智能技术研究院 | Face feature extraction method based on face feature point shape drive depth model |
CN106326858A (en) * | 2016-08-23 | 2017-01-11 | 北京航空航天大学 | Road traffic sign automatic identification and management system based on deep learning |
CN108416440A (en) * | 2018-03-20 | 2018-08-17 | 上海未来伙伴机器人有限公司 | A kind of training method of neural network, object identification method and device |
CN109377445A (en) * | 2018-10-12 | 2019-02-22 | 北京旷视科技有限公司 | Model training method, the method, apparatus and electronic system for replacing image background |
CN109544482A (en) * | 2018-11-29 | 2019-03-29 | 厦门美图之家科技有限公司 | A kind of convolutional neural networks model generating method and image enchancing method |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
US10311337B1 (en) * | 2018-09-04 | 2019-06-04 | StradVision, Inc. | Method and device for providing integrated feature map using ensemble of multiple outputs from convolutional neural network |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111242081A (en) * | 2020-01-19 | 2020-06-05 | 深圳云天励飞技术有限公司 | Video detection method, target detection network training method, device and terminal equipment |
CN111353428A (en) * | 2020-02-28 | 2020-06-30 | 北京市商汤科技开发有限公司 | Action information identification method and device, electronic equipment and storage medium |
CN111353428B (en) * | 2020-02-28 | 2022-05-24 | 北京市商汤科技开发有限公司 | Action information identification method and device, electronic equipment and storage medium |
CN113875228B (en) * | 2020-04-30 | 2023-06-30 | 京东方科技集团股份有限公司 | Video frame inserting method and device and computer readable storage medium |
CN113875228A (en) * | 2020-04-30 | 2021-12-31 | 京东方科技集团股份有限公司 | Video frame insertion method and device and computer readable storage medium |
CN113630621A (en) * | 2020-05-08 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Video processing method, related device and storage medium |
CN113658230A (en) * | 2020-05-12 | 2021-11-16 | 武汉Tcl集团工业研究院有限公司 | Optical flow estimation method, terminal and storage medium |
CN113658230B (en) * | 2020-05-12 | 2024-05-28 | 武汉Tcl集团工业研究院有限公司 | Optical flow estimation method, terminal and storage medium |
CN111654746A (en) * | 2020-05-15 | 2020-09-11 | 北京百度网讯科技有限公司 | Video frame insertion method and device, electronic equipment and storage medium |
US11363271B2 (en) | 2020-05-15 | 2022-06-14 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method for video frame interpolation, related electronic device and storage medium |
CN111654746B (en) * | 2020-05-15 | 2022-01-21 | 北京百度网讯科技有限公司 | Video frame insertion method and device, electronic equipment and storage medium |
CN112040311A (en) * | 2020-07-24 | 2020-12-04 | 北京航空航天大学 | Video image frame supplementing method, device and equipment and storage medium |
CN111967382A (en) * | 2020-08-14 | 2020-11-20 | 北京金山云网络技术有限公司 | Age estimation method, and training method and device of age estimation model |
CN112422870B (en) * | 2020-11-12 | 2021-09-17 | 复旦大学 | Deep learning video frame insertion method based on knowledge distillation |
CN112422870A (en) * | 2020-11-12 | 2021-02-26 | 复旦大学 | Deep learning video frame insertion method based on knowledge distillation |
CN112330711A (en) * | 2020-11-26 | 2021-02-05 | 北京奇艺世纪科技有限公司 | Model generation method, information extraction method and device and electronic equipment |
CN112330711B (en) * | 2020-11-26 | 2023-12-05 | 北京奇艺世纪科技有限公司 | Model generation method, information extraction device and electronic equipment |
CN112565653B (en) * | 2020-12-01 | 2023-04-07 | 咪咕文化科技有限公司 | Video frame insertion method, system, electronic equipment and storage medium |
CN112565653A (en) * | 2020-12-01 | 2021-03-26 | 咪咕文化科技有限公司 | Video frame insertion method, system, electronic equipment and storage medium |
CN112804561A (en) * | 2020-12-29 | 2021-05-14 | 广州华多网络科技有限公司 | Video frame insertion method and device, computer equipment and storage medium |
WO2022141819A1 (en) * | 2020-12-29 | 2022-07-07 | 广州华多网络科技有限公司 | Video frame insertion method and apparatus, and computer device and storage medium |
CN113132664A (en) * | 2021-04-19 | 2021-07-16 | 科大讯飞股份有限公司 | Frame interpolation generation model construction method and video frame interpolation method |
CN113542651A (en) * | 2021-05-28 | 2021-10-22 | 北京迈格威科技有限公司 | Model training method, video frame interpolation method and corresponding device |
CN113542651B (en) * | 2021-05-28 | 2023-10-27 | 爱芯元智半导体(宁波)有限公司 | Model training method, video frame inserting method and corresponding devices |
CN113837136A (en) * | 2021-09-29 | 2021-12-24 | 深圳市慧鲤科技有限公司 | Video frame insertion method and device, electronic equipment and storage medium |
CN113837136B (en) * | 2021-09-29 | 2022-12-23 | 深圳市慧鲤科技有限公司 | Video frame insertion method and device, electronic equipment and storage medium |
CN114007135B (en) * | 2021-10-29 | 2023-04-18 | 广州华多网络科技有限公司 | Video frame insertion method and device, equipment, medium and product thereof |
CN114007135A (en) * | 2021-10-29 | 2022-02-01 | 广州华多网络科技有限公司 | Video frame insertion method and device, equipment, medium and product thereof |
CN115002379B (en) * | 2022-04-25 | 2023-09-26 | 武汉大学 | Video frame inserting method, training device, electronic equipment and storage medium |
CN115002379A (en) * | 2022-04-25 | 2022-09-02 | 武汉大学 | Video frame insertion method, training method, device, electronic equipment and storage medium |
CN115134676A (en) * | 2022-09-01 | 2022-09-30 | 有米科技股份有限公司 | Video reconstruction method and device for audio-assisted video completion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110324664A (en) | Neural-network-based video frame supplementation method and training method of the model thereof | |
Niklaus et al. | Video frame interpolation via adaptive separable convolution | |
Xue et al. | Video enhancement with task-oriented flow | |
Jin et al. | Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization | |
Liu et al. | Video frame synthesis using deep voxel flow | |
Cheng et al. | Multiple video frame interpolation via enhanced deformable separable convolution | |
CN111275626B (en) | Video deblurring method, device and equipment based on ambiguity | |
Giachetti et al. | Real-time artifact-free image upscaling | |
CN103597839B (en) | Video-frequency compression method, video reconstruction method and system and encoder | |
Nazeri et al. | Edge-informed single image super-resolution | |
JP6094863B2 (en) | Image processing apparatus, image processing method, program, integrated circuit | |
US9525858B2 (en) | Depth or disparity map upscaling | |
US20200334894A1 (en) | 3d motion effect from a 2d image | |
WO2021115403A1 (en) | Image processing method and apparatus | |
CN104506872B (en) | A kind of method and device of converting plane video into stereoscopic video | |
CN112543317A (en) | Method for converting high-resolution monocular 2D video into binocular 3D video | |
CN110135576A (en) | A kind of unsupervised learning method for video deblurring | |
CN110443883A (en) | A kind of individual color image plane three-dimensional method for reconstructing based on dropblock | |
CN109785270A (en) | A kind of image super-resolution method based on GAN | |
CN115298708A (en) | Multi-view neural human body rendering | |
CN106415657A (en) | Method and device for enhancing quality of an image | |
CN116248955A (en) | VR cloud rendering image enhancement method based on AI frame extraction and frame supplement | |
CN108924528A (en) | A kind of binocular stylization real-time rendering method based on deep learning | |
CN112634127A (en) | Unsupervised stereo image redirection method | |
CN108769644B (en) | Binocular animation stylized rendering method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||