CN110363794A - Optical flow prediction method between consecutive video frames - Google Patents

Optical flow prediction method between consecutive video frames

Info

Publication number
CN110363794A
CN110363794A (application CN201910645583.6A)
Authority
CN
China
Prior art keywords
network
optical flow
feature
layer
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910645583.6A
Other languages
Chinese (zh)
Inventor
王传旭
刘帅
丰艳
闫春娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology
Priority to CN201910645583.6A
Publication of CN110363794A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an optical flow prediction method between consecutive video frames, relating to the field of computer vision technology and comprising the following steps. Step a: extracting adjacent-frame spatial features through a deformable convolution unit; Step b: performing fusion and reconstruction on the adjacent-frame spatial features; Step c: performing deconvolution operations on the fused and reconstructed features and constructing a network stack; Step d: training the network stack with a loss function; Step e: outputting the result. The beneficial effects of the invention are: the convolution kernel is optimized at the level of its basic structure, replacing the fixed square convolution with deformable convolution, which improves prediction accuracy and saves computing resources; through training, the fused features are reconstructed and the parameters and channel weights are redistributed, so that the feature correlation of adjacent frames is retained to the greatest extent at only a small increase in computation cost.

Description

Optical flow prediction method between consecutive video frames
Technical field
The present invention relates to the field of computer vision technology, and in particular to an optical flow prediction method between consecutive video frames.
Background art
Optical flow prediction can be applied to trajectory tracking, foreground/background segmentation, human behavior recognition and other fields; it is a common method and a major issue in computer vision research. Optical flow contains information such as the instantaneous velocity and displacement vector of the pixels of a moving object on the observed imaging plane; an optical flow image represents the motion state of objects in the image through a combination of the color domain and the spatial domain. Compared with other image analysis methods, the emphasis of optical flow lies in "motion": optical flow contains not only the motion information of the observed object but also rich information about the three-dimensional structure of the scene. Optical flow prediction refers to the method of using the temporal variation of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computing the motion information (the optical flow map) of objects between adjacent frames.
Previous algorithms that perform optical flow prediction with convolutional neural networks usually use square convolution kernels of fixed size and shape in the convolutional feature-extraction part. This form of convolution limits the network's ability to adapt to different image subjects, generalizes poorly to the magnitude of pixel displacements between adjacent frames, and wastes computing resources.
The key of an optical flow prediction algorithm is computing the motion field between adjacent frames, so the association and fusion of the features of the two adjacent frames is the central problem to solve. Existing methods, which stack features directly or compute correlation with a moving window, have obvious shortcomings in accuracy and time complexity: edge details are easily lost during fusion, and accuracy suffers.
Summary of the invention
The purpose of the present invention is to solve the optical flow prediction problem between consecutive video frames, retain the spatial structure and the inter-frame feature correlation information to the greatest extent, and improve prediction accuracy; to this end, an optical flow prediction method between consecutive video frames is designed.
To achieve the above purpose, the technical scheme is as follows: an optical flow prediction method between consecutive video frames, comprising the following steps. Step a: extracting adjacent-frame spatial features through a deformable convolution unit; Step b: performing fusion and reconstruction on the adjacent-frame spatial features; Step c: performing deconvolution operations on the fused and reconstructed features and constructing a network stack; Step d: training the network stack with a loss function; Step e: outputting.
Further, in step a, the first layer of the deformable convolution unit is a 7×7 deformable convolutional layer, the second layer is a 5×5 traditional convolutional layer, the third and fourth layers are 3×3 traditional convolutional layers, and the stride of each convolutional layer is 2. The convolution operation used in the deformable convolutional layer is:
u(P0) = Σ_{Pn∈R} w(Pn) · x(P0 + Pn + ΔPn)
In the formula, P0 is a point of the feature map u output by the deformable convolutional layer, x denotes the input feature map of this layer or the original image, R is the grid of positions covered by the convolution kernel, w is the weight value, Pn enumerates the positions of R over the covered region of x, and ΔPn is the offset.
Further, in step b, the adjacent-frame spatial features are converted by channel-wise global average pooling into a scalar vector whose length equals the number of feature channels; the scalar vector is fed into a fully connected block comprising a fully connected layer, a ReLU activation function, another fully connected layer and a Sigmoid activation function, and the subsequent training operation yields the weight vector of the fused features; the weight vector serves as the selection parameter of the feature channels and is then weighted channel by channel, through multiplication, onto the adjacent-frame spatial features, completing the recalibration of the original features along the channel dimension.
Further, in step c, the deconvolution operation enlarges the fused and reconstructed features toward the optical flow, after which upsampling restores the resolution to the original size.
Further, in step c, the constructed network stack comprises network one, network two and network three; network one uses deformable convolution with SE-net, network two uses traditional convolution kernels without SE-net, and network three uses traditional convolution kernels with SE-net; network one is connected in parallel with network two, and both are connected in series with network three.
Further, network two passes the predicted optical flow and the loss amount to network three.
Further, in step d, the mean endpoint error between the ground truth in the dataset and the output optical flow is used as the loss function, and the loss is computed after one interpolation of the output optical flow.
Further, the strategy for training the network is: first train the first-level network independently, then fix the internal weights of this guiding network and train the lower networks, until all sub-networks are trained; a synthesis module is added at the last layer and fine-tuned with the internal parameters of the upper networks fixed.
Further, step e comprises outputting the resulting optical flow images corresponding to the adjacent video frames.
The beneficial effects of the present invention are: the convolution kernel is optimized at the level of its basic structure, replacing the fixed square convolution with deformable convolution, which improves prediction accuracy and saves computing resources; through training, the fused features are reconstructed and the parameters and channel weights are redistributed, so that the feature correlation of adjacent frames is retained to the greatest extent at only a small increase in computation cost.
Detailed description of the invention
Fig. 1 is the flow chart of the scheme of the present application;
Fig. 2 is a schematic diagram of the operation of a traditional convolutional layer;
Fig. 3 is a schematic diagram of deformable convolution kernels;
Fig. 4 is a schematic diagram of deformable convolution;
Fig. 5 is a schematic diagram of deformable pooling with added offsets;
Fig. 6 is a schematic diagram of the working principle of SE-net;
Fig. 7 is the structure diagram of a complete sub-network;
Fig. 8 is the diagram of the complete network stack.
Specific embodiment
To further illustrate the technical means adopted by the present invention to achieve the intended purpose and their effects, specific embodiments, structures, features and effects of the present invention are described in detail below in conjunction with the accompanying drawings and preferred embodiments:
The present application addresses the optical flow prediction problem between consecutive video frames. First, the application uses a deformable-convolution feature extraction method to extract adaptive features of the two adjacent frames, retaining spatial structure information to the greatest extent. Second, the two sets of spatial features are stacked and input to the SE-net feature reconstruction module, which computes their correlation and redistributes the channel weights, yielding the fused features; the fused features then undergo upsampling and deconvolution operations to obtain a coarse optical flow image, which is afterwards corrected by the network stack and the warp optimization method. The internal network weights are learned and adjusted through sample training.
An optical flow prediction method between consecutive video frames, following the flow shown in Fig. 1, comprises the following steps:
Step a: extracting adjacent-frame spatial features through the deformable convolution unit.
As the core of convolutional neural networks, the convolution operation is usually viewed as an aggregator that, over local receptive fields, fuses spatial information with information along the feature dimension. The concrete operation of a traditional convolutional layer is shown in Fig. 2: the layer takes an image or feature map as input; the kernel slides over the input, computes on the covered region, and maps the result to the next level; this computation is done channel by channel. The working principle of a traditional square kernel can be formulated as:
u(P0) = Σ_{Pn∈R} w(Pn) · x(P0 + Pn)
In the formula, P0 is a point of the output feature map u, x denotes the input feature map of this layer or the original image, R is the grid of positions covered by the convolution kernel, w is the weight value, and Pn enumerates the positions of R over the covered region of x.
The output of this layer is obtained after the above operation is performed over all channels.
In the present application, the first layer of the deformable convolution unit is a 7×7 deformable convolutional layer, the second layer is a 5×5 traditional convolutional layer, the third and fourth layers are 3×3 traditional convolutional layers, and the stride of each convolutional layer is 2.
The convolution operation used in the deformable convolutional layer is:
u(P0) = Σ_{Pn∈R} w(Pn) · x(P0 + Pn + ΔPn)
In the formula, P0 is a point of the feature map u output by the deformable convolutional layer, x denotes the input feature map of this layer or the original image, R is the grid of positions covered by the convolution kernel, w is the weight value, Pn enumerates the positions of R over the covered region of x, and ΔPn is the offset.
Previous convolutions in the optical flow prediction field use square kernels; this square form limits the room the network has to adapt to the image, generalizes poorly to the magnitude of pixel displacements between frames, and computes inefficiently. In Fig. 3, panel (a) shows the traditional convolution kernel, and panels (b), (c) and (d) show deformable convolution kernels. The present application improves the convolution kernel by introducing adaptive deformable convolution: the first layer uses a deformable kernel, whose difference from a traditional kernel is that an offset variable ΔPn is added at the position of each sampling point. Through these variables, the kernel can sample anywhere near its current position instead of being restricted to the former regular grid. In fact, the offsets added in the deformable convolutional layer are part of the network structure: they are computed by another, parallel standard convolutional layer, and can therefore also be learned end to end through gradient back-propagation.
After the offsets are learned, as shown in Fig. 4, the size and position of the deformable kernel adjust dynamically to the image content currently being recognized; the visual effect is that the sampling positions of kernels at different locations vary adaptively with the image content, adapting to geometric deformations such as the shape and size of different objects.
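As a concrete illustration, the following PyTorch sketch assembles a feature-extraction unit matching the layer layout described above (a 7×7 deformable layer followed by a 5×5 and two 3×3 conventional layers, all with stride 2), with the offsets produced by a parallel standard convolution as described. The channel widths are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFeatureUnit(nn.Module):
    """Sketch of the 4-layer extraction unit: 7x7 deformable conv, then
    5x5 and two 3x3 conventional convs, each with stride 2. The channel
    widths (64/128/256/256) are illustrative assumptions."""
    def __init__(self, in_ch=3):
        super().__init__()
        # Parallel standard conv predicts 2 offsets (dx, dy) per kernel tap.
        self.offset = nn.Conv2d(in_ch, 2 * 7 * 7, kernel_size=7, stride=2, padding=3)
        self.deform = DeformConv2d(in_ch, 64, kernel_size=7, stride=2, padding=3)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        off = self.offset(x)                # learned offsets ΔPn, trained by backprop
        x = self.relu(self.deform(x, off))  # sampling grid deformed per position
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        return self.relu(self.conv4(x))
```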
Step b: performing fusion and reconstruction on the adjacent-frame spatial features.
In this step, the adjacent-frame spatial features are converted by channel-wise global average pooling into a scalar vector whose length equals the number of feature channels. Convolutional neural networks for optical flow prediction usually build two relatively independent network streams for the pair of adjacent frames and merge the two streams at some stage for subsequent processing. Optical flow expresses the motion and three-dimensional information of image objects, so finding and preserving the correlation between the two streams is the key of this step. The SE-net used in this application establishes interdependencies between the channels of the fused feature maps and performs feature recalibration (reconstruction) as the input of the next step. Referring to Fig. 6, given an input X with C′ feature channels and spatial size W′ × H′, a series of convolution transforms Ftr produces a feature map U with C channels and spatial size W × H. Three subsequent operations realize the reconstruction of the feature map; the parameter weights of each operation are obtained through learning and training, finally yielding the importance of the C feature channels, on which basis important features are promoted and minor features suppressed.
Squeeze: the feature map is compressed along the spatial dimensions, turning each two-dimensional feature channel into a single real number. This real number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels. The essence of this step is a global average pooling: the two-dimensional feature uc of channel c in U is summed over the whole area and averaged, so each channel finally yields one scalar zc; the C channels combine into a one-dimensional vector z of length C, which expresses the global distribution of the responses of the feature map U over the C channels and also gives layers close to the input a global receptive field. The concrete operation can be formulated as:
zc = Fsq(uc) = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} uc(i, j)
Excitation: the scalar vector is fed into a block comprising a fully connected layer, a ReLU activation function, another fully connected layer and a Sigmoid activation function, and the subsequent training operation yields the weight vector of the fused features. The Excitation operation is a mechanism similar to the gates in recurrent neural networks. A weight is generated for each feature channel through the parameters W, which are learned to explicitly model the correlation between feature channels. FC–ReLU–FC–Sigmoid produces C scalars between 0 and 1, which act as the per-channel weights; each channel of the original output is then weighted by its corresponding weight (every element of the channel is multiplied by the weight), giving the newly weighted features. The following formula illustrates this step. First z is multiplied by W1, a fully connected (FC, Fully Connected Layer) operation; the dimension of W1 is C/r × C, where r is a scaling parameter, taken as 16 in this scheme, whose purpose is to reduce the number of channels and thus the amount of computation. Since z has dimension C, the result W1z has dimension C/r. A ReLU (Rectified Linear Unit) layer is then applied, whose output dimension stays C/r and which is adjustable during training. Multiplying by W2 is again a fully connected layer; the dimension of W2 is C × C/r, so the output dimension is C. Finally a sigmoid activation function yields the weight vector s of length C:
s = Fex(z, W) = σ(g(z, W)) = σ(W2 δ(W1 z))
where δ denotes the ReLU function and σ the sigmoid function.
Reweight: the weight vector serves as the selection parameter of the feature channels and is weighted channel by channel, through multiplication, onto the adjacent-frame spatial features, completing the recalibration of the original features along the channel dimension. The output weights of Excitation are regarded as the importance of each feature channel after feature selection and are then weighted onto the previous features channel by channel through multiplication:
X̃c = Fscale(uc, sc) = sc · uc
In the formula, uc is the c-th channel feature of U, sc is the c-th component of the weight vector s, and X̃c is the output.
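As a minimal sketch of the squeeze–excitation–reweight pipeline described above (global average pooling, FC–ReLU–FC–Sigmoid with reduction ratio r = 16, then channel-wise re-weighting), assuming fused features of shape (N, C, H, W):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pool to one scalar per channel.
    Excitation: FC-ReLU-FC-Sigmoid yields per-channel weights in (0, 1).
    Reweight: multiply each channel of the fused features by its weight."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1: (C/r) x C
        self.fc2 = nn.Linear(channels // r, channels)   # W2: C x (C/r)

    def forward(self, u):
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                # squeeze: z of length C
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation: weights s
        return u * s.view(n, c, 1, 1)                         # reweight per channel
```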
Step c: performing deconvolution operations on the fused and reconstructed features, and constructing a network stack.
The deconvolution operation enlarges the fused and reconstructed features toward the optical flow; upsampling then restores the resolution to the original size. Deconvolution is the inverse operation of convolution; its main function is to enlarge the feature maps and raise the image resolution while keeping the content rich. The output F (a coarse optical flow) of the previous deconvolution layer is added into each deconvolution step as a reference; in this way both the high-level information carried by the coarser feature maps and the fine local information provided by the low-level feature maps are retained. As shown in Fig. 7, four deconvolution operations are performed in this step, and the final optical flow map is restored to the clarity of the original image by upsampling (bilinear interpolation). At this point the whole process of one sub-network ends.
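A hedged sketch of one of the four refinement levels, in the spirit of FlowNet-style decoders: a transposed convolution enlarges the fused features, the coarse flow F of the previous level is upsampled and concatenated as the reference, and a small convolution predicts the finer flow. Layer widths and the exact wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineLevel(nn.Module):
    """One deconvolution step: enlarge features, take the coarse flow of
    the previous level as reference, and predict a finer flow field."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.predict = nn.Conv2d(out_ch + 2, 2, kernel_size=3, padding=1)

    def forward(self, feat, coarse_flow):
        up = torch.relu(self.deconv(feat))                    # double the resolution
        flow_ref = F.interpolate(coarse_flow, scale_factor=2,
                                 mode='bilinear', align_corners=False) * 2.0
        # "* 2.0" rescales flow magnitudes to the doubled resolution.
        return self.predict(torch.cat([up, flow_ref], dim=1))
```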
To optimize the final result and improve network performance, this patent adds a network stack and warping in the subsequent part.
The constructed network stack comprises network one, network two and network three — three sub-networks with different structures and internal modules. Network one uses deformable convolution with SE-net; network two uses traditional convolution kernels without SE-net; network three uses traditional convolution kernels with SE-net. Network one is connected in parallel with network two, and both are connected in series with network three. Experiments show that a network using deformable convolution without the SE-net module fails to converge, so that combination is not configured here. Experimental results prove that stacking the sub-networks according to the combined structure of Fig. 8 works best, producing smooth predicted optical flow; a sketch of this wiring follows below.
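To make the wiring concrete, here is a minimal Python sketch of the stack of Fig. 8 under stated assumptions: each sub-network is a callable mapping a frame pair (plus, for network three, the upstream results) to a flow field, and warp_fn stands for the warp operation detailed in the next subsection. The signature of network three is an illustrative assumption, not fixed by the patent.

```python
import torch

def run_stack(net1, net2, net3, frame1, frame2, warp_fn):
    """Sketch of Fig. 8: net1 (deformable conv + SE-net) and net2
    (conventional conv, no SE-net) run in parallel on the frame pair;
    net3 (conventional conv + SE-net) runs in series on their results."""
    flow1 = net1(frame1, frame2)                # parallel branch one
    flow2 = net2(frame1, frame2)                # parallel branch two
    i2_hat = warp_fn(frame1, flow2)             # I'2 = Warp(I1, F)
    loss_amount = torch.norm(frame2 - i2_hat)   # C = ||I2 - I'2||, passed downstream
    return net3(frame1, frame2, flow1, flow2, loss_amount)
```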
Preferably, network two passes the predicted optical flow and the loss amount to network three, making the lower network concentrate on learning the residual motion of the adjacent video frames; the loss is obtained through the warp operation, as shown below:
I′2 = Warp(I1, F)
C = ‖I2 − I′2‖
In the formulas, I1 and I2 are the adjacent video frames, F is the optical flow output by the sub-network, I′2 is the image mapped from the first frame and the output flow (an approximation of the second frame), and C is the loss amount. The warp operation effectively prevents over-fitting of the stacked network while improving accuracy.
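A minimal PyTorch sketch of the warp operation, assuming the flow is given in pixels with shape (N, 2, H, W) and channel order (dx, dy); the grid normalization follows the usual grid_sample convention:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N,C,H,W) along flow (N,2,H,W): I'2 = Warp(I1, F)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device), indexing='ij')
    grid_x = xs.unsqueeze(0) + flow[:, 0]               # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]               # displaced y coordinates
    grid = torch.stack([2.0 * grid_x / (w - 1) - 1.0,   # normalize to [-1, 1]
                        2.0 * grid_y / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def warp_loss(i1, i2, flow):
    return torch.norm(i2 - warp(i1, flow))              # C = ||I2 - I'2||
```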
Step d: training the network stack with a loss function.
The learning mode of the network in this application is supervised learning: the mean endpoint error between the ground truth in the dataset and the output optical flow is used as the loss function to continuously adjust the network parameters, and the loss is computed after one interpolation of the output flow. The dataset and strategy adopted in training therefore largely determine network performance. Although the network stack effectively improves prediction accuracy, it has the following drawbacks: 1. the network structure is complicated and huge; training is slow, and over-fitting or non-convergence occurs easily; 2. the downstream sub-networks of the multi-level, multi-branch structure share information, which not only propagates errors but can also confuse the loss computation; 3. the stacked networks require a large computation cost and easily run out of space on devices with small memory. Hence a step-wise training strategy is adopted here. The concrete strategy is: first train the first-level network independently; then fix the internal weights of this guiding network and train the lower networks, until all sub-networks are trained; finally add a synthesis module at the last layer and fine-tune it with the internal parameters of the upper networks fixed. The original samples used for training should come with ground truth, and suitable sample data should be chosen according to the model's application; the samples should contain occlusion, blur and large displacements, so that the network learns to handle these situations.
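As a hedged sketch of the supervision described above — mean endpoint error against the ground truth after one interpolation of the output flow, plus the step-wise freeze-and-train schedule — with the caveats that the helper names are hypothetical and magnitude rescaling of the interpolated flow is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def epe_loss(pred_flow, gt_flow):
    """Mean endpoint error: the predicted flow is bilinearly interpolated
    to the ground-truth resolution, then the Euclidean distance between
    predicted and true flow vectors is averaged over all pixels."""
    pred = F.interpolate(pred_flow, size=gt_flow.shape[-2:],
                         mode='bilinear', align_corners=False)
    return torch.norm(pred - gt_flow, p=2, dim=1).mean()

def freeze(module):
    """Fix a trained sub-network's internal weights."""
    for p in module.parameters():
        p.requires_grad = False

# Step-wise schedule (illustrative):
#   train(net1)                     # first-level network alone
#   freeze(net1); train(net2)       # lower networks under its guidance
#   freeze(net2); train(net3)
#   freeze(net3); train(synthesis)  # fine-tune only the final module
```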
Step e: outputting.
Step e comprises outputting the resulting optical flow images corresponding to the adjacent video frames.
The present invention has been described above with reference to preferred embodiments, but its protection scope is not limited thereto. Various improvements may be made, and components may be replaced with equivalents, without departing from the scope of the invention. As long as no structural conflict exists, the technical features mentioned in the various embodiments may be combined in any manner. Any reference signs in the claims shall not be construed as limiting the claims concerned; from whatever point of view, the embodiments are to be considered as illustrative and not restrictive. Therefore, all technical solutions falling within the scope of the appended claims fall within the protection scope of the present invention.

Claims (9)

1. An optical flow prediction method between consecutive video frames, characterized by comprising the following steps:
Step a: extracting adjacent-frame spatial features through a deformable convolution unit;
Step b: performing fusion and reconstruction on the adjacent-frame spatial features;
Step c: performing deconvolution operations on the fused and reconstructed features, and constructing a network stack;
Step d: training the network stack with a loss function;
Step e: outputting.
2. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that, in step a, the first layer of the deformable convolution unit is a 7×7 deformable convolutional layer, the second layer is a 5×5 traditional convolutional layer, the third and fourth layers are 3×3 traditional convolutional layers, and the stride of each convolutional layer is 2; the convolution operation used in the deformable convolutional layer is:
u(P0) = Σ_{Pn∈R} w(Pn) · x(P0 + Pn + ΔPn)
In the formula, P0 is a point of the feature map u output by the deformable convolutional layer, x denotes the input feature map of this layer or the original image, R is the grid of positions covered by the convolution kernel, w is the weight value, Pn enumerates the positions of R over the covered region of x, and ΔPn is the offset.
3. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that, in step b, the adjacent-frame spatial features are converted by channel-wise global average pooling into a scalar vector whose length equals the number of feature channels; the scalar vector is fed into a fully connected block comprising a fully connected layer, a ReLU activation function, another fully connected layer and a Sigmoid activation function, and the subsequent training operation yields the weight vector of the fused features; the weight vector serves as the selection parameter of the feature channels and is then weighted channel by channel, through multiplication, onto the adjacent-frame spatial features, completing the recalibration of the original features along the channel dimension.
4. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that, in step c, the deconvolution operation enlarges the fused and reconstructed features toward the optical flow, after which upsampling restores the resolution to the original size.
5. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that, in step c, the constructed network stack comprises network one, network two and network three; network one uses deformable convolution with SE-net, network two uses traditional convolution kernels without SE-net, and network three uses traditional convolution kernels with SE-net; network one is connected in parallel with network two, and both are connected in series with network three.
6. The optical flow prediction method between consecutive video frames according to claim 5, characterized in that network two passes the predicted optical flow and the loss amount to network three.
7. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that, in step d, the mean endpoint error between the ground truth in the dataset and the output optical flow is used as the loss function, and the loss is computed after one interpolation of the output optical flow.
8. The optical flow prediction method between consecutive video frames according to claim 1, characterized in that the strategy for training the network is: first train the first-level network independently, then fix the internal weights of this guiding network and train the lower networks, until all sub-networks are trained; a synthesis module is added at the last layer and fine-tuned with the internal parameters of the upper networks fixed.
9. The optical flow prediction method between consecutive video frames according to claim 1, wherein step e comprises outputting the resulting optical flow images corresponding to the adjacent video frames.
CN201910645583.6A 2019-07-17 2019-07-17 Optical flow prediction method between consecutive video frames Pending CN110363794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910645583.6A CN110363794A (en) 2019-07-17 2019-07-17 Optical flow prediction method between consecutive video frames

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910645583.6A CN110363794A (en) 2019-07-17 2019-07-17 Optical flow prediction method between consecutive video frames

Publications (1)

Publication Number Publication Date
CN110363794A 2019-10-22

Family

ID=68219906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910645583.6A Pending CN110363794A (en) 2019-07-17 2019-07-17 Light stream prediction technique between video successive frame

Country Status (1)

Country Link
CN (1) CN110363794A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830808A (en) * 2019-11-29 2020-02-21 合肥图鸭信息科技有限公司 Video frame reconstruction method and device and terminal equipment
CN110944212A (en) * 2019-11-29 2020-03-31 合肥图鸭信息科技有限公司 Video frame reconstruction method and device and terminal equipment
CN111683256A (en) * 2020-08-11 2020-09-18 蔻斯科技(上海)有限公司 Video frame prediction method, video frame prediction device, computer equipment and storage medium
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN113838102A (en) * 2021-09-26 2021-12-24 南昌航空大学 Optical flow determination method and system based on anisotropic dense convolution
CN114511485A (en) * 2022-01-29 2022-05-17 电子科技大学 Compressed video quality enhancement method based on cyclic deformable fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710826A (en) * 2018-04-13 2018-10-26 燕山大学 A kind of traffic sign deep learning mode identification method
CN109784150A (en) * 2018-12-06 2019-05-21 东南大学 Video driving behavior recognition methods based on multitask space-time convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710826A (en) * 2018-04-13 2018-10-26 燕山大学 A kind of traffic sign deep learning mode identification method
CN109784150A (en) * 2018-12-06 2019-05-21 东南大学 Video driving behavior recognition methods based on multitask space-time convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXEY DOSOVITSKIY: "FlowNet: Learning Optical Flow with Convolutional Networks", IEEE *
EDDY ILG: "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks", IEEE *
JIFENG DAI: "Deformable Convolutional Networks", arXiv:1703.06211v3 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830808A (en) * 2019-11-29 2020-02-21 合肥图鸭信息科技有限公司 Video frame reconstruction method and device and terminal equipment
CN110944212A (en) * 2019-11-29 2020-03-31 合肥图鸭信息科技有限公司 Video frame reconstruction method and device and terminal equipment
CN111683256A (en) * 2020-08-11 2020-09-18 蔻斯科技(上海)有限公司 Video frame prediction method, video frame prediction device, computer equipment and storage medium
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN113838102A (en) * 2021-09-26 2021-12-24 南昌航空大学 Optical flow determination method and system based on anisotropic dense convolution
CN113838102B (en) * 2021-09-26 2023-06-06 南昌航空大学 Optical flow determining method and system based on anisotropic dense convolution
CN114511485A (en) * 2022-01-29 2022-05-17 电子科技大学 Compressed video quality enhancement method based on cyclic deformable fusion
CN114511485B (en) * 2022-01-29 2023-05-26 电子科技大学 Compressed video quality enhancement method adopting cyclic deformable fusion

Similar Documents

Publication Publication Date Title
CN110363794A (en) Optical flow prediction method between consecutive video frames
CN113874883A (en) Hand pose estimation
CN110188795A (en) Image classification method, data processing method and device
CN106250931A (en) High-resolution image scene classification method based on random convolutional neural networks
CN105069752B (en) Optical ocean clutter suppression method based on space-time chaos
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN104599290B (en) Video sensing node-oriented target detection method
CN110570363A (en) Image defogging method based on Cycle-GAN with pyramid pooling and multi-scale discriminator
CN110717532A (en) Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model
TWI226193B (en) Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus
CN107680116A (en) Method for monitoring moving objects in video sequences
CN110705344A (en) Crowd counting model based on deep learning and implementation method thereof
CN110222760A (en) Fast image processing method based on the Winograd algorithm
CN112288776B (en) Target tracking method based on multi-time step pyramid codec
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN110909591A (en) Adaptive non-maximum suppression processing method for pedestrian image detection using coding vectors
CN115019302A (en) Improved YOLOX target detection model construction method and application thereof
CN113688765A (en) Attention mechanism-based action recognition method for adaptive graph convolution network
CN111583340A (en) Method for reducing monocular camera pose estimation error rate based on convolutional neural network
CN104408697A (en) Image super-resolution reconstruction method based on genetic algorithm and regular prior model
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN116740121A (en) Straw image segmentation method based on special neural network and image preprocessing
CN115546654A (en) Grouping mixed attention-based remote sensing scene image classification method
CN117809200A (en) Multi-scale remote sensing image target detection method based on enhanced small target feature extraction
WO2021057091A1 (en) Viewpoint image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191022