CN109086807A

CN109086807A - A kind of semi-supervised light stream learning method stacking network based on empty convolution

Info

Publication number: CN109086807A
Application number: CN201810779483.8A
Authority: CN
Inventors: 项学智; 张荣芳; 翟明亮; 吕宁; 郭鑫立; 王帅; 于泽婷; 张玉琦
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2018-12-25
Anticipated expiration: 2038-07-16
Also published as: CN109086807B

Abstract

The semi-supervised learning optical flow approach based on convolutional neural networks that the present invention provides a kind of, belongs to network design field.Method provided by the invention can be trained for tape label and the blended data without label, and it designs one kind and blocks perception loss function, the end point error cost function for being used for supervised learning is combined with the data item and smooth item for being used for unsupervised learning, construct a kind of semi-supervised learning optical flow estimation, using stacking network structure in the network architecture, empty convolution is introduced in convolutional layer to increase receptive field, and design and block sensing layer to estimate occlusion area, which can carry out semi-supervised light stream study end-to-endly.Method provided by the invention can be improved light stream estimated accuracy, and also propose that one kind blocks the perception semi-supervised trained network of loss function, and a kind of stacking network structure is designed in the network architecture to further promote network performance.

Description

A kind of semi-supervised light stream learning method stacking network based on empty convolution

Technical field

The present invention provides a kind of light stream estimation methods, and in particular to the semi-supervised light stream of network is stacked based on empty convolution Learning method belongs to network design field.

Background technique

Light stream estimation can be used as a supervised learning problem, and the supervised learning method based on convolutional neural networks is solving Preferable effect is achieved on light stream estimation problem, but there are still many problems for supervised learning optical flow approach.Generation true first The data true value on boundary is difficult to obtain, and lacks the limitation for largely becoming supervised learning optical flow approach with labeled data, is secondly to avoid Motion information is lost, and is operated in existing many full convolutional network frameworks without pondization, however convolution operation can still lose The detailed information of image, these problems remain fatal for Pixel-level task.Meanwhile occlusion issue is also that light stream is estimated Urgent problem in meter.

In view of the above problems, the invention proposes a kind of semi-supervised learning optical flow estimation based on convolutional neural networks, Empty convolution is introduced in network convolutional layer to increase receptive field, and is designed one layer and blocked sensing layer for occlusion area involvement network instruction Practice process to improve light stream estimated accuracy, and then propose that one kind blocks the perception semi-supervised trained network of loss function, finally exists A kind of stacking network structure is designed in the network architecture further promotes network performance.

Summary of the invention

The present invention provides a kind of semi-supervised light stream learning methods that network is stacked based on empty convolution, it is therefore intended that is directed to Tape label and blended data without label are trained, and are designed one kind and blocked perception loss function, and supervised learning will be used for End point error cost function is combined with the data item and smooth item for being used for unsupervised learning, constructs a kind of semi-supervised learning light stream Model is named as SA-Net, and a kind of stacking network structure is used in the network architecture, introduces empty convolution in convolutional layer to increase The receptive field of convolution kernel, and design one layer and block sensing layer to estimate occlusion area, which being capable of semi-supervised end-to-end Practise light stream.

The object of the present invention is achieved like this:

Step 1: the 1st light stream of building learns sub-network, and it is named as SA-Net_1, SA-Net_1 light stream learning network It using full convolution framework, is constituted by shrinking and expanding 2 parts, 4 layers of standard volume are respectively adopted to 2 width images first in constriction Characteristic pattern is extracted in product operation, 2 width characteristic patterns input relevant layers is carried out characteristic matching and merging later, then pass through 4 layers of cavity Convolutional layer extracts Optical-flow Feature, and expansion includes 4 layers of warp lamination, and the light stream that constriction is extracted is reverted to original image Resolution ratio.

Step 2: the 2nd light stream of building learns sub-network, and it is named as SA-Net_2, SA-Net_2 light stream learning network It using full convolution framework, is formed by shrinking and expanding 2 parts, input layer will input network after 2 width image stacks, network passes through 4 Layer Standard convolution layer and 4 layers of empty convolutional layer extract the light stream between image pair, and constriction is made of 4 layers of warp lamination, will The light stream that constriction is extracted reverts to original image resolution.

Step 3: 2 stackings networks of building, after SA-Net_1 sub-network connect 2 SA-Net_2 sub-networks thus The 1st stacking network is constituted, each sub-network junction uses deformation technology by the 2nd width image to the 1st width anamorphose, later Using deformed image and the 1st width image as the input of next sub-network, the light stream increment of 2 width images is calculated.2nd heap Folded network and the 1st stacking network share network architecture and parameter, the input terminal input t moment and t+1 of the 1st stacking network The 2 width images at moment extract the positive light stream between image pair, while the image at t and t+1 moment being exchanged and is sequentially input to the In 2 stacking networks, the reversed light stream between image pair is extracted.

Step 4: 2 stackings networks of training, wherein only need trained 1st stacking network, the 2nd network share its more Network weight after new, when synchronizing the sub-network of 2 stacking network corresponding positions of training, every 1 layer of difference of expansion is defeated Each layer of positive light stream and reversed light stream are inputted simultaneously and block perception by the positive light stream and reversed light stream of different resolution out Layer, goes out occlusion area by consistency desired result criteria function, stops positive and negative one after positive light stream restores to original resolution The verification of cause property.

Step 5: designing one kind blocks perception loss function, it can be with semi-supervised learning network.It will be used for the end of supervised learning Point tolerance cost function is combined with the data item and smooth item for being used for unsupervised learning, can both be trained to tape label data, Can also be to the training of the data of no label, wherein data item is to be decomposed based on picture structure texture and Census convert to design It is constant it is assumed that smooth item designs smooth item using the isotropic diffusion based on image-driven, loss function can be by anti- To propagation trained network semi-supervised end-to-end.

Step 6: the training stage, first in network input input largely without label data, by loss weight summation Total losses is obtained, later using back-propagation algorithm training network, initial network weight is obtained, later with a small amount of tape label number According to training network, network model to the end is obtained.

Step 7: tested using trained model, inputs as image pair, export as corresponding light stream.

Compared with prior art, present invention has an advantage that

Method provided by the invention introduces empty convolution in network convolutional layer to increase receptive field, and designs one layer and block Occlusion area is incorporated network training process to improve light stream estimated accuracy by sensing layer, and then proposes that one kind blocks perception loss letter Number carrys out semi-supervised trained network, and a kind of stacking network structure is also designed in the network architecture to further promote internetworking Energy.

Detailed description of the invention

Fig. 1 is flow chart of the present invention.

Fig. 2 is study sub-network SA-Net_1 architecture diagram of the invention.

Fig. 3 is study sub-network SA-Net_2 architecture diagram of the invention.

Fig. 4 is stacking learning network architecture diagram of the invention.

Fig. 5 is empty convolution schematic diagram of the invention.

Fig. 6 is occlusion area error schematic diagram of the invention.

Fig. 7 is semi-supervised learning light stream network SA-Net general frame figure of the invention.

Specific embodiment

The present invention is more fully described with reference to the accompanying drawing.

Step 1: as shown in Fig. 2, 1 light stream of building learns sub-network SA-Net_1, first in constriction by 4 layers Standard convolution layer extracts the characteristic pattern of t moment Yu t+1 moment image respectively, helps net mate feature by a relevant layers Figure, finds the corresponding relationship between characteristic pattern, the correlation function of relevant layers is defined as follows:

WhereinThe characteristic pattern of t moment and t+1 moment is respectively indicated, π is indicated centered on pixel x, size K* The agglomerate of K.

2 agglomerates in 2 width images centered on x1 and x2 are taken respectively, and then corresponding position, which is multiplied, to be added, relevant layers pair Entire image carries out relevant operation, while the feature of 2 width images being merged, and extracts higher by 4 layers of empty convolutional layer later Secondary feature, the visible attached drawing 5 of empty convolution schematic diagram, Fig. 5 indicate that 1 size is 3*3, is divided into 1 empty convolution kernel, due to Pond layer is not included in network, if increasing receptive field using larger-size convolution kernel in deep layer network, calculation amount will Can be very huge, using the convolution kernel in such as Fig. 5, receptive field can be increased in the case where not increasing network parameter.It is got the bid The convolution kernel size of quasi- convolutional layer be 3*3, step-length 2, empty convolution kernel size be 3*3, step-length 1, interval numerical value from it is small to Exponentially form increases greatly, respectively 2,4,8,16, and then ReLU layers after every layer of convolutional layer and empty convolutional layer, other ginsengs Fig. 2 visible with details is arranged in number.The expansion of network is made of 4 layers of warp lamination, convolution kernel size be 3*3, step-length 2, ReLU layers are connected after every layer of deconvolution, are operated by a series of deconvolution, characteristic pattern reverts to original image resolution size, obtains Light stream to the end.

Step 2: as shown in figure 3, the 2nd light stream of building learns sub-network SA-Net_2, by the figure of t moment and t+1 moment As inputting network in a stacked fashion, constriction first is made of 4 layers of Standard convolution layer and 4 layers of empty convolutional layer, and network is logical It crosses Standard convolution layer and empty convolutional layer extracts Optic flow information, expansion is made of 4 layers of warp lamination, light stream is reverted to Original image resolution.Wherein the size of the convolution kernel of Standard convolution layer is 3*3, and the convolution kernel of step-length 2, empty convolutional layer is big Small is 3*3, and step-length 1, interval is 5*5 by exponential form growth, respectively 2,4,8,12, the convolution kernel size of warp lamination, Step-length is 2, all connected nonlinearity ReLU layers after each layer of convolution.Since SA-Net_1 and SA-Net_2 sub-network does not include entirely The input of articulamentum, two networks can be the image of arbitrary size.

Step 3: the identical stacking learning network of 2 frameworks is trained simultaneously, for learning two images in a manner of semi-supervised Between positive light stream and reversed light stream, each stacking network is by 1 SA-Net_1 sub-network and 2 SA-Net_2 sub-network heaps It is folded to form, in order to which the result to a upper sub-network is assessed, while making the easier calculating incremental update of whole network, this In deformation operation is added between each stacking sub-network, with the output light stream of a sub-network to t+1 moment image into Row deformation operation, obtained image can be used following formula to indicate:

Wherein I_t+1,It respectively indicates with deformed image before deforming, u, v respectively indicate pixel x, the light stream value at y.

It will be deformedThe input of t moment image and their luminance errors as next sub-network, under One sub- e-learningIncrement light stream between t moment image.Here deformation operation passes through guidable bilinear interpolation Algorithm is realized, to guarantee that stacking network can train end-to-end.In the training stage, it is only necessary to the 1st stacking network of training, 2nd stacking network share its weight, to stacking the Training strategy that network uses for SA-Net_1 sub-network trained first, after being Two sub-networks provide good initial value, then keep the weight of SA-Net_1 constant, the next stacking sub-network SA- of training Net_2, next the weight of 2 sub-networks of fixed front is constant, and the 3rd stacking sub-network of training updates weight.SA-Net The number of iterations when increasing the depth of network by way of stacking network, while increasing network training, to improve network Overall performance.

Step 4: as shown in fig. 6, one pixel is by positive light stream mapping and reversed light stream in de-occlusion region Original location of pixels will be returned to after mapping, and in occlusion area, pixel is by positive light stream mapping and backlight Pixel position and original pixel point position after stream mapping can have deviation, and occlusion area is the concentration zones that such error occurs Domain, therefore the larger screening of light stream evaluated error can be obtained by carrying out positive and negative consistency desired result to positive light stream and backlight stream Keep off region.Positive and negative consistency desired result discriminant function is as follows:

WhereinWithThe positive light stream and reversed light stream at pixel x are respectively indicated, ε is the threshold of discriminant function Value.

Labeling function O is blocked in definition_x, when discriminant function be greater than threshold value when indicate this region light stream solve there are larger mistakes Difference, it will be determined as occlusion area, this season O_x=1.When discriminant function be less than threshold value when indicate this region light stream solve compared with It is accurate, it will be determined as de-occlusion region, enable O_x=0.When training 2 are stacked with every 1 layer positive light of network expansion part Stream and backlight stream carry out consistency desired result, estimate occlusion area to be used for network training process.Occlusion area participates in network instruction Practice process, is conducive to improve light stream precision.

Step 5: design constraint occlusion area pixel blocks perception loss function, it is adapted to semi-supervised learning light stream net Network, loss function are only applied to the training stage, train light stream learning network semi-supervisedly by backpropagation.Half prison of the invention Educational inspector practises optical flow estimation compared with supervised learning optical flow estimation, is no longer influenced by light stream true value and obtains difficult limitation, can both have The study light stream of supervision, study light stream that can also be unsupervised are more suitable for solving the problems, such as the extraction of motion information of real world.

Loss function E_lossIt is as follows:

E_loss=α E_epe+(1-α)(E_data+γE_smooth), (4)

Wherein E_epeFor end point error cost function, E_dataFor data item constraint cost function, E_smoothMotion smoothing constraint, α and γ is weight, and when input data, which is, label data, α value is 1, when input data is no label data, α value It is 0.

End point error cost function E_epeIt is as follows:

Wherein m and n is respectively the width and height of input picture, u_i,jAnd v_i,jIt Wei not predict obtained light stream value, u_i′_,j And v_i′_,jFor corresponding light stream true value.

Data item cost function E_dataIt is as follows:

Wherein κ_xFor the light stream at pixel x, N is pixel number, and T (x) indicates the value of amount of texture at pixel x, C (x) value of the Census transformation at pixel x is indicated, φ is robust penalty (x²+δ²)^α, δ=0.001, O_xMark is blocked in expression Remember function.

Picture breakdown is the part I comprising geological information by picture structure texture algorithm_S(x) and include image texture information Part I_T(x), that is, there is following formula:

I (x)=I_S(x)+I_T(x), (7)

The wherein texture part I of image_T(x) it is hardly illuminated by the light, the influence of shade isocandela variation.

Census transformation is a kind of nonlinear transformation, has conservation property in the case where illumination is acutely monotonically changed, will be in image Pixel in a certain rectangular transform window is indicated with a string of binary sequences, carries out being applied to number after simply improving to Census transformation According in item constraint, specific implementation formula is as follows:

Wherein W (p) is indicated using p as the rectangular transform window of center pixel, and q is other points in rectangular window, I (p), I (q) Gray value respectively at p, q pixel,For character string connector, σ is the threshold value of discriminate.

Motion smoothing cost function is as follows:

WhereinWithThe respectively gradient value of light stream horizontal direction and vertical direction.

Step 6: the input terminal in network inputs a small amount of tape label data and largely without label data, passes through different damages It loses weight to sum to obtain total losses, and semi-supervised learning network is trained using back-propagation algorithm.

Step 7: the input tape label data and without label data in trained model, to semi-supervised learning light stream net Network is tested, and is exported as corresponding dense optical flow.

Claims

1. a kind of semi-supervised light stream learning method for stacking network based on empty convolution, it is characterised in that:

Step 1: the 1st light stream of building learns sub-network, and it is named as SA-Net_1, SA-Net_1 light stream learning network uses Full convolution framework, is constituted by shrinking and expanding 2 parts, and 4 layers of Standard convolution behaviour are respectively adopted to 2 width images first in constriction Make extraction characteristic pattern, 2 width characteristic patterns input relevant layers is subjected to characteristic matching and merging later, then pass through 4 layers of empty convolution Layer extracts Optical-flow Feature, and expansion includes 4 layers of warp lamination, and the light stream that constriction is extracted is reverted to original image and is differentiated Rate；

Step 2: the 2nd light stream of building learns sub-network, and it is named as SA-Net_2, SA-Net_2 light stream learning network uses Full convolution framework, is formed by shrinking and expanding 2 parts, and input layer will input network after 2 width image stacks, and network passes through 4 layers of mark Quasi- convolutional layer and 4 layers of empty convolutional layer extract the light stream between image pair, and constriction is made of 4 layers of warp lamination, will shrink The light stream of extracting section reverts to original image resolution；

Step 3: 2 stacking networks of building, connect 2 SA-Net_2 sub-networks after SA-Net_1 sub-network to constitute 1st stacking network, each sub-network junction use deformation technology by the 2nd width image to the 1st width anamorphose, will become later The input of image and the 1st width image as next sub-network after shape calculates the light stream increment of 2 width images；2nd stacking net The input terminal of network and the 1st stacking network share network architecture and parameter, the 1st stacking network inputs t moment and t+1 moment 2 width images extract the positive light stream between image pair, while the image of t and t+1 moment being exchanged and are sequentially input to the 2nd heap In folded network, the reversed light stream between image pair is extracted；

Step 4: 2 stackings networks of training, wherein only need the 1st stacking network of training, the 2nd network share its update after Network weight, synchronize training 2 stacking network corresponding position sub-network when, every 1 layer of expansion exports not respectively With the positive light streams and reversed light stream of resolution ratio, each layer of positive light stream and reversed light stream are inputted simultaneously and block sensing layer, Go out occlusion area by consistency desired result criteria function, stops positive converse consistency1 after positive light stream restores to original resolution Verification；

Step 5: designing one kind blocks perception loss function, the endpoint for being used for supervised learning can be missed with semi-supervised learning network Poor cost function is combined with the data item and smooth item for being used for unsupervised learning, both can be to the training of tape label data, can also With the data training to no label, wherein data item is constant to design based on the decomposition of picture structure texture and Census transformation It is assumed that using smooth item is designed based on the isotropic diffusion of image-driven, loss function passes through backpropagation end-to-end Semi-supervised trained network；

Step 6: the training stage, first in network input input largely without label data, by summing to obtain to loss weight Total losses obtains initial network weight, is instructed later with a small amount of tape label data later using back-propagation algorithm training network Practice network, obtains network model to the end；

2. a kind of semi-supervised light stream learning method for stacking network based on empty convolution according to claim 1, feature Be: the process described in step 1 for extracting characteristic pattern is as follows:

The characteristic pattern for extracting t moment Yu t+1 moment image respectively by 4 layers of Standard convolution layer is helped by a relevant layers Net mate characteristic pattern finds the corresponding relationship between characteristic pattern, and the correlation function of relevant layers is defined as follows:

WhereinThe characteristic pattern of t moment and t+1 moment is respectively indicated, π is indicated centered on pixel x, size is K*K's Agglomerate.

3. a kind of semi-supervised light stream learning method for stacking network based on empty convolution according to claim 1, feature Be: deformation operation process described in step 3 is as follows: with the output light stream of a sub-network to t+1 moment image into Row deformation operation, obtained image can be used following formula to indicate:

4. a kind of semi-supervised light stream learning method for stacking network based on empty convolution according to claim 1, feature Be: consistency desired result criteria function occlusion area process described in step 4 is specific as follows:

Positive and negative consistency desired result function formula is as follows

WhereinWithThe positive light stream and reversed light stream at pixel x are respectively indicated, ε is the threshold value of discriminant function；

Labeling function O is blocked in definition_x, indicate that the light stream in this region is solved there are large error when discriminant function is greater than threshold value, it will It can be determined as occlusion area, this season O_x=1；It is more smart to indicate that the light stream in this region is solved when discriminant function is less than threshold value It is quasi-, it will to be determined as de-occlusion region, enable O_x=0；Every 1 layer positive light stream stacking network expansion part to 2 when training and Backlight stream carries out consistency desired result, estimates occlusion area to be used for network training process.

5. a kind of semi-supervised light stream learning method for stacking network based on empty convolution according to claim 1, feature It is: loss function E described in step 5_lossIt is as follows:

E_loss=α E_epe+(1-α)(E_data+γE_smooth), (4)

Wherein E_epeFor end point error cost function, E_dataFor data item constraint cost function, E_smoothMotion smoothing constraint, α and γ For weight, when input data, which is, label data, α value is 1, and when input data is no label data, α value is 0.

6. a kind of semi-supervised light stream learning method for stacking network based on empty convolution according to claim 5, feature It is: the end point error cost function E_epeIt is as follows:

Wherein m and n is respectively the width and height of input picture, u_i,jAnd v_i,jIt Wei not predict obtained light stream value, u '_i,jWith v′_i,jFor corresponding light stream true value；

The data item cost function E_dataIt is as follows:

Wherein κ_xFor the light stream at pixel x, N is pixel number, and T (x) indicates the value of amount of texture at pixel x, and C (x) is indicated The value of Census transformation at pixel x, φ are robust penalty (x²+δ²)^α, δ=0.001, O_xLabeling function is blocked in expression；

The motion smoothing cost function is as follows: