CN112465872A - Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization - Google Patents
- Publication number: CN112465872A (application CN202011454593.0A)
- Authority: CN (China)
- Prior art keywords: optical flow, feature, layer, deformation, frame
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246 (Analysis of motion using feature-based methods, e.g. the tracking of corners or segments)
- G06T5/73
- G06T5/80
- G06T7/13 (Edge detection)
- G06T2207/10016 (Video; Image sequence)
- G06T2207/20016 (Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform)
- G06T2207/20081 (Training; Learning)
- G06T2207/20084 (Artificial neural networks [ANN])
Abstract
The invention discloses an image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization. First, any two consecutive frames of an image sequence are input and decomposed by feature pyramid downsampling into multi-resolution feature maps for both frames. In each pyramid layer, the correlation between the first-frame and second-frame features is computed and used to construct an occlusion mask module; the resulting occlusion mask removes edge artifacts from the warped features, optimizing the optical flow at blurred motion edges. The occlusion-constrained optical flow then drives a secondary deformation optimization module, which further refines the motion-edge optical flow to sub-pixel accuracy. The same occlusion masking and secondary deformation are applied to the warped features in every pyramid layer, yielding a residual flow that refines the optical flow; when the bottom pyramid layer is reached, the final optimized optical flow estimate is output. The method achieves higher accuracy and better applicability on image sequences with motion occlusion, large-displacement motion, and the like.
Description
Technical Field
The invention relates to an image sequence processing technology, in particular to an image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization.
Background
Optical flow is a method that computes the motion of objects between adjacent frames by finding the correspondence between the previous frame and the current frame, using the temporal changes of pixels in an image sequence and the correlation between adjacent frames. Optical flow estimation over an image sequence recovers the geometry and motion of objects in a scene. In recent years, with the rapid development of deep learning theory and technology, convolutional neural network models have been widely applied to optical flow estimation; because of their notable advantages of high computation speed and high stability, optical flow estimation has gradually become a hotspot in image processing and computer vision research. The results are widely applied in higher-level visual tasks such as object detection, object tracking, action recognition, autonomous driving, and three-dimensional reconstruction.
At present, deep learning is the most commonly adopted approach in image sequence optical flow computation; compared with traditional optical flow estimation based on mathematically derived feature matching and iterative minimization of an energy functional, it estimates optical flow more efficiently, quickly, and accurately. However, because objects in an image sequence undergo motion occlusion or large-displacement motion, optical flow estimation still suffers from motion-edge blurring and motion information lost to large displacements; robustness on image sequences containing non-rigid motion and large displacements remains poor, which limits the application of deep-learning-based optical flow estimation in various fields.
Disclosure of Invention
The invention aims to provide, in view of the defects and shortcomings of the prior art, an image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization. The optical flow estimate is refined using the learnable occlusion mask of each pyramid layer and the residual flow obtained by secondary deformation, so as to improve the accuracy and robustness of the pyramid-layered image sequence model in estimating the optical flow at moving-object edges in a scene.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows. An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization comprises the following steps:
1) inputting any two consecutive frames of an image sequence;
2) performing feature pyramid downsampling and layering on the two selected frames to obtain two five-layer multi-resolution feature maps;
3) first computing the correlation between the two frames' features at the highest pyramid layer, then inputting the correlation into the optical flow estimator to compute an initial optical flow;
4) constructing a learnable occlusion mask optimization module from the correlation. The module comprises five consecutively stacked convolution modules; each convolution module Conv comprises a 3 × 3 convolution, batch normalization, and a LeakyReLU activation. Given the correlation, the optical flow, the context feature, and the first-frame feature of the current pyramid layer, the learnable occlusion mask module outputs a single-channel occlusion mask feature; the number of feature channels decreases layer by layer through the consecutive convolution layers (128, 96, 64, 32, and 1, with no activation after the last layer). The calculation formula is as follows:
in the formula: fⅰ 1、Fⅰ 2Respectively represent the characteristics of the first frame and the second frame of the ith-2, 3,4,5,6 th layer(ii) a Wherein x1 and x2 respectively represent the coordinates of the characteristic pixel points of the corresponding first frame and the second frame; if d is less than or equal to 4, setting the maximum displacement of the pixel point to be 4 pixels, and calculating the matching correlation degree, wherein the size of a target search window in the second frame feature is 9 multiplied by 9; the method comprises the following steps of (1) solving an L2 norm on a characteristic channel to carry out normalization processing on the correlation degree; representing the inner product of two features; corr represents the correlation degree between two frames of feature maps; upflowi+1Representing the optical flow after the pyramid upper-layer optical flow is up-sampled by a factor of two, and the size of the optical flow is doubled as the feature scale is increased by one time; warpiRepresenting that the feature of a second frame of the current layer is deformed after upsampling by using the (i + 1) th layer of optical flow to obtain a deformation feature warpFi 2The deformation is beneficial to reducing the matching distance of the feature space, and the deformation and displacement of the pixel points between frames are weakened; corriRepresenting the correlation degree between the two frames of feature maps of the ith layer pyramid; cat represents the cascade connection of a plurality of characteristics on the channel dimension to obtain the multi-scale context cascade connection characteristic xi(ii) a The estimationflow represents an optical flow estimator; x represents the feature after convolution of the last convolution module of the continuous stacking convolution; maskiRepresenting the ith layer of block mask;
the mask of the obtained shielding mask characteristicsiAfter upsampling, activating a function, and then performing inner product on the feature channel dimension with the lower pyramid deformation feature, because some useful optical flow information is masked while removing the edge artifact of the deformation feature, the optical flow information missing from the masked feature needs to be compensated by adding the deconvolved upper pyramid cascade feature containing the missing information, and the optimized deformation feature can be obtained; the canonical constraint of the multi-scale context occlusion mask on the second frame deformation feature can be expressed as:
In the formula: warpF̂_i^2 denotes the feature after constraining the warping artifacts with the occlusion mask; upmask_{i+1} denotes the occlusion mask after two-times upsampling; deconv denotes deconvolution of the upper pyramid context cascade feature x_{i+1}; sigmoid denotes the activation function that thresholds mask_i into an occlusion probability mask in (0, 1);
5) For the occlusion mask module, the current pyramid layer's optical flow, correlation, context feature, and first-frame feature are concatenated as input, yielding an occlusion mask feature map. The lower a pixel's gray value in the occlusion mask feature map, the more the pixel tends to be visible in the first frame and occluded in the second; conversely, the higher the gray value, the more the pixel tends to be occluded in the first frame and visible in the second. Applying the learnable occlusion mask as a constraint on the warped features suppresses image motion-edge blurring;
6) constructing a secondary deformation optimization module from the optical flow estimate. The optical flow is computed from the correlation between the first-frame feature and the unwarped feature or the occlusion-masked warped feature; this flow warps the second-frame feature a second time. The flow, the warped feature, and the first-frame feature are concatenated along the feature dimension and passed through five consecutive convolution modules whose channel counts decrease layer by layer (128, 96, 64, 32, and 2, with no activation after the last layer), outputting a two-channel residual flow. The sum of the residual flow and the current layer's optical flow is the final optical flow estimate with optimized motion edges and large-displacement motion. The calculation formula is as follows:
In the formula: flow_j denotes the optical flow of the j-th layer (j = 2, 3, 4, 5, 6) obtained by computing the correlation with the occlusion-masked features (warping errors suppressed) and passing it through the optical flow computation layer; warpF̃_j^2 denotes the second-frame feature after the secondary warp with the first-warp-optimized flow_j; residualflow_j is the j-th layer residual optical flow; feat_j is the j-th layer context cascade feature; finalflow_j is the j-th layer predicted optical flow;
7) For the secondary deformation residual flow optimization module, the first-frame feature, the secondarily warped feature, and the current pyramid layer's flow are concatenated as input, yielding the residual flow. The first warp gives a pixel-level flow estimate and the second a sub-pixel-level one; the secondary residual flow carries rich contour information of moving objects, further compensating the flow field, guiding flow learning, optimizing image motion edges, reducing the matching distance in feature space, and recovering flow information lost to large-displacement motion;
8) The same occlusion mask constraint and secondary-deformation residual flow computation are performed in each pyramid layer, and the sum of the mask-regularized flow and the secondary-deformation residual flow is taken as the motion-edge-optimized flow estimate finalflow_j. When the bottom pyramid layer is reached, the final dense optical flow estimate with motion-edge optimization is output, from which rich motion information and object geometry can be obtained.
The image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization uses the learnable occlusion mask to remove edge artifacts of the warped image caused by motion occlusion, and uses the residual flow produced by secondary deformation to correct blurred motion edges and the optical flow information of moving objects lost to large-displacement motion.
Drawings
FIG. 1 is the first frame of the KITTI2015 training image sequence in an embodiment of the present invention;
FIG. 2 is the second frame of the KITTI2015 training image sequence in an embodiment of the present invention;
FIG. 3 is a block diagram of the pyramid-based hierarchical optical flow estimation model of an embodiment;
FIG. 4 is the occlusion mask feature map of the KITTI2015 training image sequence computed by the embodiment;
FIG. 5 is the residual flow map from the secondary deformation estimation of the KITTI2015 training image sequence computed by the embodiment;
FIG. 6 is the optical flow map of the final optimized image motion edges of the KITTI2015 training image sequence computed by the embodiment.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Referring to FIGS. 1 to 6, an embodiment of the image sequence optical flow computation method based on a learnable occlusion mask and secondary deformation optimization is illustrated below, using an optical flow computation experiment on the KITTI2015 training image sequence:
it comprises the following steps:
firstly, inputting the first frame and the second frame of the KITTI2015 training image sequence (as shown in FIG. 1 and FIG. 2);
secondly, performing feature pyramid downsampling and layering on the input KITTI2015 training image sequence (as shown in FIG. 3); I_t denotes the first frame and I_{t+1} the second frame of the KITTI2015 training image sequence. A convolutional neural network feature pyramid is established, comprising five consecutive convolution modules with channel counts of 128, 64, and 64 for the consecutive convolution layers. The first frame I_t and the second frame I_{t+1} are concatenated and input into the feature pyramid, which extracts pyramid features from the two frames to give two five-layer multi-resolution feature maps; each lower pyramid layer's feature map has twice the resolution of the layer above it;
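As a toy illustration of the pyramid layering, the following self-contained NumPy sketch uses plain 2 × 2 average pooling in place of the patent's learned stride-2 convolutions; the image size and channel count are made up, and only the halving of resolution per level matches the description above:

```python
import numpy as np

def downsample2x(feat: np.ndarray) -> np.ndarray:
    """Halve spatial resolution of an (H, W, C) map by 2x2 average pooling."""
    h, w, c = feat.shape
    return feat[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def feature_pyramid(image: np.ndarray, levels: int = 5):
    """Build a five-level pyramid; index 0 is the finest (bottom) level."""
    pyramid = [image.astype(np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

frame = np.random.rand(64, 96, 3)       # stand-in for one input frame's features
pyr = feature_pyramid(frame)
print([p.shape[:2] for p in pyr])       # [(64, 96), (32, 48), (16, 24), (8, 12), (4, 6)]
```

Each level's resolution is half the one below it, matching the statement that the lower pyramid layer has twice the resolution of the layer above.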
thirdly, computing the correlation of the consecutive frames' features at the highest layer of the feature pyramid, then inputting the correlation into the optical flow estimator to compute the initial optical flow (as shown in FIG. 3);
fourthly, constructing a learnable occlusion mask estimation module from the correlation (shown as a dashed box in FIG. 3). The occlusion mask estimation module comprises five consecutively stacked convolution modules, each comprising a 3 × 3 convolution, batch normalization, and a LeakyReLU activation. The correlation, optical flow, context cascade feature, and first-frame feature of the current pyramid layer are concatenated along the feature channel dimension and input into the consecutive convolution layers of the occlusion mask estimator; the feature channel counts decrease layer by layer (128, 96, 64, 32, and 1, with no activation after the last layer), outputting a single-channel occlusion mask feature; the calculation formula is as follows:
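The formula images did not survive extraction. Read together with the symbol definitions that follow, a plausible reconstruction of the missing equations (an assumption about their exact layout, not the patent's verbatim formulas) is:

```latex
\begin{aligned}
warpF_i^2(x) &= F_i^2\bigl(x + upflow_{i+1}(x)\bigr), \qquad
upflow_{i+1} = 2\,\mathrm{up}_2\bigl(flow_{i+1}\bigr),\\[2pt]
corr_i(x_1, x_2) &= \Bigl\langle \tfrac{F_i^1(x_1)}{\lVert F_i^1(x_1)\rVert_2},\;
\tfrac{warpF_i^2(x_2)}{\lVert warpF_i^2(x_2)\rVert_2} \Bigr\rangle,\qquad
\lVert x_1 - x_2 \rVert_\infty \le d = 4,\\[2pt]
x_i &= \mathrm{cat}\bigl(corr_i,\ upflow_{i+1},\ F_i^1,\ \ldots\bigr),\qquad
flow_i = \mathrm{estimationflow}(x_i),\\[2pt]
mask_i &= \mathrm{Conv}_{1}\circ\mathrm{Conv}_{32}\circ\mathrm{Conv}_{64}\circ
\mathrm{Conv}_{96}\circ\mathrm{Conv}_{128}(x_i).
\end{aligned}
```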
in the formula: fⅰ 1、Fⅰ 2The i-th ═ {2,3,4,5,6} layer characteristics of the first frame and the second frame are represented respectively; wherein x1 and x2 respectively represent the coordinates of the characteristic pixel points of the corresponding first frame and the second frame; if d is less than or equal to 4, setting the maximum displacement of the pixel point to be 4 pixels, and calculating the matching correlation degree, wherein the size of a target search window in the second frame is 9 multiplied by 9; the method comprises the following steps of (1) solving an L2 norm on a characteristic channel to carry out normalization processing on the correlation degree; corr represents the correlation degree between two frames of feature maps; upflowi+1Representing the optical flow after the pyramid upper layer optical flow is up-sampled by a factor of two, and the vector size of the pyramid upper layer optical flow is also doubled along with the increase of the optical flow scale; warpiRepresenting that the feature of a second frame of the current layer is deformed after upsampling by using the (i + 1) th layer of optical flow to obtain a deformation feature warpFi 2The deformation is beneficial to reducing the matching distance of the feature space, and the deformation and displacement of the pixel points between frames are weakened; corriRepresenting the correlation degree between the two frames of feature maps of the ith layer pyramid; cat represents the cascade connection of a plurality of characteristics on the channel dimension to obtain the multi-scale context cascade connection characteristic xi(ii) a The estimationflow represents an optical flow estimator; x represents the feature after convolution of the last convolution module of the continuous stacking convolution; maskiRepresenting the ith layer of block mask;
The obtained occlusion mask feature mask_i is upsampled, passed through the activation function, and then multiplied elementwise with the lower pyramid layer's warped feature along the feature channel dimension. Because removing the edge artifacts of the warped feature also masks some useful optical flow information, the information missing from the masked feature is compensated by the deconvolved upper-layer pyramid cascade feature that contains it, giving the optimized warped feature (shown in the occlusion mask estimation module in the dashed box of FIG. 3). The regularization constraint of the multi-scale context occlusion mask on the second-frame warped feature can be expressed as:
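The constraint's equation image is likewise missing; from the symbols explained below, it plausibly takes the form (a reconstruction, not the patent's verbatim formula):

```latex
\widehat{warpF}_i^{\,2} \;=\; \operatorname{sigmoid}\bigl(upmask_{i+1}\bigr)\odot warpF_i^2
\;+\; \operatorname{deconv}\bigl(x_{i+1}\bigr),
\qquad upmask_{i+1} = \mathrm{up}_2\bigl(mask_{i+1}\bigr).
```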
In the formula: warpF̂_i^2 denotes the feature after constraining the warping artifacts with the occlusion mask; upmask_{i+1} denotes the occlusion mask after two-times upsampling; deconv denotes deconvolution of the upper pyramid context cascade feature x_{i+1}; sigmoid denotes the activation function that thresholds mask_i into an occlusion probability mask in (0, 1);
fifthly, in each pyramid layer, the current layer's optical flow, feature-map correlation, context cascade feature, and first-frame feature are concatenated and input into the learnable occlusion mask estimation module; each pyramid layer outputs a learnable occlusion mask feature that regularizes the warped feature of the layer below, continuously correcting flow errors. When the pyramid bottom layer is reached, the occlusion mask feature map of the KITTI2015 training image sequence is obtained (as shown in FIG. 4). In the occlusion mask feature map, the lower a pixel's gray value, the more the pixel tends to be visible in the first frame and occluded in the second; conversely, the higher the gray value, the more the pixel tends to be occluded in the first frame and visible in the second. Applying the learnable occlusion mask as a constraint on the warped features suppresses blurred motion edges in the estimated optical flow;
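Numerically, the gating-plus-compensation step can be sketched in pure NumPy (all shapes are made up): sigmoid gating suppresses warped features at likely-occluded pixels, and the deconvolved upper-layer context term restores the information the gate removed.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def apply_occlusion_mask(warped: np.ndarray, up_mask: np.ndarray,
                         deconv_ctx: np.ndarray) -> np.ndarray:
    """Gate (H, W, C) warped features with a single-channel mask of logits,
    then add the upper-layer context features to compensate masked-out flow info."""
    return sigmoid(up_mask)[..., None] * warped + deconv_ctx

warped = np.ones((4, 4, 8))              # warped second-frame features
mask_logits = np.full((4, 4), -20.0)     # strongly "occluded" everywhere
ctx = 0.5 * np.ones((4, 4, 8))           # stand-in for the deconvolved context
out = apply_occlusion_mask(warped, mask_logits, ctx)
print(np.allclose(out, 0.5, atol=1e-6))  # True: gated to ~0, context restored
```

With large negative logits, the sigmoid gate drives the warped contribution to nearly zero and only the context compensation survives, mirroring the text's description of masking plus compensation.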
sixthly, constructing a secondary deformation residual flow estimation module from the optical flow estimate. The secondary deformation residual flow estimator comprises five consecutive convolution modules whose channel counts decrease layer by layer (128, 96, 64, 32, and 2, with no activation after the last layer). In the first secondary deformation residual flow estimation module (shown as the dashed box at the lower left of FIG. 3), the initial flow computed at the topmost pyramid layer directly warps the second-frame feature through the feature warper; the flow, warped feature, and first-frame feature are then concatenated along the feature channel dimension and input into the secondary deformation residual flow estimator to obtain the residual flow. The sum of the residual flow and the initial flow is taken as the optimized flow estimate of the top pyramid layer. The second-frame feature warped by the topmost optimized flow is input into the learnable occlusion mask estimation module to obtain the regularized warped feature; the correlation between this occlusion-masked warped feature and the first-frame feature is input into the optical flow estimator to compute the mask-constrained optimized flow, which warps the second-frame feature a second time. The flow, warped feature, and first-frame feature are concatenated along the feature channel dimension and input into the second secondary deformation residual flow estimation module (shown as the dashed box at the lower right of FIG. 3) to obtain the secondary residual flow. The sum of the secondary residual flow and the current layer's flow is taken as the final flow estimate with optimized motion edges and large-displacement motion; the calculation formula is as follows:
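The residual-flow formula images did not survive extraction either; consistent with the symbol list that follows, a plausible reconstruction (an assumption, not the patent's verbatim equations) is:

```latex
\begin{aligned}
flow_j &= \mathrm{estimationflow}\bigl(corr(F_j^1,\ \widehat{warpF}_j^{\,2})\bigr),\\
\widetilde{warpF}_j^{\,2}(x) &= F_j^2\bigl(x + flow_j(x)\bigr),\\
residualflow_j &= \mathrm{Conv}_{2}\circ\mathrm{Conv}_{32}\circ\mathrm{Conv}_{64}\circ
\mathrm{Conv}_{96}\circ\mathrm{Conv}_{128}
\bigl(\mathrm{cat}(flow_j,\ \widetilde{warpF}_j^{\,2},\ F_j^1,\ feat_j)\bigr),\\
finalflow_j &= flow_j + residualflow_j.
\end{aligned}
```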
In the formula: flow_j denotes the optical flow estimate of the j-th layer (j = 2, 3, 4, 5, 6), obtained by computing the correlation with the occlusion-mask-constrained features and inputting it into the optical flow estimator; warpF̃_j^2 denotes the second-frame feature after the secondary warp with the first-warp-optimized flow_j; residualflow_j is the j-th layer residual flow; feat_j is the j-th layer context cascade feature; finalflow_j is the j-th layer predicted optical flow;
seventhly, in each pyramid layer, the first-frame feature, the secondarily warped feature, and the current layer's flow are concatenated and input into the secondary deformation residual flow estimation module; each pyramid layer outputs a residual flow that further optimizes the occlusion-mask-constrained flow. When the pyramid bottom layer is reached, the last layer's residual flow estimate is obtained (as shown in FIG. 5). The first warp gives a pixel-level flow estimate and the second a sub-pixel-level one; the secondary residual flow carries rich contour information of moving objects, further compensating the flow field, guiding flow learning, optimizing image motion edges, reducing the matching distance in feature space, and recovering flow information lost to large-displacement motion;
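Both deformation passes are backward warps of the second-frame features by the current flow. A minimal NumPy bilinear sampler, border-clamped (the network itself would presumably use a differentiable grid sampler), looks like:

```python
import numpy as np

def warp(feat: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp feat (H, W, C) by flow (H, W, 2); bilinear, border-clamped."""
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)            # sample locations in frame 2
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

f = np.arange(12.0).reshape(3, 4, 1)                    # tiny single-channel "feature"
shift = np.zeros((3, 4, 2))
shift[..., 0] = 1.0                                     # sample one pixel to the right
out = warp(f, shift)
print(out[0, 0, 0])                                     # 1.0
```

A flow of (+1, 0) makes every output pixel take the value one column to its right, which is the backward-warping convention the pyramid layers rely on.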
eighthly, the same occlusion mask constraint and secondary-deformation residual flow computation are performed in each pyramid layer, and the sum of the mask-regularized flow and the secondary-deformation residual flow is taken as the motion-edge-optimized flow estimate finalflow_j. When the pyramid bottom layer is reached, the final dense optical flow estimate with motion-edge optimization for the KITTI2015 training image sequence is output (as shown in FIG. 6). The larger a pixel's gray value in the dense flow estimate, the larger its optical flow value and the faster its relative motion; conversely, the smaller the gray value, the smaller the flow value and the slower the relative motion. Rich motion information and object geometry are obtained from the flow estimate, which can be effectively applied to higher-level visual tasks.
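Putting the pieces together, the coarse-to-fine control flow of steps three through eight can be sketched with stand-in estimators. All function names, shapes, and the zero-valued stubs below are illustrative, not the patent's learned modules; only the loop structure (upsample, occlusion-mask refine, re-estimate, add residual) follows the description above.

```python
import numpy as np

def upsample2x_flow(flow: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsample; flow vectors double because the scale doubles."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine(pyr1, pyr2, estimate_flow, occlusion_refine, residual_refine):
    """Top-down refinement over fine-to-coarse pyramid lists (index 0 = finest).
    The three callables stand in for the patent's learned estimators."""
    flow = estimate_flow(pyr1[-1], pyr2[-1], None)        # initial flow, coarsest level
    for f1, f2 in zip(reversed(pyr1[:-1]), reversed(pyr2[:-1])):
        up = upsample2x_flow(flow)                        # upflow from the layer above
        f2_masked = occlusion_refine(f1, f2, up)          # occlusion-mask constraint
        flow = estimate_flow(f1, f2_masked, up)           # mask-constrained flow
        flow = flow + residual_refine(f1, f2, flow)       # secondary-deformation residual
    return flow

# Zero-valued stubs just to exercise the control flow.
pyr1 = [np.zeros((8, 8, 4)), np.zeros((4, 4, 4)), np.zeros((2, 2, 4))]
pyr2 = [p.copy() for p in pyr1]
ef = lambda f1, f2, up: np.zeros(f1.shape[:2] + (2,))
final = coarse_to_fine(pyr1, pyr2, ef,
                       lambda f1, f2, up: f2,
                       lambda f1, f2, fl: np.zeros_like(fl))
print(final.shape)                                        # (8, 8, 2)
```

The output flow field has the resolution of the finest pyramid level and two channels (horizontal and vertical displacement), matching the dense flow map of FIG. 6.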
The above description is only for the purpose of illustrating the technical solutions of the present invention and not for the purpose of limiting the same, and other modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the scope of the claims of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (1)
1. An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization comprises the following steps:
1) inputting any two consecutive frames of an image sequence;
2) performing feature pyramid downsampling and layering on the two selected frames to obtain two five-layer multi-resolution feature maps;
3) first computing the correlation between the two frames' features at the highest pyramid layer, then inputting the correlation into the optical flow estimator to compute an initial optical flow;
4) constructing a learnable occlusion mask optimization module from the correlation. The module comprises five consecutively stacked convolution modules; each convolution module Conv comprises a 3 × 3 convolution, batch normalization, and a LeakyReLU activation. Given the correlation, the optical flow, the context cascade feature, and the first-frame feature of the current pyramid layer, the learnable occlusion mask module outputs a single-channel occlusion mask feature; the feature channel counts decrease layer by layer through the consecutive convolution layers (128, 96, 64, 32, and 1, with no activation after the last layer); the calculation formula is as follows:
in the formula: fⅰ 1、Fⅰ 2The i-th ═ {2,3,4,5,6} layer characteristics of the first frame and the second frame are represented respectively; wherein x1 and x2 respectively represent the coordinates of the characteristic pixel points of the corresponding first frame and the second frame; if d is less than or equal to 4, setting the maximum displacement of the pixel point to be 4 pixels, and calculating the matching correlation degree, wherein the size of a target search window in the second frame feature is 9 multiplied by 9; the method comprises the following steps of (1) solving an L2 norm on a characteristic channel to carry out normalization processing on the correlation degree; representing the inner product of two features; corr represents the correlation degree between two frames of feature maps; upflowi+1Representing the optical flow after the pyramid upper-layer optical flow is up-sampled by a factor of two, and the size of the optical flow is doubled as the feature scale is increased by one time; warpiRepresenting that the feature of a second frame of the current layer is deformed after upsampling by using the (i + 1) th layer of optical flow to obtain a deformation feature warpFi 2The deformation contributing to a reduction in feature spaceMatching distance, and weakening deformation and displacement of pixel points between frames; corriRepresenting the correlation degree between the two frames of feature maps of the ith layer pyramid; cat represents the cascade connection of a plurality of characteristics on the channel dimension to obtain the multi-scale context cascade connection characteristic xi(ii) a The estimationflow represents an optical flow estimator; x represents the feature after convolution of the last convolution module of the continuous stacking convolution; maskiRepresenting the ith layer of block mask features;
the obtained occlusion mask feature mask_i is upsampled and passed through an activation function, then combined by inner product along the feature channel dimension with the lower pyramid layer's deformation feature; because removing the edge artifacts of the deformation feature also masks out some useful optical flow information, the deconvolved upper-pyramid cascade feature containing the missing information is added to compensate for the optical flow information lost to masking, yielding the optimized deformation feature; the regularization constraint of the multi-scale context occlusion mask on the second-frame deformation feature can be expressed as:
in the formula: the constrained feature denotes the deformation feature after its artifacts are suppressed by the occlusion mask; upmask_{i+1} denotes the occlusion mask after two-fold upsampling; deconv denotes deconvolution of the upper-pyramid context cascade feature x_{i+1}; sigmoid denotes the activation function that maps the occlusion mask mask_i to an occlusion probability mask with values in (0, 1);
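The masking-plus-compensation step can be sketched as follows. Here the sigmoid-squashed mask multiplies the warped feature channel-wise, and `context_feat` stands in for the already-deconvolved upper-pyramid cascade feature; the function name and the additive form of the compensation are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_occlusion_mask(warped_feat, mask_logits, context_feat):
    """Sketch of the regularization constraint of step 4): the mask logits
    are squashed into an occlusion probability in (0, 1), multiplied into
    the warped second-frame feature to suppress deformation artifacts, and
    the upper-pyramid context feature is added back to compensate for
    useful flow information removed by the mask."""
    prob = sigmoid(mask_logits)        # (H, W, 1) occlusion probability
    constrained = warped_feat * prob   # suppress warping edge artifacts
    return constrained + context_feat  # re-inject masked-out information

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
out = apply_occlusion_mask(rng.standard_normal((H, W, C)),
                           rng.standard_normal((H, W, 1)),
                           rng.standard_normal((H, W, C)))
print(out.shape)  # (8, 8, 16)
```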
5) feeding the occlusion mask module the superposition of the current pyramid layer's optical flow, correlation, context cascade feature and first-frame feature to obtain the occlusion mask feature map; the lower the gray value of a pixel in the occlusion mask feature map, the more that pixel tends to be visible in the first frame and occluded in the second; conversely, the higher the gray value, the more the pixel tends to be occluded in the first frame and visible in the second; applying the learnable occlusion mask as a constraint on the deformation feature suppresses blurring at image motion edges;
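The 128 → 96 → 64 → 32 → 1 channel schedule of the mask module in steps 4)-5) can be sketched with per-pixel (1 × 1) linear layers. This is an assumption-laden stand-in: the patent specifies 3 × 3 convolutions with batch normalization, both simplified away here; only the channel widths, the Leaky ReLU, and the activation-free last layer follow the claim.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def occlusion_mask_head(features, seed=0):
    """Five stacked blocks with channel widths 128, 96, 64, 32, 1.
    1x1 convolutions (per-pixel matmuls) stand in for the learned 3x3
    convolutions; the final layer has no activation, per the claim."""
    rng = np.random.default_rng(seed)
    widths = [128, 96, 64, 32, 1]
    x = features
    for i, w in enumerate(widths):
        kernel = rng.standard_normal((x.shape[-1], w)) * 0.01
        x = x @ kernel
        if i < len(widths) - 1:  # last layer: no activation
            x = leaky_relu(x)
    return x  # single-channel occlusion mask logits

feat = np.random.default_rng(1).standard_normal((8, 8, 96))
mask_logits = occlusion_mask_head(feat)
print(mask_logits.shape)  # (8, 8, 1)
```

The step-6) residual flow head differs only in its last width (2 output channels instead of 1).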
6) constructing a secondary deformation optimization module from the optical flow estimation; computing the optical flow from the correlation between the first-frame feature and the undeformed feature or the occlusion-masked deformation feature, using this optical flow to deform the second-frame feature a second time, superposing the optical flow, the deformation feature and the first-frame feature along the feature dimension, and passing them through five consecutive convolution modules whose channel counts decrease as 128, 96, 64, 32 and 2, the last layer having no activation function and outputting a two-channel residual flow; the sum of the residual flow and the current layer's optical flow is the final optical flow estimation with optimized image motion edges and large-displacement motion; the calculation formula is as follows:
in the formula: flow_j denotes the optical flow estimation at layer j (j = {2,3,4,5,6}), obtained by computing the correlation from the occlusion-masked features with deformation errors suppressed, then passing it through the optical flow calculation layer; the secondary deformation feature denotes the second-frame feature warped a second time with the first-deformation-optimized optical flow flow_j; residualflow_j is the j-th layer residual optical flow; feat_j is the j-th layer context cascade feature; finalflow_j is the predicted optical flow at layer j;
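Both deformations rely on backward warping: sampling the second-frame feature at x + flow(x) so it aligns with the first frame. A bilinear numpy sketch is below; the patent does not spell out the interpolation scheme, so bilinear sampling with border clamping is an assumption (it is the usual choice in pyramid flow networks).

```python
import numpy as np

def backward_warp(feat, flow):
    """Bilinearly sample `feat` at positions displaced by `flow`
    (flow[..., 0] = x displacement, flow[..., 1] = y displacement),
    clamping samples to the image border."""
    h, w = feat.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    return (feat[y0, x0] * (1 - wx) * (1 - wy)
            + feat[y0, x1] * wx * (1 - wy)
            + feat[y1, x0] * (1 - wx) * wy
            + feat[y1, x1] * wx * wy)

feat = np.random.default_rng(0).standard_normal((8, 8, 4))
zero_flow = np.zeros((8, 8, 2))
# Warping with zero flow must return the feature unchanged.
assert np.allclose(backward_warp(feat, zero_flow), feat)
```

The first warp (step 4) gives a pixel-level alignment; the second warp (step 6) refines it to sub-pixel level using the mask-regularized flow.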
7) feeding the secondary deformation residual flow optimization module the superposition of the first-frame feature, the secondarily deformed feature and the current pyramid layer's optical flow to obtain the residual flow; the first deformation yields a pixel-level optical flow estimation and the second a sub-pixel-level one; the secondary residual flow contains rich contour information of moving objects, further compensating the optical flow field, guiding optical flow learning, optimizing image motion edges, reducing the matching distance in feature space, and making up for optical flow information lost to large-displacement motion;
8) performing the same occlusion mask constraint and secondary deformation residual flow calculation at each pyramid layer, taking the sum of the occlusion-mask-regularized optical flow and the secondary deformation residual flow as the motion-edge-optimized optical flow estimation finalflow_j; on reaching the bottom pyramid layer, outputting the final motion-edge-optimized dense optical flow estimation of the image sequence, from which rich motion information and object geometry can be obtained.
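The coarse-to-fine accumulation of step 8) can be sketched as follows, with the per-level flows and residuals given as inputs rather than estimated. Nearest-neighbour upsampling and the doubling of flow magnitude per level follow the `upflow` description in step 4); `coarse_to_fine` is an illustrative name.

```python
import numpy as np

def upsample_flow(flow):
    """Upsample a flow field by a factor of two: nearest-neighbour on the
    grid, vectors doubled because displacements scale with resolution."""
    up = flow.repeat(2, axis=0).repeat(2, axis=1)
    return 2.0 * up

def coarse_to_fine(layer_flows, layer_residuals):
    """Each level's output is the mask-regularized flow plus the secondary
    deformation residual flow; the result seeds the next finer level after
    2x upsampling. Lists are ordered coarsest to finest."""
    final = layer_flows[0] + layer_residuals[0]
    for flow, residual in zip(layer_flows[1:], layer_residuals[1:]):
        final = upsample_flow(final) + flow + residual
    return final

flows = [np.zeros((4 * 2 ** i, 4 * 2 ** i, 2)) for i in range(3)]
residuals = [np.ones((4 * 2 ** i, 4 * 2 ** i, 2)) for i in range(3)]
out = coarse_to_fine(flows, residuals)
print(out.shape)  # (16, 16, 2)
```

With zero flows and unit residuals the accumulated flow is (1·2 + 1)·2 + 1 = 7 everywhere, illustrating how coarse-level displacements are amplified as resolution grows.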
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011454593.0A CN112465872B (en) | 2020-12-10 | 2020-12-10 | Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112465872A true CN112465872A (en) | 2021-03-09 |
CN112465872B CN112465872B (en) | 2022-08-26 |
Family
ID=74801930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011454593.0A Active CN112465872B (en) | 2020-12-10 | 2020-12-10 | Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112465872B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926498A (en) * | 2022-04-26 | 2022-08-19 | 电子科技大学 | Rapid target tracking method based on space-time constraint and learnable feature matching |
CN115937906A (en) * | 2023-02-16 | 2023-04-07 | 武汉图科智能科技有限公司 | Occlusion scene pedestrian re-identification method based on occlusion inhibition and feature reconstruction |
Citations (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070092122A1 (en) * | 2005-09-15 | 2007-04-26 | Jiangjian Xiao | Method and system for segment-based optical flow estimation |
US20070258707A1 (en) * | 2006-05-08 | 2007-11-08 | Ramesh Raskar | Method and apparatus for deblurring images |
CN107220596A (en) * | 2017-05-11 | 2017-09-29 | 西安电子科技大学 | Estimation method of human posture based on cascade mechanism for correcting errors |
CN107527358A (en) * | 2017-08-23 | 2017-12-29 | 北京图森未来科技有限公司 | A kind of dense optical flow method of estimation and device |
CN107862706A (en) * | 2017-11-01 | 2018-03-30 | 天津大学 | A kind of improvement optical flow field model algorithm of feature based vector |
CN108010061A (en) * | 2017-12-19 | 2018-05-08 | 湖南丹尼尔智能科技有限公司 | A kind of deep learning light stream method of estimation instructed based on moving boundaries |
US20180324465A1 (en) * | 2017-05-05 | 2018-11-08 | Disney Enterprises, Inc. | Edge-aware spatio-temporal filtering and optical flow estimation in real time |
CN108776971A (en) * | 2018-06-04 | 2018-11-09 | 南昌航空大学 | A kind of variation light stream based on layering nearest-neighbor determines method and system |
CN109086807A (en) * | 2018-07-16 | 2018-12-25 | 哈尔滨工程大学 | A kind of semi-supervised light stream learning method stacking network based on empty convolution |
WO2019005170A1 (en) * | 2017-06-30 | 2019-01-03 | Intel Corporation | Systems, methods, and apparatuses for implementing dynamic learning mask correction for resolution enhancement and optical proximity correction (opc) of lithography masks |
US20190333198A1 (en) * | 2018-04-25 | 2019-10-31 | Adobe Inc. | Training and utilizing an image exposure transformation neural network to generate a long-exposure image from a single short-exposure image |
CA3085303A1 (en) * | 2018-05-17 | 2019-11-21 | The United States Of America, Department Of Health And Human Services, National Institutes Of Health | Method and system for automatically generating and analyzing fully quantitative pixel-wise myocardial blood flow and myocardial perfusion reserve maps to detect ischemic heart disease using cardiac perfusion magnetic resonance imaging |
WO2020088766A1 (en) * | 2018-10-31 | 2020-05-07 | Toyota Motor Europe | Methods for optical flow estimation |
CN111311490A (en) * | 2020-01-20 | 2020-06-19 | 陕西师范大学 | Video super-resolution reconstruction method based on multi-frame fusion optical flow |
CN111340844A (en) * | 2020-02-24 | 2020-06-26 | 南昌航空大学 | Multi-scale feature optical flow learning calculation method based on self-attention mechanism |
US20200211206A1 (en) * | 2018-12-27 | 2020-07-02 | Baidu Usa Llc | Joint learning of geometry and motion with three-dimensional holistic understanding |
CN111402292A (en) * | 2020-03-10 | 2020-07-10 | 南昌航空大学 | Image sequence optical flow calculation method based on characteristic deformation error occlusion detection |
CN111462191A (en) * | 2020-04-23 | 2020-07-28 | 武汉大学 | Non-local filter unsupervised optical flow estimation method based on deep learning |
CN111476715A (en) * | 2020-04-03 | 2020-07-31 | 三峡大学 | Lagrange video motion amplification method based on image deformation technology |
CN111582483A (en) * | 2020-05-14 | 2020-08-25 | 哈尔滨工程大学 | Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism |
CN111612825A (en) * | 2020-06-28 | 2020-09-01 | 南昌航空大学 | Image sequence motion occlusion detection method based on optical flow and multi-scale context |
CN111626308A (en) * | 2020-04-22 | 2020-09-04 | 上海交通大学 | Real-time optical flow estimation method based on lightweight convolutional neural network |
Non-Patent Citations (8)
Title |
---|
CHEN CHANG LOY et al.: "LiteFlowNet3: Resolving Correspondence Ambiguity for More Accurate Optical Flow Estimation", European Conference on Computer Vision * |
DENIS FORTUN et al.: "Optical flow modeling and computation: A survey", Computer Vision and Image Understanding * |
SHENGYU ZHAO et al.: "MaskFlownet: Asymmetric Feature Matching With Learnable Occlusion Mask", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
HE Yan: "Overview of optical flow computation methods for image sequences", Computer Knowledge and Technology (电脑知识与技术) * |
ZHANG Congxuan et al.: "Research progress of deep learning optical flow computation", Acta Electronica Sinica (电子学报) * |
WANG Mingrun: "Research on non-local constraint variational optical flow computation based on occlusion detection", China Master's Theses Full-text Database (中国优秀硕士学位论文全文数据库) * |
WANG Feng et al.: "Maritime video stabilization using stationary optical flow estimation", Journal of Image and Graphics (中国图象图形学报) * |
CHEN Zhen et al.: "Sparse-to-dense large displacement optical flow estimation based on deep matching", Acta Automatica Sinica (自动化学报) * |
Also Published As
Publication number | Publication date |
---|---|
CN112465872B (en) | 2022-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111882002B (en) | MSF-AM-based low-illumination target detection method | |
CN111340844B (en) | Multi-scale characteristic optical flow learning calculation method based on self-attention mechanism | |
CN112651973A (en) | Semantic segmentation method based on cascade of feature pyramid attention and mixed attention | |
CN110942471B (en) | Long-term target tracking method based on space-time constraint | |
CN112465872B (en) | Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization | |
CN111696110B (en) | Scene segmentation method and system | |
CN113343778B (en) | Lane line detection method and system based on LaneSegNet | |
CN112001391A (en) | Image feature fusion image semantic segmentation method | |
CN111612825A (en) | Image sequence motion occlusion detection method based on optical flow and multi-scale context | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN111402292A (en) | Image sequence optical flow calculation method based on characteristic deformation error occlusion detection | |
CN116310098A (en) | Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network | |
Cho et al. | Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation | |
CN116681978A (en) | Attention mechanism and multi-scale feature fusion-based saliency target detection method | |
US20230260247A1 (en) | System and method for dual-value attention and instance boundary aware regression in computer vision system | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
Guo et al. | Semantic image segmentation based on SegNetWithCRFs | |
CN115115860A (en) | Image feature point detection matching network based on deep learning | |
CN115620118A (en) | Saliency target detection method based on multi-scale expansion convolutional neural network | |
CN114494284A (en) | Scene analysis model and method based on explicit supervision area relation | |
CN113255459A (en) | Image sequence-based lane line detection method | |
CN113538527A (en) | Efficient lightweight optical flow estimation method | |
CN111986233A (en) | Large-scene minimum target remote sensing video tracking method based on feature self-learning | |
Lee et al. | Where to look: Visual attention estimation in road scene video for safe driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||