WO2020088766A1 - Methods for optical flow estimation - Google Patents

Methods for optical flow estimation

Info

Publication number
WO2020088766A1
Authority
WO
WIPO (PCT)
Prior art keywords
optical flow
occlusion
estimator
image frames
estimated
Prior art date
Application number
PCT/EP2018/079903
Other languages
French (fr)
Inventor
Nikolay CHUMERIN
Michal NEORAL
Jan Sochman
Jirí MATAS
Original Assignee
Toyota Motor Europe
Czech Technical University
Priority date
Filing date
Publication date
Application filed by Toyota Motor Europe and Czech Technical University
Priority to JP2021547880A (JP7228172B2)
Priority to PCT/EP2018/079903
Publication of WO2020088766A1

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0253Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle


Abstract

A method for processing a plurality of image frames to determine an optical flow estimation of one or more pixels is provided. The method includes providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames, estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features, generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions, providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder, and generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.

Description

METHODS FOR OPTICAL FLOW ESTIMATION
FIELD OF THE DISCLOSURE
[0001] The present invention relates to systems and methods for image processing, and more particularly to a neural network implemented optical flow estimation method.
BACKGROUND OF THE DISCLOSURE
[0002] Optical flow is a two-dimensional displacement field describing the projection of scene motion between two or more images. Occlusions caused by scene motion or other factors complicate optical flow estimation, i.e., at occluded pixels no visual correspondences exist.
[0003] Optical flow estimation is a core computer vision problem and has many applications, e.g., action recognition, autonomous driving, and video editing.
[0004] Previously performed methods that have not used convolutional neural networks (CNN) addressed this problem by using regularization which extrapolated the optical flow from surrounding, non-occluded areas.
[0005] In current state-of-the-art CNN based algorithms the
regularization is only implicit and the network learns how much reliance may be placed on identified correspondences and how much to extrapolate.
[0006] Previous approaches dealing with occlusions more directly have first estimated initial forward and backward optical flows, with occlusions being identified using the forward-backward consistency check. Occlusion maps are then used for estimation of the final optical flow.
[0007] Further, according to some previous solutions, three frames, with the middle frame as the reference frame, have been used to define a coordinate system for loss computation. The forward flow to the future frame and the backward flow to the past frame are then calculated and applied to enable some regularization of these two optical flows.
[0008] Yang et al., "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," CVPR 2018, discloses a CNN model for generation of an estimated optical flow. However, no consideration for how to treat occlusions is discussed.
[0009] Meister et al., "Unflow: Unsupervised Learning of Optical Flow With a Bidirectional Census Loss," AAAI 2018, discloses the use of bidirectional flow estimation for handling occlusions in the optical flow estimation.
SUMMARY OF THE DISCLOSURE
[0010] The present inventors have determined that in the prior methods, occlusions affect the initial optical flow estimation from the very outset of the analysis, and thus, the final solution is adversely affected by failing to consider the initial effect caused by the occlusions.
[0011] In addition, the inventors have recognized that by feeding back previous estimated optical flows to a current occlusion/flow analysis, a CNN is able to learn typical relations between the previous and current time step optical flow, and therefore, allow the network to use these relations in the time step undergoing occlusion/flow estimation.
[0012] Further, optical flow estimation over more than two frames results in a need for pixels to be mapped to a reference coordinate system for loss computation. The mapping is defined by an unknown optical flow itself, and therefore, it becomes difficult to apply temporal regularization before the flow is known. However, by implementing systems according to the present disclosure, with the feedback and feedforward methodology, the system is aided in learning the time-step flow, and it becomes possible to more accurately align the coordinate systems between the frames, thus propagating the previous frame flow into the correct positions in the current frame.
[0013] According to embodiments of the present disclosure, a method for processing a plurality of image frames to determine an optical flow
estimation of one or more pixels is provided. The method includes providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames, estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features, generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions, providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder, generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.
[0014] By taking into account occlusion estimation prior to generation of an estimated flow, increased accuracy of both occlusion presence and the optical flow can be achieved, as well as a reduction in resource usage. In addition, because previously estimated flows may be fed back through the system, there is no limit on temporal horizon, and by recursion, all prior frames may be used for future optical flow estimations.
[0015] The identifying may include generating, by a feature extractor, one or more feature pyramids by extracting one or more features from each of the two or more consecutive image frames, and providing at least one level of each of the one or more feature pyramids to the optical flow estimator.
[0016] The estimating a presence of one or more occlusions may include calculating an estimated correlated cost volume for one or more of the identified features over a plurality of displacements between the two or more consecutive image frames. [0017] The method may include providing the optical flow and the one or more occlusion maps to a refinement network to produce a refined optical flow.
[0018] The method may include providing, to at least one of the optical flow decoder, the occlusion estimator, and the refinement network, an estimated optical flow from a previous time step, the refinement network preferably comprising a convolutional neural network.
[0019] The optical flow decoder and the occlusion estimator may include convolutional neural networks.
[0020] The method may include transforming a flow coordinate system of the optical flow to a frame coordinate system of an image frame under consideration, the transforming comprising warping with bilinear
interpolation.
[0021] Warping may include at least one of forward warping and backward warping.
[0022] The feature extractor may be initialized with an initial estimated optical flow between a first and second image frame of the plurality of image frames, the initial optical flow being estimated prior to application of any warping.
[0023] The one or more convolutional neural networks may be trained end-to-end with weighted multi-task loss over the optical flow decoder and occlusion estimator.
[0024] The training may be performed at all scales according to the loss equation
L = Σ_{s=1}^{S} α_s L_F^s + α_o Σ_{s=1}^{S} α_s L_O^s

[0025] where α_s is the weight of the individual scale losses, α_o is the occlusion estimation weight, the sums are over all S spatial resolutions, L_F^s is the optimized flow loss, and L_O^s is the pixel-wise cross-entropy loss for occlusion.
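As an illustration only, the following is a minimal PyTorch sketch of such a weighted multi-task loss, assuming per-scale flow predictions and targets plus occlusion logits and labels are already available; the function name, the endpoint-style flow loss, and the default occlusion weight are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def multiscale_multitask_loss(flow_preds, flow_gts, occ_logits, occ_gts,
                              scale_weights, occ_weight=0.5):
    """Weighted multi-task loss over S spatial resolutions.

    flow_preds / flow_gts : lists of (B, 2, H_s, W_s) flow tensors, one per scale.
    occ_logits            : list of (B, 2, H_s, W_s) occluded / not-occluded logits.
    occ_gts               : list of (B, H_s, W_s) integer occlusion labels.
    scale_weights         : per-scale weights alpha_s.
    occ_weight            : occlusion weight alpha_o (hypothetical default).
    """
    total = flow_preds[0].new_zeros(())
    for a_s, f_pred, f_gt, o_logit, o_gt in zip(
            scale_weights, flow_preds, flow_gts, occ_logits, occ_gts):
        flow_loss = torch.norm(f_pred - f_gt, p=2, dim=1).mean()  # endpoint-style flow loss L_F^s
        occ_loss = F.cross_entropy(o_logit, o_gt)                 # pixel-wise cross-entropy L_O^s
        # alpha_s * L_F^s + alpha_o * alpha_s * L_O^s, summed over scales
        total = total + a_s * (flow_loss + occ_weight * occ_loss)
    return total
```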
[0026] The video sequence may include image frames obtained from a road scene in a vehicle, preferably an autonomously operated motor vehicle. [0027] According to further embodiments of the present disclosure, a non-transitory computer readable medium comprising instructions configured to cause a processor to carry out the method described above is provided.
[0028] The non-transitory computer readable medium may be mounted in a vehicle, preferably an autonomously operated motor vehicle. The non- transitory computer readable medium may comprise magnetic storage, optical storage, electronic storage, etc.
[0029] Still further embodiments of the present disclosure include a motor vehicle comprising a processor configured to carry out the method described above, wherein the processor may be further configured to actuate vehicle control systems based, at least in part, on the optical flow.
[0030] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.
[0031] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
[0032] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] Fig. 1 is an exemplary logical representation of an optical flow estimation system configured to account for occlusions prior to analysis of optical flow;
[0034] Fig. 2 shows an exemplary time based flow for optical flow estimation and occlusion refinement; and
[0035] Fig. 3 shows a flowchart highlighting an exemplary method according to embodiments of the present disclosure. DESCRIPTION OF THE EMBODIMENTS
[0036] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0037] The present disclosure relates to a method for processing image data in order to accurately estimate an optical flow of one or more pixels and/or features across a plurality of image frames.
[0038] The input data, therefore, may comprise a plurality of images, for example, from a road scene surrounding an ego vehicle, acquired over a period of time. The input data may be in any suitable format for provision to an input node of a neural network, e.g., a
convolutional neural network (CNN), also referred to herein as a "network." For example, an image data input may be in jpeg format, gif format, etc.
[0039] Image data of particular interest, although not limiting, is image data obtained from road scenes, for example as captured in front of a vehicle, either stopped or in motion.
[0040] Such image data may be used, for example, for recognition and tracking of objects relevant to a vehicle, or to the driver thereof, for example, during operation of the ego vehicle. Objects of interest may be any suitable object, such as, for example, the road and associated markings, pedestrians, vehicles, obstacles, traffic lights, etc.
[0041] Particularly, the present invention provides a method for estimating an optical flow of one or more objects or pixels thereof, across a plurality of frames of a video sequence.
[0042] Fig. 1 is an exemplary logical representation of an optical flow estimation system configured to account for occlusions prior to analysis of optical flow. [0043] Components of the optical flow estimation system of the present disclosure may include a machine learnable feature pyramid extractor 100, one or more occlusion estimators 110, and an optical flow decoder 2, among others. For example, a refinement network (shown at Fig. 2) may also be provided.
[0044] Learnable feature pyramid extractor 100 comprises a convolutional neural network configured to produce a feature pyramid given one or more input images I. For example, given two input images I_t and I_{t+1}, L-level pyramids of feature representations may be generated, with the bottom (zeroth) level being the input images, i.e., c_t^0 = I_t. To generate the feature representation at the l-th level, c_t^l, layers of convolutional filters may be used to downsample the features at the (l-1)-th pyramid level, c_t^{l-1}, for example, by a factor of 2.
[0045] According to embodiments of the present disclosure, each feature pyramid extractor 100 may comprise at least 3 levels (101a, 101b, 101c), for example, 6 levels (the further 3 levels are not shown in the drawings for purposes of clarity). Thus, from the first to the sixth levels of the feature pyramid extractor 100, the number of feature channels may be, for example, respectively 16, 32, 64, 96, 128, and 196.
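For illustration, a minimal PyTorch sketch of such a six-level learnable feature pyramid with 2x spatial downsampling per level and the channel widths quoted above is given below; the number of convolutions per level, kernel sizes, and activation choices are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class FeaturePyramidExtractor(nn.Module):
    """Six-level learnable feature pyramid; each level halves the resolution."""

    def __init__(self, in_channels=3, widths=(16, 32, 64, 96, 128, 196)):
        super().__init__()
        self.levels = nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.levels.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),  # downsample by 2
                nn.LeakyReLU(0.1),
                nn.Conv2d(w, w, kernel_size=3, stride=1, padding=1),
                nn.LeakyReLU(0.1),
            ))
            prev = w

    def forward(self, image):
        feats, x = [], image          # level 0 is the input image itself
        for level in self.levels:
            x = level(x)
            feats.append(x)           # features c_t^1 ... c_t^6
        return feats

# Usage: one pyramid per consecutive frame
# extractor = FeaturePyramidExtractor()
# feats_t  = extractor(torch.rand(1, 3, 256, 512))
# feats_t1 = extractor(torch.rand(1, 3, 256, 512))
```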
[0046] Output of at least one level of feature pyramid extractors 100 is fed to an occlusion estimator 110, as well as components of optical flow decoder 2, for example, at least one of a correlation cost volume estimator 105, a warping module 120, and a first optical flow estimation module 115a.
[0047] Optical flow decoder 2 may include, among others, one or more optical flow estimators 115, one or more forward and/or backward warping modules 120, one or more cost volume estimators 105, and one or more up samplers 112, among others. One of skill will understand that each of these components may be implemented within a single neural network (e.g., a convolutional neural network), or be implemented within its own individual neural network receiving inputs from the outputs of the other component neural networks during training and processing.
[0048] Logical configuration of optical flow decoder 2 follows the configuration of the optical flow decoder of PWC-NET described by D. Sun et al. in "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," arXiv: 1709.02371v3, 25 June 2018. In particular, Section 3 of this document, entitled "Approach" and starting at page 3 second column, through page 5, first column, provides one exemplary implementation of a useful optical decoder, and this section is herein incorporated by reference in the present specification.
[0049] Warping modules 120 may be provided and configured to receive, as input, an output from one or more layers of feature pyramid extractors 100. For example, warping may be applied to the output at the l-th level of feature pyramid 100, as shown at Fig. 1, warping the features of the second image I_{t+1} toward the first image using a 2x upsampled flow from the (l+1)-th level according to the following:

c_w^l(x) = c_{t+1}^l( x + up_2(w^{l+1})(x) )

where x is the pixel index and the upsampled flow up_2(w^{l+1}) is set to be zero at the top level.
[0050] Bilinear interpolation may be used to implement the warping operation and compute the gradients to the input CNN features and flow for backpropagation.
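Below is a minimal sketch of backward warping with bilinear interpolation, using torch.nn.functional.grid_sample so that gradients can flow to both the input features and the flow; the flow convention (in pixels, x then y) and the normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_t1, flow):
    """Backward-warp features of frame t+1 toward frame t with a flow field.

    feat_t1 : (B, C, H, W) features of the second image.
    flow    : (B, 2, H, W) flow in pixels; channel 0 is x, channel 1 is y.
    """
    b, _, h, w = feat_t1.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]              # sampling positions in x
    grid_y = ys.unsqueeze(0) + flow[:, 1]              # sampling positions in y
    # normalize to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(feat_t1, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```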
[0051] For nontranslational motion, warping may be implemented to compensate for geometric distortions and put image patches at a desired scale.
[0052] Additional warping modules 120 may be provided outside of the optical flow decoder 2 for purposes of translation of coordinate systems between image frames I_t and I_{t+1}, for example, as will be discussed in greater detail below. Such warping modules 120 may receive input from one or more of optical flow decoder 2 and refinement network 250, to facilitate performance of the coordinate translation.
[0053] Correlation cost estimators 105 may be configured to estimate correlation cost volumes for one or more features identified by feature pyramid extractor 100, over a plurality of displacements between two or more consecutive image frames I_t and I_{t+1}. Correlation cost volume is a value based on a computational/energy cost for associating a pixel in a first frame I_t at time t with its corresponding pixel at a subsequent frame I_{t+1} of an image sequence.
[0054] Computation and processing of cost volume is generally known in the art. For example, taking as inputs two tensors T_1 and T_2, both from R^{H×W×C}, let D = {-d_max, ..., 0, ..., d_max} and let d be from D×D. Then the output of the correlation cost volume is a tensor Y from R^{H×W×|D×D|}:

Y(x, d) = cv(x, d) = F(T_1, x)^T F(T_2, x + d),

where F returns a slice along the channels dimension of the input tensor and x is from {1, ..., H} × {1, ..., W}.
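For illustration, a brute-force PyTorch sketch of such a correlation cost volume over a square window of displacements follows; the maximum displacement d_max and the normalization by the channel count are assumptions chosen for clarity rather than efficiency.

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat1, feat2, d_max=4):
    """Correlation cost volume between two feature maps.

    feat1, feat2 : (B, C, H, W) feature tensors for frames t and t+1.
    Returns      : (B, (2*d_max + 1)**2, H, W), one channel per displacement d.
    """
    b, c, h, w = feat1.shape
    padded = F.pad(feat2, [d_max, d_max, d_max, d_max])   # pad W and H by d_max
    volumes = []
    for dy in range(2 * d_max + 1):
        for dx in range(2 * d_max + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # dot product over channels, normalized by the channel count
            volumes.append((feat1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(volumes, dim=1)
```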
[0055] In the present disclosure, partial cost volume at multiple feature pyramid levels (e.g., levels 1-6) is implemented, such that correlation cost volume may be estimated for identified features across the feature
pyramid 100.
[0056] Occlusion estimators 110 are configured to estimate the presence of occlusions based on the identified features from feature extractor 100 and the correlation cost volume determined by correlation cost estimation modules 105. The inventors have determined that when the cost at a particular position of the cost volume is high over all examined displacements, the pixel is likely occluded in the next frame. Therefore, the output of the first occlusion estimator (i.e., a pre-flow estimation occlusion map) can be fed to the optical flow estimator along with the cost volume data used for generating that occlusion map, resulting in a more accurately estimated optical flow. [0057] The accuracy improvement derives, at least in part, from the fact that the occlusion estimation does not rely on an imprecise flow estimate that did not account for occlusions, thereby allowing the optical flow estimator to benefit from the additional input.
[0058] Both the optical flow estimators 115 and the occlusions estimators
110 may work in a coarse-to-fine manner with higher resolution estimators receiving upsampled flow estimates from the lower resolution estimators.
[0059] Occlusion estimators 110 may implement, for example, five convolutional layers with D, D/2, D/4, D/8 and two output channels
(occluded/not occluded maps), D corresponding to the number of correlation cost volume layers. In addition, each layer may use ReLU activation, or alternatively, certain layers, for example, the final layer, may implement soft-max activation.
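A minimal sketch of such an occlusion estimator head in PyTorch, assuming 3x3 kernels and ReLU activations with a final soft-max; the exact inputs concatenated into in_channels (features, cost volume, previous flow) are left to the caller and are an assumption.

```python
import torch
import torch.nn as nn

class OcclusionEstimator(nn.Module):
    """Five-layer CNN head producing occluded / not-occluded maps.

    Channel widths follow the D, D/2, D/4, D/8, 2 pattern, where D is the
    number of correlation cost volume channels; kernel sizes are assumed.
    """

    def __init__(self, in_channels, d):
        super().__init__()
        layers, prev = [], in_channels
        for width in (d, d // 2, d // 4, d // 8):
            layers += [nn.Conv2d(prev, width, kernel_size=3, padding=1), nn.ReLU()]
            prev = width
        layers += [nn.Conv2d(prev, 2, kernel_size=3, padding=1)]  # two output channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.net(x)
        return logits.softmax(dim=1)   # soft-max over occluded / not-occluded
```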
[0060] Fig. 2 shows an exemplary time based flow for optical flow estimation and occlusion refinement while Fig. 3 shows a flowchart highlighting an exemplary method according to embodiments of the present disclosure.
[0061] A plurality of images may be received, for example, as part of a video stream (step 305).
[0062] Feature pyramid 100 may then process the images to identify features therein and generate feature maps associated with the images (step 310). Features at certain levels of feature pyramid 100 may be fed forward to, for example, optical flow estimator 115b, correlation cost estimator 105b, warping module 120, etc. For example, as shown at Fig. 1, features in feature pyramid extractor 100 are downsampled spatially 2x with each level, and channels increased with each level. The linking with correlation cost estimator 105a and flow estimator 115a then proceeds along a coarse-to-fine scheme: i.e., starting with features having the lowest spatial resolution, flow estimator 115a estimates the optical flow at that resolution using the cost volume values built by correlation cost estimator 105a using the same features.
[0063] The flow is then upsampled (e.g., 2x) and combined with features having higher resolution. This is repeated until the final resolution is reached.
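The coarse-to-fine linking can be sketched schematically as below, reusing the warp_features sketch given earlier and assuming per-level cost volume, occlusion, and flow estimator callables are supplied; this is an interpretation of the scheme described above, not the patented network.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_flow(feats_t, feats_t1, cost_volume, occlusion_head, flow_head):
    """Estimate flow coarse-to-fine over a feature pyramid.

    feats_t, feats_t1 : per-level feature lists (finest first), as from the pyramid sketch.
    cost_volume       : callable(feat1, feat2) -> cost volume tensor.
    occlusion_head    : callable(tensor) -> occlusion map.
    flow_head         : callable(tensor) -> flow estimate at the current scale.
    """
    flow, occ = None, None
    # iterate from the coarsest level to the finest
    for f_t, f_t1 in zip(reversed(feats_t), reversed(feats_t1)):
        if flow is not None:
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear",
                                       align_corners=True)   # 2x upsampled flow
            f_t1 = warp_features(f_t1, flow)                  # warp toward frame t (sketch above)
        cv = cost_volume(f_t, f_t1)
        occ = occlusion_head(torch.cat([cv, f_t], dim=1))     # occlusion estimated before flow
        inputs = [cv, f_t, occ] + ([flow] if flow is not None else [])
        flow = flow_head(torch.cat(inputs, dim=1))
    return flow, occ
```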
[0064] In further detail, once the initial set of feature maps for an image
I_t and a second image I_{t+1} are created by feature pyramid 100, the feature maps may be provided to cost volume estimator 105a for cost volume estimation between I_t and I_{t+1}, based on the feature maps. The cost volume estimation between the images may then be provided to occlusion estimator 110a and a first optical flow estimator 115a in parallel, to allow occlusion estimator 110a to estimate the presence of one or more occlusions in the image frames based on the cost volume as well as the optical flow from t-1, and optical flow estimator 115a to estimate an optical flow on the features from feature pyramid 100 at the present resolution (step 315).
[0065] Where flow is being analyzed between a first and second image frame of a sequence, an optical flow from t-1 is not available. Therefore, in order to provide an initialization optical flow simulating t-1, the feature extractor 100 as well as the occlusion estimator 110a may be initialized with an initial estimated optical flow between first and second image frames of the plurality of image frames, the initial optical flow being estimated prior to application of any warping in warping module 120. In other words, a first pass through the optical flow decoder 2 may be performed with first and second image frames of the image sequence, and an optical flow estimated, preferably without application of warping module 120. This initialization optical flow may then be provided as the t-1 optical flow to the components of the system.
[0066] Once the occlusions from images I_t to I_{t+1} have been estimated by occlusion estimator 110, occlusion maps 5a for the estimated occlusions may be created (step 320) and these maps 5a fed forward to optical flow estimator 115a, upsampler 112b, etc. [0067] Optical flow estimator 115a may then create an initial optical flow estimate 1a based on the occlusion maps 5a, features from feature extractor 100, cost volume information from cost volume estimator 105a, and the warped previous optical flow 1b from time step t-1.
[0068] The initial optical flow estimate may then be, for example, upsampled at a 2x upsampling rate by upsampler 112a. As noted above, the flow is estimated on the coarsest scale first using the features of corresponding resolution. To get to a higher resolution, the flow is upsampled and used together with the cost volume to estimate the higher resolution flow, and this is repeated until the final resolution is reached. The output at the final resolution may then be provided to a warping module 120 to be processed as described above, as well as to a second cost volume estimator 105b, occlusion estimator 110b, etc.
[0069] Occlusion maps 5a may be fed to an upsampler 112b to be upsampled at, for example, 2x, with the resulting data sent to second occlusion estimator 110b. In occlusion estimator 110b, the upsampled initial optical flow estimate 1a, the cost volume from cost volume estimator 105b, and the warped optical flow estimate from time t-1 are used to create a final occlusion map 5b.
[0070] In parallel, following the upsampling, warping, and second cost volume calculations, the initial optical flow estimate 1a may be provided to optical flow estimator 115b, which, using the final occlusion map 5b, features from feature pyramid 100, and the optical flow from time t-1, among others, generates a final optical flow estimate 1b between images I_t and I_{t+1} (step 330).
[0071] As shown in Fig. 2, and as noted above, optical flow and occlusion estimations may be iteratively refined by a refinement network 250 to further improve accuracy. One example of such a refinement network is described at Section 4.1 of Ilg et al., "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," 6 December 2016, the contents of this section being incorporated herein by reference.
[0072] According to embodiments of the present disclosure, refinement network 250 (see Fig. 2) may have a similar architecture to the optical flow decoder of FlowNet2 and/or PWC-Net. For example, starting from the refinement network (i.e., the Context Network described at page 4) described by PWC-Net, the DenseNet connections may be removed. Then, instead of using the input images and associated warps, features from feature pyramid 100 on the corresponding scale and associated warps can be substituted, thus providing a richer input representation. The input error channel for these features can then be computed as a sum of the L1 loss and structural similarity (SSIM).
[0073] According to the present disclosure, the present inventors have determined that improved results may be obtained using two refinement applications, with diminishing gains obtained with further applications.
[0074] As noted above, PWC-Net forms the basis of the optical flow decoder 2 of the present disclosure; however, the disclosure describes additional temporal connections to the optical flow decoder 2. These temporal connections 220 give the optical flow decoder 2, the occlusion estimators 110, and the refinement network 250 additional input, namely, the estimated flow from a previous time step. See, for example, arrows 220 in Figs. 1 and 2.
[0075] When processing video sequences longer than two image frames, these connections allow the network to learn typical relations between the previous and current time step flows and to use them in the current frame flow estimation. During evaluation, the connections also allow continual estimation of the flow on longer sequences and improve the flow with increasing sequence length.
[0076] However, the coordinate systems in which the two optical flows are expressed differ and need to be transformed to correspond to one another in order to apply the previous flow to the correct pixels in the current time step. Thus, forward and/or backward warping may be implemented to perform this transformation.
[0077] Forward warping may be used to transform the coordinate system from time step t - 1 using the optical flow F_{t-1} itself (the forward flow between images I_{t-1} and I_t). The warped flow F̂_{t-1} is computed as

F̂_{t-1}( x + F_{t-1}(x) ) = F_{t-1}(x)

for all pixel positions x, taking care of the positions to which the flow F_{t-1} maps more than once. In such cases the larger of the mapped flows is preserved, thereby prioritizing larger motions and thus faster moving objects. Although the experiments show the usefulness of this warping, the main disadvantage of this approach is that the transformation is not differentiable. Thus, the training cannot propagate gradients through this step and relies on the shared weights only.
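A minimal, deliberately simple sketch of such forward warping with the larger-flow priority follows; the nearest-pixel rounding of target positions is an assumption, and the per-pixel loop favors clarity over speed (and is, as noted, not differentiable).

```python
import torch

def forward_warp_flow(flow_prev):
    """Forward-warp a flow field from time t-1 into the coordinate frame of time t.

    flow_prev : (2, H, W) flow in pixels.  Where several source pixels map to the
    same target position, the flow with the larger magnitude is kept, prioritizing
    faster moving objects.  The operation is not differentiable.
    """
    _, h, w = flow_prev.shape
    warped = torch.zeros_like(flow_prev)
    best_mag = torch.full((h, w), -1.0, device=flow_prev.device)
    ys, xs = torch.meshgrid(torch.arange(h, device=flow_prev.device),
                            torch.arange(w, device=flow_prev.device), indexing="ij")
    tx = (xs + flow_prev[0]).round().long()       # target x positions (nearest pixel)
    ty = (ys + flow_prev[1]).round().long()       # target y positions (nearest pixel)
    mag = flow_prev.norm(dim=0)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    for y, x in zip(*torch.nonzero(valid, as_tuple=True)):
        if mag[y, x] > best_mag[ty[y, x], tx[y, x]]:
            best_mag[ty[y, x], tx[y, x]] = mag[y, x]
            warped[:, ty[y, x], tx[y, x]] = flow_prev[:, y, x]
    return warped
```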
[0078] Alternatively, the coordinate system may be transformed using the backward flow B_t from frame t to frame t - 1. This may require an extra evaluation of the network, but then the warping is a direct application of the differentiable spatial transformer. In other words, the warping step can be implemented by a differentiable spatial transformation, and can thus be trained end-to-end.
[0079] The gradients may therefore be propagated through the temporal connections during training.
[0080] One of skill will recognize that end-to-end training of the described network(s) can be implemented in a number of ways. For example, starting from simple datasets (e.g., simple objects, rigid motions, etc.), of which the FlyingChairs and FlyingThings datasets are part and which are readily available for download, other datasets may be introduced into the training. Such datasets may include Driving, KITTI15, VirtualKITTI, Sintel, and HD1K, following a "curriculum learning" approach. [0081] As some datasets may contain only a subset of required
modalities, the loss can be set to zero when the modality is missing (i.e., "no training").
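A small sketch of how a missing modality can be handled during training, assuming per-modality loss tensors and availability flags; multiplying by zero keeps the computation graph intact while contributing no gradient.

```python
def total_training_loss(losses, available):
    """Sum per-modality losses, zeroing those whose ground truth is missing.

    losses    : dict mapping modality name (e.g. 'flow', 'occlusion') to a loss tensor.
    available : dict mapping the same names to booleans for the current dataset.
    """
    return sum(loss if available.get(name, False) else loss * 0.0
               for name, loss in losses.items())
```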
[0082] It may further be possible to obtain improved results by first training the portion of the network corresponding to PWC-Net (as described above) using the simplest datasets and then adding the additional modules (i.e., occlusion estimators 110a, 110b, upsampler 112b) following the simple training. This may result in increased rates of optimization by pretraining parts of the network and avoiding local minima.
[0083] The present invention also includes a computer program product which provides the functionality of any of the methods according to the present invention when executed on a computing device. Such computer program product can be tangibly embodied in a carrier medium carrying machine-readable code for execution by a programmable processor. The present invention thus relates to a carrier medium carrying a computer program product that, when executed on computing means, provides instructions for executing any of the methods as described above.
[0084] The term "carrier medium" refers to any medium that participates in providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as a storage device which is part of mass storage.
Common forms of computer readable media include a CD-ROM, a DVD, a flexible disk or floppy disk, a tape, a memory chip or cartridge, or any other medium from which a computer can read. Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0085] The computer program product can also be transmitted via a carrier wave in a network, such as a LAN, a WAN or the Internet.
Transmission media can take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Transmission media include coaxial cables, copper wire and fibre optics, including the wires that comprise a bus within a computer.
[0086] Based on the output of the network, an optical flow estimation for each pixel between an image at time t and an image at time t+1 may be generated.
[0087] In addition, the media may be installed in a vehicle, for example, an autonomously operated vehicle, and the method configured to operate within one or more ECUs of the vehicle. The improved optical flow data may be used for tracking of various objects and elements in a road scene during operation of a vehicle. In addition, based on movements and tracking of said movements, a vehicle ECU may be provided with information to enable decision making in the autonomous operation mode.
[0088] Throughout the description, including the claims, the term
"comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms
"substantially" and/or "approximately" and/or "generally" should be understood to mean falling within such accepted tolerances.
[0089] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0090] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

CLAIMS
1. A method for processing a plurality of image frames to determine an optical flow estimation of one or more pixels, the method comprising:
providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames; estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features;
generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions;
providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder; and
generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.
2. The method according to claim 1, wherein the identifying comprises: generating, by a feature extractor, one or more feature pyramids by extracting one or more features from each of the two or more consecutive image frames; and
providing at least one level of each of the one or more feature pyramids to the optical flow estimator.
3. The method according to any of claims 1-2, wherein the estimating a presence of one or more occlusions includes calculating an estimated correlated cost volume for one or more of the identified features over a plurality of displacements between the two or more consecutive image frames.
4. The method according to any of claims 1-3, comprising providing the optical flow and the one or more occlusion maps to a refinement network to produce a refined optical flow.
5. The method according to claim 4, comprising providing, to at least one of the optical flow decoder, the occlusion estimator, and the refinement network, an estimated optical flow from a previous time step, the refinement network preferably comprising a convolutional neural network.
6. The method according to any of claims 1-5, wherein the optical flow decoder and the occlusion estimator comprise one or more convolutional neural networks.
7. The method according to any of claims 1-6, comprising transforming a flow coordinate system of the optical flow to a frame coordinate system of an image frame under consideration, the transforming comprising warping with bilinear interpolation.
8. The method according to claim 7, wherein the warping comprises at least one of forward warping and backward warping.
9. The method according to any of claims 2-8, wherein the feature extractor is initialized with an initial estimated optical flow between a first and second image frame of the plurality of image frames, the initial optical flow being estimated prior to application of warping.
10. The method according to claim 6, wherein the one or more convolutional neural networks are trained end-to-end with a weighted multi-task loss over the optical flow decoder and the occlusion estimator.
11. The method according to claim 10, wherein the training is performed at all scales according to the loss equation:
L = \sum_{s=1}^{S} \alpha_s L_F^s + \alpha_o \sum_{s=1}^{S} \alpha_s L_O^s
where \alpha_s is the weight of the individual scale losses, \alpha_o is the occlusion estimation weight, the sums are over all S spatial resolutions, L_F^s is the optimized optical flow loss, and L_O^s is the pixel-wise cross-entropy loss for occlusion estimation.
12. The method according to any of claims 1-11, wherein the video sequence comprises image frames obtained from a road scene in a vehicle, preferably an autonomously operated motor vehicle.
13. A non-transitory computer readable medium comprising
instructions configured to cause a processor to carry out the method according to any of claims 1-12.
14. The non-transitory computer readable medium according to claim 13, wherein the non-transitory computer readable medium is mounted in a vehicle, preferably an autonomously operated motor vehicle.
15. A motor vehicle comprising a processor configured to carry out the method according to any of claims 1-12, wherein the processor is further configured to actuate vehicle control systems based, at least in part, on the optical flow.
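By way of a non-limiting structural sketch of the method of claim 1, and not the claimed implementation itself, the example below concatenates a one-channel occlusion map produced by an occlusion estimator with image features before an optical flow decoder. All layer sizes, channel counts, tensor shapes, and names are assumptions made for the example.

import torch
import torch.nn as nn

class OcclusionEstimator(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, feat_t, feat_t1):
        # One-channel occlusion map in [0, 1] for the pair of frames.
        return torch.sigmoid(self.net(torch.cat([feat_t, feat_t1], dim=1)))

class FlowDecoder(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),  # two channels: (dx, dy) per pixel
        )

    def forward(self, feat_t, feat_t1, occlusion_map):
        # The occlusion map is supplied to the decoder together with the features.
        return self.net(torch.cat([feat_t, feat_t1, occlusion_map], dim=1))

# Usage with hypothetical feature maps from two consecutive image frames:
feat_t, feat_t1 = torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)
occlusion_map = OcclusionEstimator(32)(feat_t, feat_t1)
flow = FlowDecoder(32)(feat_t, feat_t1, occlusion_map)  # (1, 2, 48, 64)

Providing the occlusion map to the decoder allows the flow estimate to discount correspondences at pixels where no visual match exists.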
PCT/EP2018/079903 2018-10-31 2018-10-31 Methods for optical flow estimation WO2020088766A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021547880A JP7228172B2 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Publications (1)

Publication Number Publication Date
WO2020088766A1 true WO2020088766A1 (en) 2020-05-07

Family

ID=64109865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Country Status (2)

Country Link
JP (1) JP7228172B2 (en)
WO (1) WO2020088766A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5680487A (en) * 1991-12-23 1997-10-21 Texas Instruments Incorporated System and method for determining optical flow
US20070092122A1 (en) * 2005-09-15 2007-04-26 Jiangjian Xiao Method and system for segment-based optical flow estimation
US20100194741A1 (en) * 2009-01-30 2010-08-05 Microsoft Corporation Depth map movement tracking via optical flow and velocity prediction

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
D. SUN ET AL.: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", ARXIV:1709.02371V3, 25 June 2018 (2018-06-25)
EDDY ILG ET AL: "Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 August 2018 (2018-08-06), XP081412610 *
FAN LIJIE ET AL: "End-to-End Learning of Motion Representation for Video Understanding", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 6016 - 6025, XP033473517, DOI: 10.1109/CVPR.2018.00630 *
ILG ET AL., FLOWNET 2.0: EVOLUTION OF OPTICAL FLOW ESTIMATION WITH DEEP NETWORKS, 6 December 2016 (2016-12-06)
KENNEDY RYAN ET AL: "Optical Flow with Geometric Occlusion Estimation and Fusion of Multiple Frames", 13 January 2015, INTERNATIONAL CONFERENCE ON COMPUTER ANALYSIS OF IMAGES AND PATTERNS. CAIP 2017: COMPUTER ANALYSIS OF IMAGES AND PATTERNS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 364 - 377, ISBN: 978-3-642-17318-9, XP047303951 *
LI XU ET AL: "Motion detail preserving optical flow estimation", 2010 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 13-18 JUNE 2010, SAN FRANCISCO, CA, USA, IEEE, PISCATAWAY, NJ, USA, 13 June 2010 (2010-06-13), pages 1293 - 1300, XP031725650, ISBN: 978-1-4244-6984-0 *
MEISTER ET AL.: "Unflow: Unsupervised Learning of Optical Flow With a Bidirectional Census Loss", AAAI, 2018
SUN DEQING ET AL: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 8934 - 8943, XP033473818, DOI: 10.1109/CVPR.2018.00931 *
YANG ET AL.: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", CVPR, 2018
YANG WANG ET AL: "Occlusion Aware Unsupervised Learning of Optical Flow", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 November 2017 (2017-11-16), XP080837653 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112132871A (en) * 2020-08-05 2020-12-25 天津(滨海)人工智能军民融合创新中心 Visual feature point tracking method and device based on feature optical flow information, storage medium and terminal
CN112132871B (en) * 2020-08-05 2022-12-06 天津(滨海)人工智能军民融合创新中心 Visual feature point tracking method and device based on feature optical flow information, storage medium and terminal
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN112465872B (en) * 2020-12-10 2022-08-26 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN112465872A (en) * 2020-12-10 2021-03-09 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN112785629A (en) * 2021-01-21 2021-05-11 陕西师范大学 Aurora motion characterization method based on unsupervised deep optical flow network
CN113658231A (en) * 2021-07-07 2021-11-16 北京旷视科技有限公司 Optical flow prediction method, optical flow prediction device, electronic device, and storage medium
CN113658231B (en) * 2021-07-07 2023-09-26 北京旷视科技有限公司 Optical flow prediction method and device, electronic equipment and storage medium
CN116546183A (en) * 2023-04-06 2023-08-04 华中科技大学 3D dynamic video generation method based on single frame image
CN116546183B (en) * 2023-04-06 2024-03-22 华中科技大学 Dynamic image generation method and system with parallax effect based on single frame image
CN116883913A (en) * 2023-09-05 2023-10-13 长江信达软件技术(武汉)有限责任公司 Ship identification method and system based on video stream adjacent frames
CN116883913B (en) * 2023-09-05 2023-11-21 长江信达软件技术(武汉)有限责任公司 Ship identification method and system based on video stream adjacent frames

Also Published As

Publication number Publication date
JP2022509375A (en) 2022-01-20
JP7228172B2 (en) 2023-02-24

Similar Documents

Publication Publication Date Title
WO2020088766A1 (en) Methods for optical flow estimation
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
Eldesokey et al. Confidence propagation through cnns for guided sparse depth regression
JP7106665B2 (en) MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF
JP6837158B2 (en) Video identification and training methods, equipment, electronic devices and media
Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks
Fischer et al. Flownet: Learning optical flow with convolutional networks
US10810745B2 (en) Method and apparatus with image segmentation
KR102235745B1 (en) Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
Yin et al. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields
Iyer et al. Geometric consistency for self-supervised end-to-end visual odometry
Lee et al. Depth completion using plane-residual representation
US11049270B2 (en) Method and apparatus for calculating depth map based on reliability
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN116686017A (en) Time bottleneck attention architecture for video action recognition
CN109300151B (en) Image processing method and device and electronic equipment
Qu et al. Depth completion via deep basis fitting
Chang et al. Attention-aware feature aggregation for real-time stereo matching on edge devices
Zhou et al. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss
Schuster et al. Ssgp: Sparse spatial guided propagation for robust and generic interpolation
CN111696110A (en) Scene segmentation method and system
Liu et al. Understanding road layout from videos as a whole
CN113159236A (en) Multi-focus image fusion method and device based on multi-scale transformation
EP3977359A1 (en) Mixture distribution estimation for future prediction
Bayramli et al. Raft-msf: Self-supervised monocular scene flow using recurrent optimizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18796916; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021547880; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18796916; Country of ref document: EP; Kind code of ref document: A1)