WO2020088766A1 - Methods for optical flow estimation - Google Patents

Methods for optical flow estimation

Info

Publication number
WO2020088766A1
Authority
WO
WIPO (PCT)
Prior art keywords
optical flow
occlusion
estimator
image frames
estimated
Prior art date
Application number
PCT/EP2018/079903
Other languages
French (fr)
Inventor
Nikolay CHUMERIN
Michal NEORAL
Jan Sochman
Jirí MATAS
Original Assignee
Toyota Motor Europe
Czech Technical University
Priority date
Filing date
Publication date
Application filed by Toyota Motor Europe and Czech Technical University
Priority to JP2021547880A (JP7228172B2)
Priority to PCT/EP2018/079903
Publication of WO2020088766A1

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0253Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle


Abstract

A method for processing a plurality of image frames to determine an optical flow estimation of one or more pixels is provided. The method includes providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames, estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features, generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions, providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder, and generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.

Description

METHODS FOR OPTICAL FLOW ESTIMATION
FIELD OF THE DISCLOSURE
[0001] The present invention relates to systems and methods for image processing, and more particularly to a neural network implemented optical flow estimation method.
BACKGROUND OF THE DISCLOSURE
[0002] Optical flow is a two-dimensional displacement field describing the projection of scene motion between two or more images. Occlusions caused by scene motion or other factors complicate optical flow estimation, i.e., at occluded pixels no visual correspondences exist.
[0003] Optical flow estimation is a core computer vision problem and has many applications, e.g., action recognition, autonomous driving, and video editing.
[0004] Previously performed methods that have not used convolutional neural networks (CNN) addressed this problem by using regularization which extrapolated the optical flow from surrounding, non-occluded areas.
[0005] In current state-of-the-art CNN based algorithms the
regularization is only implicit and the network learns how much reliance may be placed on identified correspondences and how much to extrapolate.
[0006] Previous approaches dealing with occlusions more directly have first estimated initial forward and backward optical flows, with occlusions being identified using the forward-backward consistency check. Occlusion maps are then used for estimation of the final optical flow.
[0007] Further, according to some previous solutions, three frames, with the middle frame as the reference frame, have been used to define a coordinate system for loss computation. The forward flow to the future frame and the backward flow to the past frame are then calculated and applied to enable some regularization of these two optical flows.
[0008] Yang et al., "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," CVPR 2018, discloses a CNN model for generation of an estimated optical flow. However, no consideration for how to treat occlusions is discussed.
[0009] Meister et al., "Unflow: Unsupervised Learning of Optical Flow With a Bidirectional Census Loss," AAAI 2018, discloses the use of bidirectional flow estimation for handling occlusions in the optical flow estimation.
SUMMARY OF THE DISCLOSURE
[0010] The present inventors have determined that in the prior methods, occlusions affect the initial optical flow estimation from the very outset of the analysis, and thus, the final solution is adversely affected by failing to consider the initial effect caused by the occlusions.
[0011] In addition, the inventors have recognized that by feeding back previous estimated optical flows to a current occlusion/flow analysis, a CNN is able to learn typical relations between the previous and current time step optical flow, and therefore, allow the network to use these relations in the time step undergoing occlusion/flow estimation.
[0012] Further, optical flow estimation over more than two frames results in a need for pixels to be mapped to a reference coordinate system for loss computation. The mapping is defined by an unknown optical flow itself, and therefore, it becomes difficult to apply temporal regularization before the flow is known. However, by implementing systems according to the present disclosure, with the feedback and feedforward methodology, the system is aided in learning the time-step flow, and it becomes possible to more accurately align the coordinate systems between the frames, thus propagating the previous frame flow into the correct positions in the current frame.
[0013] According to embodiments of the present disclosure, a method for processing a plurality of image frames to determine an optical flow
estimation of one or more pixels is provided. The method includes providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames, estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features, generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions, providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder, generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.
[0014] By taking into account occlusion estimation prior to generation of an estimated flow, increased accuracy of both occlusion presence and the optical flow can be achieved, as well as a reduction in resource usage. In addition, because previously estimated flows may be fed back through the system, there is no limit on temporal horizon, and by recursion, all prior frames may be used for future optical flow estimations.
[0015] The identifying may include generating, by a feature extractor, one or more feature pyramids by extracting one or more features from each of the two or more consecutive image frames, and providing at least one level of each of the one or more feature pyramids to the optical flow estimator.
[0016] The estimating a presence of one or more occlusions may include calculating an estimated correlated cost volume for one or more of the identified features over a plurality of displacements between the two or more consecutive image frames. [0017] The method may include providing the optical flow and the one or more occlusion maps to a refinement network to produce a refined optical flow.
[0018] The method may include providing, to at least one of the optical flow decoder, the occlusion estimator, and the refinement network, an estimated optical flow from a previous time step, the refinement network preferably comprising a convolutional neural network.
[0019] The optical flow decoder and the occlusion estimator may include convolutional neural networks.
[0020] The method may include transforming a flow coordinate system of the optical flow to a frame coordinate system of an image frame under consideration, the transforming comprising warping with bilinear
interpolation.
[0021] Warping may include at least one of forward warping and backward warping.
[0022] The feature extractor may be initialized with an initial estimated optical flow between a first and second image frame of the plurality of image frames, the initial optical flow being estimated prior to application of any warping.
[0023] The one or more convolutional neural networks may be trained end-to-end with weighted multi-task loss over the optical flow decoder and occlusion estimator.
[0024] The training may be performed at all scales according to the loss equation
L = Σ_{s=1}^{S} α_s L_F^s + α_o Σ_{s=1}^{S} α_s L_O^s

[0025] where α_s is the weight of the individual scale losses, α_o is the occlusion estimation weight, the sums are over all S spatial resolutions, L_F^s is the optimized flow loss, and L_O^s is the pixel-wise cross-entropy loss for occlusion.
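As an illustration only, the following is a minimal PyTorch sketch of such a weighted multi-task loss, assuming per-scale flow predictions and targets plus occlusion logits and labels are already available; the function name, the endpoint-style flow loss, and the default occlusion weight are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def multiscale_multitask_loss(flow_preds, flow_gts, occ_logits, occ_gts,
                              scale_weights, occ_weight=0.5):
    """Weighted multi-task loss over S spatial resolutions.

    flow_preds / flow_gts : lists of (B, 2, H_s, W_s) flow tensors, one per scale.
    occ_logits            : list of (B, 2, H_s, W_s) occluded / not-occluded logits.
    occ_gts               : list of (B, H_s, W_s) integer occlusion labels.
    scale_weights         : per-scale weights alpha_s.
    occ_weight            : occlusion weight alpha_o (hypothetical default).
    """
    total = flow_preds[0].new_zeros(())
    for a_s, f_pred, f_gt, o_logit, o_gt in zip(
            scale_weights, flow_preds, flow_gts, occ_logits, occ_gts):
        flow_loss = torch.norm(f_pred - f_gt, p=2, dim=1).mean()  # endpoint-style flow loss L_F^s
        occ_loss = F.cross_entropy(o_logit, o_gt)                 # pixel-wise cross-entropy L_O^s
        # alpha_s * L_F^s + alpha_o * alpha_s * L_O^s, summed over scales
        total = total + a_s * (flow_loss + occ_weight * occ_loss)
    return total
```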
[0026] The video sequence may include image frames obtained from a road scene in a vehicle, preferably an autonomously operated motor vehicle. [0027] According to further embodiments of the present disclosure, a non-transitory computer readable medium comprising instructions configured to cause a processor to carry out the method described above is provided.
[0028] The non-transitory computer readable medium may be mounted in a vehicle, preferably an autonomously operated motor vehicle. The non- transitory computer readable medium may comprise magnetic storage, optical storage, electronic storage, etc.
[0029] Still further embodiments of the present disclosure include a motor vehicle comprising a processor configured to carry out the method described above, wherein the processor may be further configured to actuate vehicle control systems based, at least in part, on the optical flow.
[0030] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.
[0031] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.
[0032] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] Fig. 1 is an exemplary logical representation of an optical flow estimation system configured to account for occlusions prior to analysis of optical flow;
[0034] Fig. 2 shows an exemplary time based flow for optical flow estimation and occlusion refinement; and
[0035] Fig. 3 shows a flowchart highlighting an exemplary method according to embodiments of the present disclosure. DESCRIPTION OF THE EMBODIMENTS
[0036] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0037] The present disclosure relates to a method for processing image data in order to accurately estimate an optical flow of one or more pixels and/or features across a plurality of image frames.
[0038] The input data, therefore, may comprise a plurality of images, for example, from a road scene surrounding an ego vehicle, acquired over a period of time. The input data may be in any suitable format for provision to an input node of a neural network, e.g., a
convolutional neural network (CNN), also referred to herein as a "network." For example, an image data input may be in jpeg format, gif format, etc.
[0039] Image data of particular interest, although not limiting, is image data obtained from road scenes, for example as captured in front of a vehicle, either stopped or in motion.
[0040] Such image data may be used, for example, for recognition and tracking of objects relevant to a vehicle, or to the driver thereof, for example, during operation of the ego vehicle. Objects of interest may be any suitable object, such as, for example, the road and associated markings, pedestrians, vehicles, obstacles, traffic lights, etc.
[0041] Particularly, the present invention provides a method for estimating an optical flow of one or more objects or pixels thereof, across a plurality of frames of a video sequence.
[0042] Fig. 1 is an exemplary logical representation of an optical flow estimation system configured to account for occlusions prior to analysis of optical flow. [0043] Components of the optical flow estimation system of the present disclosure may include a machine learnable feature pyramid extractor 100, one or more occlusion estimators 110, and an optical flow decoder 2, among others. For example, a refinement network (shown at Fig. 2) may also be provided.
[0044] Learnable feature pyramid extractor 100 comprises a convolutional neural network configured to produce a feature pyramid given one or more input images I. For example, given two input images I_t and I_{t+1}, L-level pyramids of feature representations may be generated, with the bottom (zeroth) level being the input images, i.e., c_t^0 = I_t. To generate the feature representation at the l-th level, c_t^l, layers of convolutional filters may be used to downsample the features at the (l-1)-th pyramid level, c_t^{l-1}, for example, by a factor of 2.
[0045] According to embodiments of the present disclosure, each feature pyramid extractor 100 may comprise at least 3 levels (101a, 101b, 101c), for example, 6 levels (the further 3 levels are not shown in the drawings for purposes of clarity). Thus, from the first to the sixth levels of the feature pyramid extractor 100, the number of feature channels may be, for example, respectively 16, 32, 64, 96, 128, and 196.
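For illustration, a minimal PyTorch sketch of such a six-level learnable feature pyramid with 2x spatial downsampling per level and the channel widths quoted above is given below; the number of convolutions per level, kernel sizes, and activation choices are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class FeaturePyramidExtractor(nn.Module):
    """Six-level learnable feature pyramid; each level halves the resolution."""

    def __init__(self, in_channels=3, widths=(16, 32, 64, 96, 128, 196)):
        super().__init__()
        self.levels = nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.levels.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),  # downsample by 2
                nn.LeakyReLU(0.1),
                nn.Conv2d(w, w, kernel_size=3, stride=1, padding=1),
                nn.LeakyReLU(0.1),
            ))
            prev = w

    def forward(self, image):
        feats, x = [], image          # level 0 is the input image itself
        for level in self.levels:
            x = level(x)
            feats.append(x)           # features c_t^1 ... c_t^6
        return feats

# Usage: one pyramid per consecutive frame
# extractor = FeaturePyramidExtractor()
# feats_t  = extractor(torch.rand(1, 3, 256, 512))
# feats_t1 = extractor(torch.rand(1, 3, 256, 512))
```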
[0046] Output of at least one level of feature pyramid extractors 100 is fed to an occlusion estimator 110, as well as components of optical flow decoder 2, for example, at least one of a correlation cost volume estimator 105, a warping module 120, and a first optical flow estimation module 115a.
[0047] Optical flow decoder 2 may include, among others, one or more optical flow estimators 115, one or more forward and/or backward warping modules 120, one or more cost volume estimators 105, and one or more up samplers 112, among others. One of skill will understand that each of these components may be implemented within a single neural network (e.g., a convolutional neural network), or be implemented within its own individual neural network receiving inputs from the outputs of the other component neural networks during training and processing.
[0048] Logical configuration of optical flow decoder 2 follows the configuration of the optical flow decoder of PWC-NET described by D. Sun et al. in "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," arXiv: 1709.02371v3, 25 June 2018. In particular, Section 3 of this document, entitled "Approach" and starting at page 3 second column, through page 5, first column, provides one exemplary implementation of a useful optical decoder, and this section is herein incorporated by reference in the present specification.
[0049] Warping modules 120 may be provided and configured to receive, as input, an output from one or more layers of feature pyramid extractors 100. For example, warping may be applied to the output at the l-th level of feature pyramid 100, as shown at Fig. 1, warping the features of the second image I_{t+1} toward the first image using a 2x upsampled flow from the (l+1)-th level according to the following:

c_w^l(x) = c_{t+1}^l( x + up_2(w^{l+1})(x) )

where x is the pixel index and the upsampled flow up_2(w^{l+1}) is set to be zero at the top level.
[0050] Bilinear interpolation may be used to implement the warping operation and compute the gradients to the input CNN features and flow for backpropagation.
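Below is a minimal sketch of backward warping with bilinear interpolation, using torch.nn.functional.grid_sample so that gradients can flow to both the input features and the flow; the flow convention (in pixels, x then y) and the normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_t1, flow):
    """Backward-warp features of frame t+1 toward frame t with a flow field.

    feat_t1 : (B, C, H, W) features of the second image.
    flow    : (B, 2, H, W) flow in pixels; channel 0 is x, channel 1 is y.
    """
    b, _, h, w = feat_t1.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]              # sampling positions in x
    grid_y = ys.unsqueeze(0) + flow[:, 1]              # sampling positions in y
    # normalize to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(feat_t1, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```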
[0051] For nontranslational motion, warping may be implemented to compensate for geometric distortions and put image patches at a desired scale.
[0052] Additional warping modules 120 may be provided outside of the optical flow decoder 2 for purposes of translation of coordinate systems between image frames I_t and I_{t+1}, for example, as will be discussed in greater detail below. Such warping modules 120 may receive input from one or more of optical flow decoder 2 and refinement network 250, to facilitate performance of the coordinate translation.
[0053] Correlation cost estimators 105 may be configured to estimate correlation cost volumes for one or more features identified by feature pyramid extractor 100, over a plurality of displacements between two or more consecutive image frames I_t and I_{t+1}. Correlation cost volume is a value based on a computational/energy cost for associating a pixel in a first frame I_t at time t with its corresponding pixel at a subsequent frame I_{t+1} of an image sequence.
[0054] Computation and processing of cost volume is generally known in the art. For example, taking as inputs two tensors T_1 and T_2, both from R^{H×W×C}, let D = {-d_max, ..., 0, ..., d_max} and let d be from D×D. Then the output of the correlation cost volume is a tensor Y from R^{H×W×|D×D|}:

Y(x, d) = cv(x, d) = F(T_1, x)^T F(T_2, x + d),

where F returns a slice along the channels dimension of the input tensor and x is from {1, ..., H} × {1, ..., W}.
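For illustration, a brute-force PyTorch sketch of such a correlation cost volume over a square window of displacements follows; the maximum displacement d_max and the normalization by the channel count are assumptions chosen for clarity rather than efficiency.

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat1, feat2, d_max=4):
    """Correlation cost volume between two feature maps.

    feat1, feat2 : (B, C, H, W) feature tensors for frames t and t+1.
    Returns      : (B, (2*d_max + 1)**2, H, W), one channel per displacement d.
    """
    b, c, h, w = feat1.shape
    padded = F.pad(feat2, [d_max, d_max, d_max, d_max])   # pad W and H by d_max
    volumes = []
    for dy in range(2 * d_max + 1):
        for dx in range(2 * d_max + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # dot product over channels, normalized by the channel count
            volumes.append((feat1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(volumes, dim=1)
```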
[0055] In the present disclosure, partial cost volume at multiple feature pyramid levels (e.g., levels 1-6) is implemented, such that correlation cost volume may be estimated for identified features across the feature
pyramid 100.
[0056] Occlusion estimators 110 are configured to estimate the presence of occlusions based on the identified features from feature extractor 100 and the correlation cost volume determined by correlation cost estimation modules 105. The inventors have determined that when the cost at a particular position of the cost volume is high over all examined displacements, the pixel is likely occluded in the next frame. Therefore, the output of the first occlusion estimator (i.e., a pre-flow estimation occlusion map) can be fed to the optical flow estimator along with the cost volume data used for generating that occlusion map, resulting in a more accurately estimated optical flow. [0057] The accuracy improvement derives, at least in part, from the fact that the occlusion estimation does not rely on an imprecise flow estimate that did not account for occlusions, thereby allowing the optical flow estimator to benefit from the additional input.
[0058] Both the optical flow estimators 115 and the occlusions estimators
110 may work in a coarse-to-fine manner with higher resolution estimators receiving upsampled flow estimates from the lower resolution estimators.
[0059] Occlusion estimators 110 may implement, for example, five convolutional layers with D, D/2, D/4, D/8 and two output channels
(occluded/not occluded maps), D corresponding to the number of correlation cost volume layers. In addition, each layer may use ReLU activation, or alternatively, certain layers, for example, the final layer, may implement soft-max activation.
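A minimal sketch of such an occlusion estimator head in PyTorch, assuming 3x3 kernels and ReLU activations with a final soft-max; the exact inputs concatenated into in_channels (features, cost volume, previous flow) are left to the caller and are an assumption.

```python
import torch
import torch.nn as nn

class OcclusionEstimator(nn.Module):
    """Five-layer CNN head producing occluded / not-occluded maps.

    Channel widths follow the D, D/2, D/4, D/8, 2 pattern, where D is the
    number of correlation cost volume channels; kernel sizes are assumed.
    """

    def __init__(self, in_channels, d):
        super().__init__()
        layers, prev = [], in_channels
        for width in (d, d // 2, d // 4, d // 8):
            layers += [nn.Conv2d(prev, width, kernel_size=3, padding=1), nn.ReLU()]
            prev = width
        layers += [nn.Conv2d(prev, 2, kernel_size=3, padding=1)]  # two output channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.net(x)
        return logits.softmax(dim=1)   # soft-max over occluded / not-occluded
```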
[0060] Fig. 2 shows an exemplary time based flow for optical flow estimation and occlusion refinement while Fig. 3 shows a flowchart highlighting an exemplary method according to embodiments of the present disclosure.
[0061] A plurality of images may be received, for example, as part of a video stream (step 305).
[0062] Feature pyramid 100 may then process the images to identify features therein and generate feature maps associated with the images (step 310). Features at certain levels of feature pyramid 100 may be fed forward to, for example, optical flow estimator 115b, correlation cost estimator 105b, warping module 120, etc. For example, as shown at Fig. 1, features in feature pyramid extractor 100 are downsampled spatially 2x with each level, and channels increased with each level. The linking with correlation cost estimator 105a and flow estimator 115a then proceeds along a coarse-to-fine scheme: i.e., starting with features having the lowest spatial resolution, flow estimator 115a estimates the optical flow at that resolution using the cost volume values built by correlation cost estimator 105a using the same features.
[0063] The flow is then upsampled (e.g., 2x) and combined with features having higher resolution. This is repeated until the final resolution is reached.
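The coarse-to-fine linking can be sketched schematically as below, reusing the warp_features sketch given earlier and assuming per-level cost volume, occlusion, and flow estimator callables are supplied; this is an interpretation of the scheme described above, not the patented network.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_flow(feats_t, feats_t1, cost_volume, occlusion_head, flow_head):
    """Estimate flow coarse-to-fine over a feature pyramid.

    feats_t, feats_t1 : per-level feature lists (finest first), as from the pyramid sketch.
    cost_volume       : callable(feat1, feat2) -> cost volume tensor.
    occlusion_head    : callable(tensor) -> occlusion map.
    flow_head         : callable(tensor) -> flow estimate at the current scale.
    """
    flow, occ = None, None
    # iterate from the coarsest level to the finest
    for f_t, f_t1 in zip(reversed(feats_t), reversed(feats_t1)):
        if flow is not None:
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear",
                                       align_corners=True)   # 2x upsampled flow
            f_t1 = warp_features(f_t1, flow)                  # warp toward frame t (sketch above)
        cv = cost_volume(f_t, f_t1)
        occ = occlusion_head(torch.cat([cv, f_t], dim=1))     # occlusion estimated before flow
        inputs = [cv, f_t, occ] + ([flow] if flow is not None else [])
        flow = flow_head(torch.cat(inputs, dim=1))
    return flow, occ
```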
[0064] In further detail, once the initial set of feature maps for an image
I_t and a second image I_{t+1} are created by feature pyramid 100, the feature maps may be provided to cost volume estimator 105a for cost volume estimation between I_t and I_{t+1}, based on the feature maps. The cost volume estimation between the images may then be provided to occlusion estimator 110a and a first optical flow estimator 115a in parallel, to allow occlusion estimator 110a to estimate the presence of one or more occlusions in the image frames based on the cost volume as well as the optical flow from t-1, and optical flow estimator 115a to estimate an optical flow on the features from feature pyramid 100 at the present resolution (step 315).
[0065] Where flow is being analyzed between a first and second image frame of a sequence, an optical flow from t-1 is not available. Therefore, in order to provide an initialization optical flow simulating t-1, the feature extractor 100 as well as the occlusion estimator 110a may be initialized with an initial estimated optical flow between first and second image frames of the plurality of image frames, the initial optical flow being estimated prior to application of any warping in warping module 120. In other words, a first pass through the optical flow decoder 2 may be performed with first and second image frames of the image sequence, and an optical flow estimated, preferably without application of warping module 120. This initialization optical flow may then be provided as the t-1 optical flow to the components of the system.
[0066] Once the occlusions from images I_t to I_{t+1} have been estimated by occlusion estimator 110, occlusion maps 5a for the estimated occlusions may be created (step 320) and these maps 5a fed forward to optical flow estimator 115a, upsampler 112b, etc. [0067] Optical flow estimator 115a may then create an initial optical flow estimate 1a based on the occlusion maps 5a, features from feature extractor 100, cost volume information from cost volume estimator 105a, and the warped previous optical flow 1b from time step t-1.
[0068] The initial optical flow estimate may then be, for example, upsampled at a 2x upsampling rate by upsampler 112a. As noted above, the flow is estimated on the coarsest scale first using the features of corresponding resolution. To get to a higher resolution, the flow is upsampled and used together with the cost volume to estimate the higher resolution flow, and this is repeated until the final resolution is reached. The output at the final resolution may then be provided to a warping module 120 to be processed as described above, as well as to a second cost volume estimator 105b, occlusion estimator 110b, etc.
[0069] Occlusion maps 5a may be fed to an upsampler 112b to be upsampled at, for example, 2x, with the resulting data sent to second occlusion estimator 110b. In occlusion estimator 110b, the upsampled initial optical flow estimate 1a, the cost volume from cost volume estimator 105b, and the warped optical flow estimate from time t-1 are used to create a final occlusion map 5b.
[0070] In parallel, following the upsampling, warping, and second cost volume calculations, the initial optical flow estimate 1a may be provided to optical flow estimator 115b, which, using the final occlusion map 5b, features from feature pyramid 100, and the optical flow from time t-1, among others, generates a final optical flow estimate 1b between images I_t and I_{t+1} (step 330).
[0071] As shown in Fig. 2, and as noted above, optical flow and occlusion estimations may be iteratively refined by a refinement network 250 to further improve accuracy. One example of such a refinement network is described at Section 4.1 of Ilg et al., "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," 6 December 2016, the contents of this section being incorporated herein by reference.
[0072] According to embodiments of the present disclosure, refinement network 250 (see Fig. 2) may have a similar architecture to the optical flow decoder of FlowNet2 and/or PWC-Net. For example, starting from the refinement network (i.e., the Context Network described at page 4) described by PWC-Net, the DenseNet connections may be removed. Then, instead of using the input images and associated warps, features from feature pyramid 100 on the corresponding scale and associated warps can be substituted, thus providing a richer input representation. The input error channel for these features can then be computed as a sum of the L1 loss and structural similarity (SSIM).
[0073] According to the present disclosure, the present inventors have determined that improved results may be obtained using two refinement applications, with diminishing gains obtained with further applications.
[0074] As noted above, PWC-Net forms the basis of the optical flow decoder 2 of the present disclosure; however, the disclosure describes additional temporal connections to the optical flow decoder 2. These temporal connections 220 give the optical flow decoder 2, the occlusion estimators 110, and the refinement network 250 additional input, namely, the estimated flow from a previous time step. See, for example, arrows 220 in Figs. 1 and 2.
[0075] When processing video sequences longer than two image frames, these connections allow the network to learn typical relations between the previous and current time step flows and to use them in the current frame flow estimation. During evaluation, the connections also allow continual estimation of the flow on longer sequences and improve the flow with increasing sequence length.
[0076] However, the coordinate systems in which the two optical flows are expressed differ and need to be transformed to correspond to one another in order to apply the previous flow to the correct pixels in the current time step. Thus, forward and/or backward warping may be implemented to perform this transformation.
[0077] Forward warping may be used to transform the coordinate system from time step t - 1 using the optical flow F_{t-1} itself (the forward flow between images I_{t-1} and I_t). The warped flow F̂_{t-1} is computed as

F̂_{t-1}( x + F_{t-1}(x) ) = F_{t-1}(x)

for all pixel positions x, taking care of the positions to which the flow F_{t-1} maps more than once. In such cases the larger of the mapped flows is preserved, thereby prioritizing larger motions and thus faster moving objects. Although the experiments show the usefulness of this warping, the main disadvantage of this approach is that the transformation is not differentiable. Thus, the training cannot propagate gradients through this step and relies on the shared weights only.
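A minimal, deliberately simple sketch of such forward warping with the larger-flow priority follows; the nearest-pixel rounding of target positions is an assumption, and the per-pixel loop favors clarity over speed (and is, as noted, not differentiable).

```python
import torch

def forward_warp_flow(flow_prev):
    """Forward-warp a flow field from time t-1 into the coordinate frame of time t.

    flow_prev : (2, H, W) flow in pixels.  Where several source pixels map to the
    same target position, the flow with the larger magnitude is kept, prioritizing
    faster moving objects.  The operation is not differentiable.
    """
    _, h, w = flow_prev.shape
    warped = torch.zeros_like(flow_prev)
    best_mag = torch.full((h, w), -1.0, device=flow_prev.device)
    ys, xs = torch.meshgrid(torch.arange(h, device=flow_prev.device),
                            torch.arange(w, device=flow_prev.device), indexing="ij")
    tx = (xs + flow_prev[0]).round().long()       # target x positions (nearest pixel)
    ty = (ys + flow_prev[1]).round().long()       # target y positions (nearest pixel)
    mag = flow_prev.norm(dim=0)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    for y, x in zip(*torch.nonzero(valid, as_tuple=True)):
        if mag[y, x] > best_mag[ty[y, x], tx[y, x]]:
            best_mag[ty[y, x], tx[y, x]] = mag[y, x]
            warped[:, ty[y, x], tx[y, x]] = flow_prev[:, y, x]
    return warped
```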
[0078] Alternatively, the coordinate system may be transformed using the backward flow B_t from frame t to frame t - 1. This may require an extra evaluation of the network, but then the warping is a direct application of the differentiable spatial transformer. In other words, the warping step can be implemented by a differentiable spatial transformation, and can thus be trained end-to-end.
[0079] The gradients may therefore be propagated through the temporal connections during training.
[0080] One of skill will recognize that end-to-end training of the described network(s) can be implemented in a number of ways. For example, starting from simple datasets (e.g., simple objects, rigid motions, etc.), of which the FlyingChairs and FlyingThings datasets are part and which are readily available for download, other datasets may be introduced into the training. Such datasets may include Driving, KITTI15, VirtualKITTI, Sintel, and HD1K, following a "curriculum learning" approach. [0081] As some datasets may contain only a subset of required
modalities, the loss can be set to zero when the modality is missing (i.e., "no training").
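A small sketch of how a missing modality can be handled during training, assuming per-modality loss tensors and availability flags; multiplying by zero keeps the computation graph intact while contributing no gradient.

```python
def total_training_loss(losses, available):
    """Sum per-modality losses, zeroing those whose ground truth is missing.

    losses    : dict mapping modality name (e.g. 'flow', 'occlusion') to a loss tensor.
    available : dict mapping the same names to booleans for the current dataset.
    """
    return sum(loss if available.get(name, False) else loss * 0.0
               for name, loss in losses.items())
```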
[0082] It may further be possible to obtain improved results by first training the portion of the network corresponding to PWC-Net (as described above) using the simplest datasets and then adding the additional modules (i.e., occlusion estimators 110a, 110b, upsampler 112b) following the simple training. This may result in increased rates of optimization by pretraining parts of the network and avoiding local minima.
[0083] The present invention also includes a computer program product which provides the functionality of any of the methods according to the present invention when executed on a computing device. Such computer program product can be tangibly embodied in a carrier medium carrying machine-readable code for execution by a programmable processor. The present invention thus relates to a carrier medium carrying a computer program product that, when executed on computing means, provides instructions for executing any of the methods as described above.
[0084] The term "carrier medium" refers to any medium that participates in providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as a storage device which is part of mass storage.
Common forms of computer readable media include a CD-ROM, a DVD, a flexible disk or floppy disk, a tape, a memory chip or cartridge, or any other medium from which a computer can read. Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0085] The computer program product can also be transmitted via a carrier wave in a network, such as a LAN, a WAN or the Internet.
Transmission media can take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Transmission media include coaxial cables, copper wire and fibre optics, including the wires that comprise a bus within a computer.
[0086] Based on the output of the network, an optical flow estimation for each pixel between an image at time t and an image at time t+1 may be generated.
[0087] In addition, the media may be installed in a vehicle, for example, an autonomously operated vehicle, and the method configured to operate within one or more ECUs of the vehicle. The improved optical flow data may be used for tracking of various objects and elements in a road scene during operation of a vehicle. In addition, based on movements and tracking of said movements, a vehicle ECU may be provided with information to enable decision making in the autonomous operation mode.
[0088] Throughout the description, including the claims, the term
"comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms
"substantially" and/or "approximately" and/or "generally" should be understood to mean falling within such accepted tolerances.
[0089] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0090] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

CLAIMS
1. A method for processing a plurality of image frames to determine an optical flow estimation of one or more pixels, the method comprising:
providing a plurality of image frames of a video sequence and identifying features within each image frame of the plurality of image frames; estimating, by an occlusion estimator, a presence of one or more occlusions in two or more consecutive image frames of the video sequence based on at least the identified features;
generating, by the occlusion estimator, one or more occlusion maps based on the estimated presence of the one or more occlusions;
providing the one or more occlusion maps to an optical flow estimator of an optical flow decoder; and
generating, by the optical flow decoder, an estimated optical flow for one or more pixels across the plurality of image frames based on the identified features and the one or more occlusion maps.
2. The method according to claim 1, wherein the identifying comprises: generating, by a feature extractor, one or more feature pyramids by extracting one or more features from each of the two or more consecutive image frames; and
providing at least one level of each of the one or more feature pyramids to the optical flow estimator.
3. The method according to any of claims 1-2, wherein the estimating a presence of one or more occlusions includes calculating an estimated correlated cost volume for one or more of the identified features over a plurality of displacements between the two or more consecutive image frames.
4. The method according to any of claims 1-3, comprising providing the optical flow and the one or more occlusion maps to a refinement network to produce a refined optical flow.
5. The method according to claim 4, comprising providing, to at least one of the optical flow decoder, the occlusion estimator, and the refinement network, an estimated optical flow from a previous time step, the refinement network preferably comprising a convolutional neural network.
6. The method according to any of claims 1-5, wherein the optical flow decoder and the occlusion estimator comprise one or more convolutional neural networks.
7. The method according to any of claims 1-6, comprising transforming a flow coordinate system of the optical flow to a frame coordinate system of an image frame under consideration, the transforming comprising warping with bilinear interpolation.
8. The method according to claim 7, wherein the warping comprises at least one of forward warping and backward warping.
9. The method according to any of claims 2-8, wherein the feature extractor is initialized with an initial estimated optical flow between a first and second image frame of the plurality of image frames, the initial optical flow being estimated prior to application of warping.
10. The method according to claim 6, wherein the one or more convolutional neural networks are trained end-to-end with a weighted multi-task loss over the optical flow decoder and the occlusion estimator.
11. The method according to claim 10, wherein the training is performed at all scales according to the loss equation:
L = \sum_{s=1}^{S} \alpha_s L_F^s + \alpha_o \sum_{s=1}^{S} \alpha_s L_O^s
where \alpha_s is the weight of the individual scale losses, \alpha_o is the occlusion estimation weight, the sums are over all S spatial resolutions, L_F^s is the optimized optical flow loss, and L_O^s is the pixel-wise cross-entropy loss for occlusion estimation.
12. The method according to any of claims 1-11, wherein the video sequence comprises image frames obtained from a road scene in a vehicle, preferably an autonomously operated motor vehicle.
13. A non-transitory computer readable medium comprising
instructions configured to cause a processor to carry out the method according to any of claims 1-12.
14. The non-transitory computer readable medium according to claim 13, wherein the non-transitory computer readable medium is mounted in a vehicle, preferably an autonomously operated motor vehicle.
15. A motor vehicle comprising a processor configured to carry out the method according to any of claims 1-12, wherein the processor is further configured to actuate vehicle control systems based, at least in part, on the optical flow.
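By way of a non-limiting structural sketch of the method of claim 1, and not the claimed implementation itself, the example below concatenates a one-channel occlusion map produced by an occlusion estimator with image features before an optical flow decoder. All layer sizes, channel counts, tensor shapes, and names are assumptions made for the example.

import torch
import torch.nn as nn

class OcclusionEstimator(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, feat_t, feat_t1):
        # One-channel occlusion map in [0, 1] for the pair of frames.
        return torch.sigmoid(self.net(torch.cat([feat_t, feat_t1], dim=1)))

class FlowDecoder(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),  # two channels: (dx, dy) per pixel
        )

    def forward(self, feat_t, feat_t1, occlusion_map):
        # The occlusion map is supplied to the decoder together with the features.
        return self.net(torch.cat([feat_t, feat_t1, occlusion_map], dim=1))

# Usage with hypothetical feature maps from two consecutive image frames:
feat_t, feat_t1 = torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)
occlusion_map = OcclusionEstimator(32)(feat_t, feat_t1)
flow = FlowDecoder(32)(feat_t, feat_t1, occlusion_map)  # (1, 2, 48, 64)

Providing the occlusion map to the decoder allows the flow estimate to discount correspondences at pixels where no visual match exists.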
PCT/EP2018/079903 2018-10-31 2018-10-31 Methods for optical flow estimation WO2020088766A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021547880A JP7228172B2 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Publications (1)

Publication Number Publication Date
WO2020088766A1 true WO2020088766A1 (en) 2020-05-07

Family

ID=64109865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/079903 WO2020088766A1 (en) 2018-10-31 2018-10-31 Methods for optical flow estimation

Country Status (2)

Country Link
JP (1) JP7228172B2 (en)
WO (1) WO2020088766A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5680487A (en) * 1991-12-23 1997-10-21 Texas Instruments Incorporated System and method for determining optical flow
US20070092122A1 (en) * 2005-09-15 2007-04-26 Jiangjian Xiao Method and system for segment-based optical flow estimation
US20100194741A1 (en) * 2009-01-30 2010-08-05 Microsoft Corporation Depth map movement tracking via optical flow and velocity prediction

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
D. SUN ET AL.: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", ARXIV:1709.02371V3, 25 June 2018 (2018-06-25)
EDDY ILG ET AL: "Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 August 2018 (2018-08-06), XP081412610 *
FAN LIJIE ET AL: "End-to-End Learning of Motion Representation for Video Understanding", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 6016 - 6025, XP033473517, DOI: 10.1109/CVPR.2018.00630 *
ILG ET AL., FLOWNET 2.0: EVOLUTION OF OPTICAL FLOW ESTIMATION WITH DEEP NETWORKS, 6 December 2016 (2016-12-06)
KENNEDY RYAN ET AL: "Optical Flow with Geometric Occlusion Estimation and Fusion of Multiple Frames", 13 January 2015, INTERNATIONAL CONFERENCE ON COMPUTER ANALYSIS OF IMAGES AND PATTERNS. CAIP 2017: COMPUTER ANALYSIS OF IMAGES AND PATTERNS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 364 - 377, ISBN: 978-3-642-17318-9, XP047303951 *
LI XU ET AL: "Motion detail preserving optical flow estimation", 2010 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 13-18 JUNE 2010, SAN FRANCISCO, CA, USA, IEEE, PISCATAWAY, NJ, USA, 13 June 2010 (2010-06-13), pages 1293 - 1300, XP031725650, ISBN: 978-1-4244-6984-0 *
MEISTER ET AL.: "Unflow: Unsupervised Learning of Optical Flow With a Bidirectional Census Loss", AAAI, 2018
SUN DEQING ET AL: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 8934 - 8943, XP033473818, DOI: 10.1109/CVPR.2018.00931 *
YANG ET AL.: "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume", CVPR, 2018
YANG WANG ET AL: "Occlusion Aware Unsupervised Learning of Optical Flow", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 November 2017 (2017-11-16), XP080837653 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582483A (en) * 2020-05-14 2020-08-25 哈尔滨工程大学 Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112132871A (en) * 2020-08-05 2020-12-25 天津(滨海)人工智能军民融合创新中心 Visual feature point tracking method and device based on feature optical flow information, storage medium and terminal
CN112132871B (en) * 2020-08-05 2022-12-06 天津(滨海)人工智能军民融合创新中心 Visual feature point tracking method and device based on feature optical flow information, storage medium and terminal
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN112465872B (en) * 2020-12-10 2022-08-26 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN112465872A (en) * 2020-12-10 2021-03-09 南昌航空大学 Image sequence optical flow estimation method based on learnable occlusion mask and secondary deformation optimization
CN112785629A (en) * 2021-01-21 2021-05-11 陕西师范大学 Aurora motion characterization method based on unsupervised deep optical flow network
CN113658231A (en) * 2021-07-07 2021-11-16 北京旷视科技有限公司 Optical flow prediction method, optical flow prediction device, electronic device, and storage medium
CN113658231B (en) * 2021-07-07 2023-09-26 北京旷视科技有限公司 Optical flow prediction method and device, electronic equipment and storage medium
CN116546183A (en) * 2023-04-06 2023-08-04 华中科技大学 3D dynamic video generation method based on single frame image
CN116546183B (en) * 2023-04-06 2024-03-22 华中科技大学 Dynamic image generation method and system with parallax effect based on single frame image
CN116883913A (en) * 2023-09-05 2023-10-13 长江信达软件技术(武汉)有限责任公司 Ship identification method and system based on video stream adjacent frames
CN116883913B (en) * 2023-09-05 2023-11-21 长江信达软件技术(武汉)有限责任公司 Ship identification method and system based on video stream adjacent frames

Also Published As

Publication number Publication date
JP2022509375A (en) 2022-01-20
JP7228172B2 (en) 2023-02-24

Similar Documents

Publication Publication Date Title
WO2020088766A1 (en) Methods for optical flow estimation
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
Eldesokey et al. Confidence propagation through cnns for guided sparse depth regression
JP7106665B2 (en) MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF
JP6837158B2 (en) Video identification and training methods, equipment, electronic devices and media
Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks
Fischer et al. Flownet: Learning optical flow with convolutional networks
US10810745B2 (en) Method and apparatus with image segmentation
KR102235745B1 (en) Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
Yin et al. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields
Iyer et al. Geometric consistency for self-supervised end-to-end visual odometry
Lee et al. Depth completion using plane-residual representation
US11049270B2 (en) Method and apparatus for calculating depth map based on reliability
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN116686017A (en) Time bottleneck attention architecture for video action recognition
CN109300151B (en) Image processing method and device and electronic equipment
Qu et al. Depth completion via deep basis fitting
Chang et al. Attention-aware feature aggregation for real-time stereo matching on edge devices
Zhou et al. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss
Schuster et al. Ssgp: Sparse spatial guided propagation for robust and generic interpolation
CN111696110A (en) Scene segmentation method and system
Liu et al. Understanding road layout from videos as a whole
CN113159236A (en) Multi-focus image fusion method and device based on multi-scale transformation
EP3977359A1 (en) Mixture distribution estimation for future prediction
Bayramli et al. Raft-msf: Self-supervised monocular scene flow using recurrent optimizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18796916; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021547880; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18796916; Country of ref document: EP; Kind code of ref document: A1)