WO2024081455A1 - Methods and apparatus for optical flow estimation with contrastive learning - Google Patents

Methods and apparatus for optical flow estimation with contrastive learning

Info

Publication number
WO2024081455A1
WO2024081455A1 · PCT/US2023/069357
Authority
WO
WIPO (PCT)
Prior art keywords
feature
wise
response
loss
map
Prior art date
Application number
PCT/US2023/069357
Other languages
French (fr)
Inventor
Zhiqi ZHANG
Pan JI
Nitin Bansal
Changjiang Cai
Qingan Yan
Xiangyu Xu
Huangying ZHAN
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2024081455A1 publication Critical patent/WO2024081455A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method for a computing system includes determining a first features map in response to a first image and a second features map in response to a second image, implementing a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map, determining a warped feature map in response to the second image map, implementing a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped features map, determining a pixel-wise flow loss in response to the pixel-wise flow prediction and to pixel-wise ground truth data, and modifying parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.

Description

METHODS AND APPARATUS FOR OPTICAL FLOW ESTIMATION WITH
CONTRASTIVE LEARNING
CROSS-REFERENCE TO RELATED CASES
[0001] The present application is a non-provisional of and claims priority to U.S. App. No. 63/415,358, filed October 12, 2022. That application is herein incorporated by reference for all purposes.
BACKGROUND
[0002] The present invention relates to object movement prediction. More specifically, the invention relates to methods and apparatus for more accurate optical flow analysis.
[0003] Optical flow analysis / estimation is a critical component in several high-level vision problems such as action recognition, video segmentation, autonomous driving, editing, and the like. Traditional methods for optical flow estimation involve formulating the problem as an optimization problem using hand-crafted features, which can be empirical. Optical flow estimation is typically achieved by attempting to maximize the visual similarity between adjacent frames through energy minimization. Deep neural networks have been shown to be effective in optical flow estimation in terms of accuracy. These methods, however, still have limitations, such as difficulty handling occlusions and small, fast-moving objects, capturing global motion, and rectifying and recovering from early errors.
[0004] To improve the accuracy of end-to-end optical flow networks, synthetic datasets, not real world datasets, are often used for pre-training. Synthetic datasets are typically easier to generate and include labels or identifiers for all features appearing on the images, e.g. vehicle 1, pedestrian 2, building 5, etc., i.e. ground truth data. In contrast, real world data typically include very few labels, as such labeling has to be performed manually. Because large-scale data are necessary to train deep learning networks, synthetic datasets, i.e. computer-generated scenes and images, are often used for model training. A drawback to this strategy is that the resulting deep learning models have a tendency to overfit to the synthetic training datasets, which subsequently show performance drops on real-world data.
[0005] In light of the above, what are desired are solutions that address the above challenges, with reduced drawbacks.
SUMMARY
[0006] The present invention relates to object movement prediction. More specifically, the invention relates to methods and apparatus for more accurate optical flow analysis. [0007] Embodiments disclose a semi-supervised framework for improving the determination of optical flow. More specifically, real world datasets are used for optical flow estimation using automatically labeled features on the ground truth, real world data. Features may span contiguous groups or separate groups of pixels on an image. In operation, features within ground truth data (real world data) are automatically labeled with pseudo feature labels to form pseudo ground truth data. Subsequently, pseudo labels for features are typically maintained when the pseudo feature labels help reduce the optical flow loss, and pseudo labels for features may be removed or deleted, typically when the pseudo feature labels increase the optical flow loss. More specifically, pseudo labels may be assigned to features to form pseudo ground truth data. This pseudo ground truth data is then used to predict feature-wise flow for features in a first temporal feature map. The predicted feature-wise flow is termed a warped feature map, herein. Embodiments then use contrastive flow loss determinations upon the warped feature map and a second temporal feature map to determine feature-to-feature correspondence, or lack thereof. In some cases, the feature-wise contrastive flow loss may then be fed back to help determine whether the pseudo labels should be modified or maintained.
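For illustration only, the maintain-or-remove rule for pseudo labels described above can be sketched as follows. The function name, the dictionary representation of labels, and the greedy acceptance test are assumptions made for this sketch and are not recited by the embodiments.

```python
# Illustrative sketch of the keep-or-drop rule for pseudo feature labels.
# `feature_wise_loss` is assumed to be a callable that returns the contrastive
# flow loss for a given set of pseudo labels; the greedy test is a simplification.

def update_pseudo_labels(pseudo_labels, feature_wise_loss, baseline_loss):
    """Keep pseudo labels that reduce the contrastive flow loss, drop the rest."""
    kept = {}
    for label_id, label in pseudo_labels.items():
        candidate = dict(kept)
        candidate[label_id] = label
        loss_with_label = feature_wise_loss(candidate)
        if loss_with_label <= baseline_loss:
            kept[label_id] = label           # label helped reduce the loss: maintain it
            baseline_loss = loss_with_label
        # otherwise the label increased the loss and is removed (not kept)
    return kept, baseline_loss
```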
[0008] In some embodiments, the semi-supervised framework described herein provides optical flow feedback based upon feature-to-feature comparison of a feature map. This optical flow feedback may be combined with other optical flow systems, such as the recurrent all-pairs field transforms (RAFT) architecture (Teed 2020). Conventional optical flow systems such as RAFT typically perform a pixel-by-pixel optical flow analysis. By combining this contrastive loss function with the RAFT optical flow analysis, the optical flow predictions are improved.
[0009] According to one aspect, a method for a computing system for estimating optical flow is disclosed. A technique may include determining, in a computing system, a first features map in response to a first image, and a second features map in response to a second image map, and implementing, in the computing system, a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map. A process may include determining, in the computing system, a warped feature map in response to the second image map, and implementing, in the computing system, a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped features map. A method may include determining, in the computing system, a pixel-wise flow loss in response to the pixel-wise flow prediction and in response to pixel-wise ground truth data, and modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
[0010] According to another aspect, a computing system for estimating optical flow is disclosed. One device may include a pixel-based analysis system configured to determine a first features map in response to a first image, and a second features map in response to a second image map, wherein the pixel-based analysis system comprises a gated recurrent unit (GRU) configured to determine a pixel-wise flow prediction in response to the first features map and the second features map. An apparatus may include a feature-based analysis system coupled to the pixel-based optical flow analysis system, wherein the feature-based analysis system is configured to determine a warped feature map in response to the second image map, wherein the feature-based analysis system comprises a contrastive loss unit configured to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map. In some systems, a pixel-based analysis system is configured to modify parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
[0011] According to yet another aspect, a method is disclosed. One process may include operating an optical flow prediction system comprising contrastive loss functionality in response to a synthetic dataset to determine a first optical flow loss, and adjusting parameters of the optical flow prediction system in response to the first optical flow loss. A technique may include operating the optical flow prediction system in response to a real world dataset to determine a second optical flow loss, and adjusting parameters of the optical flow prediction system in response to the second optical flow loss.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] In order to more fully understand the present invention, reference is made to the accompanying drawings. Understanding that these drawings are not to be considered limitations in the scope of the invention, the presently described embodiments and the presently understood best mode of the invention are described with additional detail through use of the accompanying drawings in which:
[0013] Fig. 1 illustrates a functional block diagram of some embodiments of the present invention;
[0014] Figs. 2A-C illustrate a process diagram according to various embodiments of the present invention; and
[0015] Fig. 3 illustrates a system diagram according to various embodiments of the present invention.
DETAILED DESCRIPTION
[0016] Fig. 1 illustrates a logical block diagram of an embodiment of the present invention. More specifically, a system 100 includes a contrastive loss portion 102 and a recurrent all-pairs field transforms (RAFT) portion 104. Inputs into system 100 include a series of images 106. In some embodiments, input images 106 can include synthetic images, e.g. datasets including labeled features, where ground truth flow data 108 is known. In other embodiments, input images 106 can include real world images, e.g. datasets where some features are manually labeled and many features are unlabeled. In such cases, ground truth flow data 108 is not known for unlabeled features. System 100 may be termed RAFT-CF, herein.
[0017] In various embodiments, RAFT portion 104 includes three major components: a feature encoder 110 that determines feature maps 112 that store feature vectors for each pixel in input images 106; a correlation portion 114 that determines a four-dimensional (4D) correlation volume 116 that includes displacements based upon feature maps 112; and a gated recurrent unit (GRU) 118 that determines an optical flow prediction 120 for feature map 122. Optical flow prediction 120 is compared to ground truth flow data 108 to determine flow loss 124.
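As a concrete illustration of the correlation portion, an all-pairs 4D correlation volume of the general kind used by RAFT-style systems can be computed as below (PyTorch). The normalization by the square root of the channel count is an assumption borrowed from common open implementations, not a limitation of the embodiments.

```python
import torch

def correlation_volume(fmap1, fmap2):
    """Build an all-pairs correlation volume from two feature maps.

    fmap1, fmap2: tensors of shape (B, C, H, W).  Returns a (B, H, W, H, W)
    volume of dot products, i.e. a 4D correlation volume per batch item.
    This is a generic sketch, not the specific patented computation.
    """
    b, c, h, w = fmap1.shape
    f1 = fmap1.view(b, c, h * w)
    f2 = fmap2.view(b, c, h * w)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / c ** 0.5  # normalized dot products
    return corr.view(b, h, w, h, w)
```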
[0018] In operation, feature encoder 110 receives images 106 having a height, width, and color depth (e.g. HxWx3). Next, the positions of pixels (e.g. x and y positions) are determined and stored in the form of coordinate frames 126 (e.g. HxWx5). Next, encoders 128 are used to identify features, e.g. specific colored pixels, from coordinate frames 126. In various embodiments, the features may be stored as feature vectors in feature maps 112 at different resolutions.
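For illustration, HxWx5 coordinate frames of the kind described above can be formed by appending pixel-position channels to the RGB input; normalizing the coordinates to [0, 1] is an assumption of this sketch.

```python
import torch

def add_coordinate_channels(images):
    """Append normalized x / y position channels to an RGB batch.

    images: (B, 3, H, W).  Returns (B, 5, H, W), matching the HxWx5
    coordinate frames described above.
    """
    b, _, h, w = images.shape
    ys = torch.linspace(0, 1, h, device=images.device)
    xs = torch.linspace(0, 1, w, device=images.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing='ij')
    coords = torch.stack([grid_x, grid_y]).expand(b, -1, -1, -1)  # (B, 2, H, W)
    return torch.cat([images, coords], dim=1)
```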
[0019] In various embodiments, correlation layer 114 receives feature maps 112 and determines a multi-dimensional (e.g. four-dimensional, 4D) correlation volume 116. This 4D correlation volume 116 is typically determined by comparing visual similarity between pixels in feature maps 112 and then determining displacements of pixels in feature maps 112. The correlation volume 116 is then inputted to gated recurrent unit (GRU) 118, which warps an input feature map 146 and outputs optical flow prediction 120. Optical flow prediction 120 thus represents the predicted optical flow. [0020] In various embodiments, as illustrated, optical flow prediction 120 is then compared to ground truth flow data 108 to determine flow loss 124. Flow loss 124 typically indicates how well GRU 118 can predict the pixel-by-pixel optical flow. Flow loss 124 can be fed back 140 into GRU 118 to change its parameters. In some cases, if flow loss 124 is small, the change in parameters may be smaller than if flow loss 124 is large.
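For illustration, a minimal convolutional GRU cell of the general kind used for iterative flow refinement is sketched below; the channel sizes, gating layout, and flow head are assumptions of this sketch and are not the claimed GRU 118.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell that refines a hidden state and emits a flow delta.

    h: (B, hidden_dim, H, W) hidden state; x: (B, input_dim, H, W) correlation /
    context input.  Channel sizes are illustrative assumptions.
    """
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.flow_head = nn.Conv2d(hidden_dim, 2, 3, padding=1)  # per-pixel (dx, dy)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))             # update gate
        r = torch.sigmoid(self.convr(hx))             # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        h = (1 - z) * h + z * q                       # new hidden state
        return h, self.flow_head(h)                   # hidden state and flow update
```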
[0021] RAFT portion 104 is typically limited to determining predicted optical flow for situations where ground truth flow is known, i.e. with synthetic images and synthetic datasets. Additionally, it is typically limited to determining optical flow vectors on a pixel-by-pixel basis. As mentioned above, RAFT portion 104 may sometimes overfit synthetic datasets; thus, when trained, RAFT portion 104 may have trouble determining accurate optical flow vectors when provided with real world datasets.
[0022] In various embodiments, contrastive loss portion 102 is used to supplement RAFT portion 104. Contrastive loss portion 102 includes a feature warp process 144 that receives a features map 128 and a ground truth flow 130 and outputs a warped feature map 142. A contrastive loss process 132 receives warped feature map 142 and feature map 134 and provides contrastive loss feedback 136. In some cases, feature map 128 may be based upon synthetic datasets, and in other cases, feature map 128 may be based upon real world data. In the case of real world data, contrastive loss feedback 136 may be used to fine tune ground truth flow 130, as discussed below.
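The feature warp process 144 can be illustrated with standard bilinear backward warping; the use of grid_sample and a flow field in pixel units is an assumption of this sketch rather than the claimed procedure.

```python
import torch
import torch.nn.functional as F

def warp_feature_map(fmap, flow):
    """Warp a feature map with a dense flow field (backward warping).

    fmap: (B, C, H, W); flow: (B, 2, H, W) in pixels.  Uses bilinear sampling.
    """
    b, _, h, w = fmap.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=fmap.device, dtype=fmap.dtype),
        torch.arange(w, device=fmap.device, dtype=fmap.dtype),
        indexing='ij')
    grid_x = xs + flow[:, 0]                      # displaced x coordinates
    grid_y = ys + flow[:, 1]                      # displaced y coordinates
    # normalize to [-1, 1] as required by grid_sample
    grid = torch.stack([2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(fmap, grid, mode='bilinear', align_corners=True)
```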
[0023] In operation, ground truth flow 130 may specify an optical flow for manually labeled features from real world data, e.g. a house, a car, a bicycle, or the like. Additionally, in some cases, ground truth flow 130 may specify a flow for predicted features or features guessed from the images, e.g. a first blob of pixels may be guessed and labeled to be a pedestrian, a second blob of similar colored pixels may be guessed and labeled as a ball, and the like. Such features are called pseudo labels, and ground truth flow 108 with pseudo labels is called pseudo ground truth flow.
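For illustration, blobs of pixels can be grouped into candidate pseudo features with a connected-components pass; the scoring source, threshold, and minimum blob size below are assumptions of this sketch, not a required labeling method.

```python
import numpy as np
from scipy import ndimage

def guess_pseudo_labels(segmentation_scores, threshold=0.5, min_pixels=50):
    """Group high-scoring pixels into blobs and treat each blob as a pseudo feature.

    segmentation_scores: (H, W) array of per-pixel scores from any off-the-shelf
    detector or color/texture heuristic (an assumption of this sketch).
    """
    mask = segmentation_scores > threshold
    labeled, num_blobs = ndimage.label(mask)          # connected-component blobs
    pseudo_labels = {}
    for blob_id in range(1, num_blobs + 1):
        pixels = np.argwhere(labeled == blob_id)      # (row, col) pixel coordinates
        if len(pixels) >= min_pixels:                 # ignore tiny, noisy blobs
            pseudo_labels[blob_id] = pixels
    return pseudo_labels
```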
[0024] In various embodiments, the optical flows of the labeled features and the pseudo labeled features in ground truth flow 130 are used to process features map 128. More specifically, within feature warp process 144, ground truth flow 130 warps features map 128 and outputs warped feature map 142. Next, a contrastive loss process 132 is performed by contrasting warped feature map 142 to feature map 134. In various embodiments, the contrasting is based upon features (e.g. groups of pixels), not individual pixels. Contrastive loss process 132 provides feedback for the pseudo ground truth flow 130. For example, if a feature of warped feature map 142 (that was pseudo labeled) aligns or matches a feature of feature map 134, the pseudo label may be maintained as being correct. Further, if a feature of warped feature map 142 (that was pseudo labeled) does not align with or does not match a feature of feature map 134, the pseudo label may be removed as being an incorrect labeling. Stated again, optical flow for pseudo labeled features that substantially matches optical flow for features in features map 134 can be maintained in pseudo ground truth 130, and optical flow for pseudo labeled features that does not match optical flow for features in features map 134 is removed from pseudo ground truth 130. In this way, unlabeled features in the real world data can be labeled in this iterative process. In some embodiments, the process may then be repeated (136) with different pseudo labels for features.
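A feature-wise contrastive loss of the kind performed by contrastive loss process 132 can be illustrated with an InfoNCE-style objective over per-feature descriptors; InfoNCE, the temperature value, and mean-pooled descriptors are illustrative choices and are not recited as the claimed loss.

```python
import torch
import torch.nn.functional as F

def feature_wise_contrastive_loss(warped_feats, target_feats, temperature=0.07):
    """InfoNCE-style contrastive loss between per-feature descriptors.

    warped_feats, target_feats: (N, D) descriptors, one row per labeled or
    pseudo-labeled feature (e.g. mean-pooled over the feature's pixels).
    Matching rows are positives; all other rows serve as negatives.
    """
    warped = F.normalize(warped_feats, dim=1)
    target = F.normalize(target_feats, dim=1)
    logits = warped @ target.t() / temperature          # (N, N) similarity matrix
    labels = torch.arange(len(warped), device=warped.device)
    return F.cross_entropy(logits, labels)              # positives on the diagonal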
[0025] In some embodiments, after certain conditions, feedback data from contrastive loss portion 102 may be used as feedback 148 to adjust parameters of GRU 118. More specifically, as discussed above, GRU 118 may predict optical flow on a pixel-by-pixel basis, but contrastive loss portion 102 predicts optical flow on a feature-by-feature basis.
Accordingly, by combining these optical flow predictions, the optical flow predictions by GRU 118 are typically improved. Experimental results gathered by the inventors have confirmed this improvement.
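Combining feedback 140 and feedback 148 into a single parameter update can be sketched as a weighted sum of the two loss terms; the weighting factor and the choice of optimizer are assumptions of this sketch.

```python
def combined_update(optimizer, pixel_wise_flow_loss, feature_wise_loss, cf_weight=0.1):
    """One parameter update driven by both the pixel-wise and feature-wise losses.

    `optimizer` is assumed to be, e.g., torch.optim.Adam over the GRU (and other
    trainable) parameters; both losses are assumed to be scalar torch tensors
    attached to the computation graph.  `cf_weight` is an illustrative value.
    """
    total_loss = pixel_wise_flow_loss + cf_weight * feature_wise_loss
    optimizer.zero_grad()
    total_loss.backward()    # gradients from both terms reach the GRU parameters
    optimizer.step()
    return total_loss.detach()
```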
[0026] Figs. 2A-C illustrate a more complete flow process according to some embodiments. As illustrated in Figs. 2A-C, the process includes three phases: phase 202 is a synthetic dataset training phase, phase 204 is a real world dataset training phase, and phase 206 is a real world use phase.
[0027] Initially in Fig. 2A, synthetic data images 208 are provided to a system 210 similar to that disclosed above. As illustrated, a predicted flow 212 is determined and a flow loss 214 is determined. Feedback 216 is then provided to train system 210. The process continues until the flow loss is reduced, in which case system 210’ is formed.
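A flow loss such as flow loss 214 (or flow loss 124 of Fig. 1) can be illustrated with a simple average end-point error; the exact loss formula is an assumption of this sketch, as the embodiments do not commit to one.

```python
import torch

def pixel_wise_flow_loss(flow_pred, flow_gt, valid=None):
    """Average end-point error between predicted and ground-truth flow.

    flow_pred, flow_gt: (B, 2, H, W); valid: optional (B, H, W) float mask of
    pixels with known ground truth.
    """
    epe = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # per-pixel end-point error
    if valid is not None:
        return (epe * valid).sum() / valid.sum().clamp(min=1)
    return epe.mean()
```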
[0028] Next, in Fig. 2B, real world images 218 are provided to the trained system 210’. In this example, N^2 images are used, logically arranged in an NxN grid 216. Next, a K-Fold cross validation process is performed, selecting training subsets of real world images 218 as input into system 210’. As illustrated in Fig. 1, above, system 210’ uses contrastive loss functionality on features and pseudo features to train system 210’ based upon the training subsets of real world images. The trained system 210’ is then tested with real world images 214 that are not in the training subset (testing subset), to determine predicted flows 218. The predicted flows 218 are compared to the pseudo ground truth flows 220 for the testing subset to determine an error 222, as shown. The next training subset and testing subset from images 218 are used to determine errors 224, and the like. In various embodiments, after all K-Fold cross validations, the errors may be combined or averaged to determine an average error or feedback. The feedback may be used as feedback 226 to change one or more parameters of the GRU within system 210’, discussed above. The process may be repeated until the error feedback is reduced, in which case system 210” is formed.
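The K-Fold selection of training and testing subsets can be sketched as follows; the fold count, the example N = 10, and the use of scikit-learn are assumptions of this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative K-Fold split of the N^2 real-world images described above.
image_indices = np.arange(100)            # e.g. N = 10, so N^2 = 100 images
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(image_indices)):
    # train_idx selects the training subset fed to system 210';
    # test_idx selects the held-out images whose predicted flows are compared
    # against the pseudo ground truth flows to produce this fold's error.
    print(f"fold {fold}: {len(train_idx)} training images, {len(test_idx)} test images")
```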
[0029] In Fig. 2C, system 210” has been trained using synthetic datasets in Fig. 2A and then trained using real world data in Fig. 2B. Accordingly, new images 228 can be provided to system 210”, and system 210” can predict the optical flow 230. In some cases, if ground truth optical flow 232 is known, a flow loss 234 can again be determined, and fed back 236 into system 210”.
[0030] In some embodiments, predicted optical flow 230 may be used as input into other processes, such as a driver assist system, an autonomous driving system, an area mapping system, and the like.
[0031] Fig. 3 illustrates a functional block diagram of various embodiments of the present invention. More specifically, it is contemplated that computers (e.g. servers, laptops, streaming servers, virtual machines, etc.) may be implemented with a subset or superset of the below-illustrated components.
[0032] In Fig. 3, a computing device 300 may include some, but not necessarily all, of the following components: an applications processor / microprocessor 302, memory 304, a display 306, an image acquisition device 310, audio input / output devices 312, and the like. Data and communications from and to computing device 300 can be provided via a wired interface 314 (e.g. Ethernet, dock, plug, controller interface to peripheral devices); miscellaneous RF receivers, e.g. a GPS / Wi-Fi / Bluetooth interface / UWB 316; an NFC interface (e.g. antenna or coil) and driver 318; RF interfaces and drivers 320, and the like. Also included in some embodiments are physical sensors 322 (e.g. (MEMS-based) accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, bioimaging sensors, etc.).
[0033] In various embodiments, computing device 300 may be a computing device (e.g. Apple iPad, Microsoft Surface, Samsung Galaxy Note, an Android Tablet); a smartphone (e.g. Apple iPhone, Google Pixel, Samsung Galaxy S); a computer (e.g. netbook, laptop, convertible); a media player (e.g. Apple iPod); or the like. Typically, computing device 300 may include one or more processors 302. Such processors 302 may also be termed application processors, and may include a processor core, a video/graphics core, and other cores. Processors 302 may include processors from Apple (A14 Bionic, A15 Bionic), NVidia (Tegra), Intel (Core), Qualcomm (Snapdragon), Samsung (Exynos), ARM (Cortex), MIPS technology, a microcontroller, and the like. In some embodiments, processing accelerators may also be included, e.g. an AI accelerator, Google (Tensor processing unit), a GPU, or the like. It is contemplated that other existing and / or later-developed processors / microcontrollers may be used in various embodiments of the present invention.
[0034] In various embodiments, memory 304 may include different types of memory (including memory controllers), such as flash memory (e.g. NOR, NAND), SRAM, DDR SDRAM, or the like. Memory 304 may be fixed within computing device 300 and may also include removable memory (e.g. SD, SDHC, MMC, MINI SD, MICRO SD, SIM). The above are examples of computer readable tangible media that may be used to store embodiments of the present invention, such as computer-executable software code (e.g. firmware, application programs), security applications, application data, operating system data, databases, or the like. Additionally, in some embodiments, a secure device including secure memory and / or a secure processor is provided. It is contemplated that other existing and / or later-developed memory and memory technology may be used in various embodiments of the present invention.
[0035] In various embodiments, display 306 may be based upon a variety of later-developed or current display technology, including LED or OLED displays and / or status lights; touch screen technology (e.g. resistive displays, capacitive displays, optical sensor displays, electromagnetic resonance, or the like); and the like. Additionally, display 306 may include single touch or multiple-touch sensing capability. Any later-developed or conventional output display technology may be used for embodiments of the output display, such as LED, IPS, OLED, Plasma, electronic ink (e.g. electrophoretic, electrowetting, interferometric modulating), or the like. In various embodiments, the resolution of such displays and the resolution of such touch sensors may be set based upon engineering or non-engineering factors (e.g. sales, marketing). In some embodiments, display 306 may be integrated into computing device 300 or may be separate. In some embodiments, display 306 may be in virtually any size or resolution, such as a 3K resolution display, a microdisplay, one or more individual status or communication lights, e.g. LEDs, or the like.
[0036] In some embodiments of the present invention, acquisition device 310 may include one or more sensors, drivers, lenses, and the like. The sensors may be visible light, infrared, and / or UV sensitive sensors, ultrasonic sensors, or the like, that are based upon any later-developed or conventional sensor technology, such as CMOS, CCD, or the like. In some embodiments of the present invention, image recognition algorithms, image processing algorithms, or other software programs may be provided for operation upon processor 302 to process the acquired data. For example, such software may pair with enabled hardware to provide functionality such as: facial recognition (e.g. Face ID, head tracking, camera parameter control, or the like); fingerprint capture / analysis; blood vessel capture / analysis; iris scanning capture / analysis; otoacoustic emission (OAE) profiling and matching; and the like. [0037] In various embodiments, audio input / output 312 may include a microphone(s) / speakers. In various embodiments, voice processing and / or recognition software may be provided to applications processor 302 to enable the user to operate computing device 300 by stating voice commands. In various embodiments of the present invention, audio input 312 may provide user input data in the form of a spoken word or phrase, or the like, as described above. In some embodiments, audio input / output 312 may be integrated into computing device 300 or may be separate.
[0038] In various embodiments, wired interface 314 may be used to provide data or instruction transfers between computing device 300 and an external source, such as a computer, a remote server, a POS server, a local security server, a storage network, another computing device 300, an IMU, a video camera, or the like. Embodiments may include any later-developed or conventional physical interface / protocol, such as: USB, micro USB, mini USB, USB-C, Firewire, Apple Lightning connector, Ethernet, POTS, custom interface or dock, or the like. In some embodiments, wired interface 314 may also provide electrical power, or the like, to power source 324, or the like. In other embodiments, interface 314 may utilize close physical contact of device 300 to a dock for transfer of data, magnetic power, heat energy, light energy, laser energy, or the like. Additionally, software that enables communications over such networks is typically provided.
[0039] In various embodiments, a wireless interface 316 may also be provided to provide wireless data transfers between computing device 300 and external sources, such as computers, storage networks, headphones, microphones, cameras, IMUs, or the like. As illustrated in Fig. 3, wireless protocols may include Wi-Fi (e.g. IEEE 802.11 a/b/g/n, WiMAX), Bluetooth, Bluetooth Low Energy (BLE), IR, near field communication (NFC), ZigBee, Ultra-Wide Band (UWB), mesh communications, and the like.
[0040] GNSS (e.g. GPS) receiving capability may also be included in various embodiments of the present invention. As illustrated in Fig. 3, GPS functionality is included as part of wireless interface 316 merely for the sake of convenience, although in implementation, such functionality may be performed by circuitry that is distinct from the Wi-Fi circuitry, the Bluetooth circuitry, and the like. In various embodiments of the present invention, GPS receiving hardware may provide user input data in the form of current GPS coordinates, or the like, as described above.
[0041] Additional wireless communications may be provided via RF interfaces in various embodiments. In various embodiments, RF interfaces 320 may support any future-developed or conventional radio frequency communications protocol, such as CDMA-based protocols (e.g. WCDMA), GSM-based protocols, HSUPA-based protocols, 4G, 5G, or the like. In some embodiments, various functionality is provided upon a single IC package, for example, the Marvell PXA330 processor, and the like. As described above, data transmissions between a smart device and the services may occur via Wi-Fi, a mesh network, 4G, 5G, or the like. [0042] Although the functional blocks in Fig. 3 are shown as being separate, it should be understood that the various functionality may be regrouped into different physical devices. For example, some processors 302 may include Bluetooth functionality. Additionally, some functionality need not be included in some blocks; for example, GPS functionality need not be provided in a provider server.
[0043] In various embodiments, any number of future-developed, current, or custom operating systems may be supported, such as iPhone OS (e.g. iOS), Google Android, Linux, Windows, MacOS, or the like. In various embodiments of the present invention, the operating system may be a multi-threaded multi-tasking operating system. Accordingly, inputs and / or outputs from and to display 306 and inputs and / or outputs to physical sensors 322 may be processed in parallel processing threads. In other embodiments, such events or outputs may be processed serially, or the like. Inputs and outputs from other functional blocks, such as acquisition device 310 and physical sensors 322, may also be processed in parallel or serially in other embodiments of the present invention.
[0044] In some embodiments of the present invention, physical sensors 322 (e.g. MEMS-based) may include accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, imaging sensors (e.g. blood oxygen, heartbeat, blood vessel, iris data, etc.), thermometers, otoacoustic emission (OAE) testing hardware, and the like. The data from such sensors may be used to capture data associated with device 300 and a user of device 300. Such data may include physical motion data, pressure data, orientation data, or the like. Data captured by sensors 322 may be processed by software running upon processor 302 to determine characteristics of the user, e.g. gait, gesture performance data, or the like, and used for user authentication purposes. In some embodiments, sensors 322 may also provide physical outputs, e.g. vibrations, pressures, and the like. [0045] In some embodiments, a power supply 324 may be implemented with a battery (e.g. LiPo), an ultracapacitor, or the like, that provides operating electrical power to device 300. In various embodiments, any number of power generation techniques may be utilized to supplement or even replace power supply 324, such as solar power, liquid metal power generation, thermoelectric engines, RF harvesting (e.g. NFC), or the like.
[0046] Fig. 3 is representative of the components possible for a processing device. It will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention. Embodiments of the present invention may include at least some, but need not include all, of the functional blocks illustrated in Fig. 3. For example, a processing unit may include some of the functional blocks in Fig. 3, but it need not include an accelerometer or other physical sensor 322, an acquisition device 310, an internal power source 324, or the like.
[0047] In light of the above, other variations and adaptations will be apparent to one of ordinary skill in the art. For example, outputs from embodiments may be provided to an autonomous driving system which may steer a vehicle (e.g. car, drone) based upon the predicted flow data; outputs may be used to provide audible, visual, or haptic feedback to a user, for example, if a feature in a field of view is identified or labeled as a pedestrian; a product being manufactured may be identified for further inspection, for example, if the determined optical flow of the product does not match predefined criteria; a robot may be identified as requiring servicing, for example, if the determined optical flow does not match predefined criteria; and the like. Further, in other embodiments, other methods for segmenting real world data may be used besides K-fold cross-validation.
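As a purely illustrative sketch of the K-fold segmentation of real world data mentioned above, the following Python snippet splits a real world dataset into training and held-out folds; the file names, fold count, and training call are assumptions for illustration only and are not part of the specification.

```python
# Minimal K-fold sketch; "image_pairs" and the fold count are hypothetical.
from sklearn.model_selection import KFold

# Hypothetical list of real world image-pair files used for fine-tuning.
image_pairs = [f"real_pair_{i:04d}.npz" for i in range(100)]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold_index, (train_idx, val_idx) in enumerate(kfold.split(image_pairs)):
    train_subset = [image_pairs[i] for i in train_idx]
    val_subset = [image_pairs[i] for i in val_idx]
    # Fine-tune the flow network on train_subset, then measure the optical flow
    # loss on val_subset; the per-fold losses may be combined into a single loss.
    print(f"fold {fold_index}: {len(train_subset)} train / {len(val_subset)} val pairs")
```

Any other splitting strategy, as noted above, could be substituted for the K-fold split shown here.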
[0048] The block diagrams of the architecture and flow charts are grouped for ease of understanding. However, it should be understood that combinations of blocks, additions of new blocks, re-arrangement of blocks, and the like are contemplated in alternative embodiments of the present invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.
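By way of a non-limiting illustration of the training procedure summarized above and recited in the claims below, the following PyTorch sketch combines a pixel-wise flow loss with a feature-wise contrastive loss computed between a first feature map and a warped second feature map. The encoder, flow predictor, warping convention, temperature, and loss weighting shown here are assumptions made for illustration and are not taken from the specification.

```python
# Minimal training-step sketch; encoder, flow_predictor, and all hyperparameters
# are placeholders, not the actual networks of this application.
import torch
import torch.nn.functional as F


def warp_features(feat2, flow):
    """Backward-warp the second feature map using the predicted flow (bilinear)."""
    n, c, h, w = feat2.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat2.device),
        torch.arange(w, device=feat2.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()               # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow                         # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(feat2, grid, align_corners=True)


def feature_contrastive_loss(feat1, warped_feat2, temperature=0.07):
    """InfoNCE-style loss: a location in feat1 should match the same location in
    the warped second feature map (positive) and differ from other locations."""
    n, c, h, w = feat1.shape
    a = F.normalize(feat1.flatten(2), dim=1)                  # (N, C, H*W)
    b = F.normalize(warped_feat2.flatten(2), dim=1)           # (N, C, H*W)
    logits = torch.bmm(a.transpose(1, 2), b) / temperature    # (N, H*W, H*W)
    target = torch.arange(h * w, device=feat1.device).repeat(n)
    return F.cross_entropy(logits.reshape(n * h * w, h * w), target)


def training_step(encoder, flow_predictor, optimizer, img1, img2, flow_gt,
                  contrastive_weight=0.1):
    feat1 = encoder(img1)                                     # first feature map
    feat2 = encoder(img2)                                     # second feature map
    flow_pred = flow_predictor(feat1, feat2)                  # e.g. GRU-based updates
    pixel_loss = (flow_pred - flow_gt).abs().mean()           # pixel-wise flow loss
    warped_feat2 = warp_features(feat2, flow_pred)
    feature_loss = feature_contrastive_loss(feat1, warped_feat2)
    total = pixel_loss + contrastive_weight * feature_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return pixel_loss.item(), feature_loss.item()
```

Under the two-stage schedule recited in claims 18 through 20, such a step could first be run over a synthetic dataset and then over K-fold subsets of a real world dataset; the pseudo-feature refinement of claims 3 through 5 could further wrap the feature-wise loss, keeping a newly labeled pseudo feature only when the revised feature-wise loss decreases.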

Claims

CLAIMS
We claim:
1. A method for a computing system for estimating optical flow comprising: determining, in a computing system, a first features map in response to a first image, and a second features map in response to a second image map; implementing, in the computing system, a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map; determining, in the computing system, a warped feature map in response to the second image map; implementing, in the computing system, a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map; determining, in the computing system, a pixel-wise flow loss in response to the pixel-wise flow prediction and in response to pixel-wise ground truth data; and modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
2. The method of claim 1 wherein the determining, in the computing system, the warped feature map comprises determining, in the computing system, the warped feature map also in response to a feature-wise ground truth data; and wherein the feature-wise ground truth data comprises pre-identified features.
3. The method of claim 2 further comprising: labeling, in the computing system, a pseudo feature in the feature-wise ground truth data; determining, in the computing system, feature-wise pseudo ground truth data in response to labeling of the pseudo feature; determining, in the computing system, a revised warped feature map in response to the feature-wise pseudo ground truth data; and implementing, in the computing system, the feature-wise contrastive loss function to determine a revised feature-wise loss in response to first features in the first image map and third features in the revised warped feature map.
4. The method of claim 3 further comprising: determining, in the computing system, whether the revised feature-wise loss is less than the feature-wise loss; and wherein the modifying, in the computing system, parameters of the GRU comprises: modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the revised feature-wise loss, and in response to the revised feature-wise loss being determined to be less than the feature-wise loss.
5. The method of claim 3 further comprising: determining, in the computing system, whether the feature-wise loss is less than the revised feature-wise loss; and removing, in the computing system, labeling of the pseudo feature in the feature-wise ground truth data, and in response to the feature-wise loss being determined to be less than the revised feature-wise loss.
6. The method of claim 1 further comprising: determining a correlation volume in response to the first features map and the second features map; and wherein the implementing, in the computing system, the gated recurrent unit (GRU) comprises implementing, in the computing system, the gated recurrent unit (GRU) to determine the pixel-wise flow prediction in response to the first features map and to the correlation volume.
7. The method of claim 6 wherein the correlation volume comprises parameters selected from a group consisting of: location of a region on an image, a size of a region within an image, a correlation parameter between images, and temporal data.
8. The method of claim 1 wherein the feature-wise loss is associated with a plurality of pixels; and wherein the pixel-wise flow loss is associated with a pixel from the plurality of pixels.
9. A computing system for estimating optical flow comprising: a pixel-based analysis system configured to determine a first features map in response to a first image, and a second features map in response to a second image map, wherein the pixel-based analysis system comprises a gated recurrent unit (GRU) configured to determine a pixel-wise flow prediction in response to the first features map and the second features map; and a feature-based analysis system coupled to the pixel-based analysis system, wherein the feature-based analysis system is configured to determine a warped feature map in response to the second image map, wherein the feature-based analysis system comprises a contrastive loss unit configured to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map; wherein the pixel-based analysis system is configured to modify parameters of the GRU in response to a pixel-wise flow loss and to the feature-wise loss.
10. The computing system of claim 9 wherein the feature-based analysis system is configured to determine the warped feature map in response to the second feature map and a feature-wise ground truth data; and wherein the feature-wise ground truth data comprises pre-identified features.
11. The computing system of claim 10 wherein the feature-based analysis system is configured to label a pseudo feature in the feature-wise ground truth data; wherein the feature-based analysis system is configured to determine feature-wise pseudo ground truth data in response to labeling of the pseudo feature; wherein the feature-based analysis system is configured to determine a revised warped feature map in response to the feature-wise pseudo ground truth data; and wherein the contrastive loss unit is configured to determine a revised feature-wise loss in response to first features in the first image map and third features in the revised warped feature map.
12. The computing system of claim 11 further comprising: wherein the feature-based analysis system is configured to determine whether the revised feature-wise loss is less than the feature-wise loss; and wherein the pixel-based analysis system is configured to modify parameters of the GRU in response to the pixel-wise flow loss and to the revised feature-wise loss, and in response to the revised feature-wise loss being determined to be less than the feature-wise loss.
13. The computing system of claim 11 further comprising: wherein the feature-based analysis system is configured to determine whether the feature-wise loss is less than the revised feature-wise loss; and wherein the feature-based analysis system is configured to remove labeling of the pseudo feature in the feature-wise ground truth data, and in response to the feature-wise loss being determined to be less than the revised feature-wise loss.
14. The computing system of claim 9 wherein the pixel-based analysis system is configured to determine a correlation volume in response to the first features map and the second features map; and wherein the gated recurrent unit (GRU) is configured to determine the pixel-wise flow prediction in response to the first features map, the second features map and the correlation volume.
15. The computing system of claim 14 wherein the correlation volume comprises parameters selected from a group consisting of: location of a region on an image, a size of a region within an image, a correlation parameter between images, and temporal data.
16. The computing system of claim 9 wherein the feature-wise loss is associated with a plurality of pixels; and wherein the pixel-wise flow loss is associated with a pixel from the plurality of pixels.
17. The computing system of claim 9 wherein the pixel-based analysis system comprises a recurrent all-pairs field transforms (RAFT) system.
18. A method comprising: operating an optical flow prediction system comprising contrastive loss functionality in response to a synthetic dataset to determine a first optical flow loss; adjusting parameters of the optical flow prediction system in response to the first optical flow loss; thereafter operating the optical flow prediction system in response to a real world dataset to determine a second optical flow loss; and adjusting parameters of the optical flow prediction system in response to the second optical flow loss.
19. The method of claim 18 wherein the operating the optical flow prediction system in response to the real world dataset comprises: determining a first subset of real world data from the real world dataset; operating the optical flow prediction system in response to the first subset of real world data to determine a third optical flow loss; determining a second subset of real world data from the real world dataset; operating the optical flow prediction system in response to the second subset of real world data to determine a fourth optical flow loss; and determining the second optical flow loss in response to the third optical flow loss and the fourth optical flow loss.
20. The method of claim 19 wherein the first subset of real world data comprises a plurality of real world images including a first real world image; wherein the first real world image comprises a plurality of manually labeled features; and wherein the operating the optical flow prediction system in response to the first subset of real world data to determine the third optical flow loss comprises automatically labeling a pseudo feature in the first real world image.
PCT/US2023/069357 2022-10-12 2023-06-29 Methods and apparatus for optical flow estimation with contrastive learning WO2024081455A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263415358P 2022-10-12 2022-10-12
US63/415,358 2022-10-12

Publications (1)

Publication Number Publication Date
WO2024081455A1 true WO2024081455A1 (en) 2024-04-18

Family

ID=90670160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/069357 WO2024081455A1 (en) 2022-10-12 2023-06-29 Methods and apparatus for optical flow estimation with contrastive learning

Country Status (1)

Country Link
WO (1) WO2024081455A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226892A1 (en) * 2012-02-29 2013-08-29 Fluential, Llc Multimodal natural language interface for faceted search
US20190294970A1 (en) * 2018-03-23 2019-09-26 The Governing Council Of The University Of Toronto Systems and methods for polygon object annotation and a method of training an object annotation system
US20210240761A1 (en) * 2019-01-31 2021-08-05 Shenzhen Sensetime Technology Co., Ltd. Method and device for cross-modal information retrieval, and storage medium
US20200327377A1 (en) * 2019-03-21 2020-10-15 Illumina, Inc. Artificial Intelligence-Based Quality Scoring

Similar Documents

Publication Publication Date Title
US11798271B2 (en) Depth and motion estimations in machine learning environments
US10198823B1 (en) Segmentation of object image data from background image data
US11526713B2 (en) Embedding human labeler influences in machine learning interfaces in computing environments
US9965865B1 (en) Image data segmentation using depth data
US9442564B1 (en) Motion sensor-based head location estimation and updating
US9911395B1 (en) Glare correction via pixel processing
US9652031B1 (en) Trust shifting for user position detection
US9354709B1 (en) Tilt gesture detection
US9832452B1 (en) Robust user detection and tracking
US10055013B2 (en) Dynamic object tracking for user interfaces
CN116912514A (en) Neural network for detecting objects in images
US20120026335A1 (en) Attribute-Based Person Tracking Across Multiple Cameras
US11727576B2 (en) Object segmentation and feature tracking
US11600039B2 (en) Mechanism for improved light estimation
JP2021061573A (en) Imaging system, method for imaging, imaging system for imaging target, and method for processing intensity image of dynamic scene acquired using template, and event data acquired asynchronously
US11381743B1 (en) Region of interest capture for electronic devices
US11598976B1 (en) Object recognition for improving interfaces on an eyewear device and other wearable and mobile devices
US20220075994A1 (en) Real-time facial landmark detection
KR20240024277A (en) Gaze classification
Nafea et al. A Review of Lightweight Object Detection Algorithms for Mobile Augmented Reality
US20190096073A1 (en) Histogram and entropy-based texture detection
WO2024081455A1 (en) Methods and apparatus for optical flow estimation with contrastive learning
US20230115371A1 (en) Efficient vision perception
WO2024049513A1 (en) Methods and apparatus for forecasting collisions using egocentric video data
WO2023044661A1 (en) Learning reliable keypoints in situ with introspective self-supervision

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878062

Country of ref document: EP

Kind code of ref document: A1