WO2024081455A1 - Methods and apparatus for optical flow estimation with contrastive learning - Google Patents

Methods and apparatus for optical flow estimation with contrastive learning

Info

Publication number
WO2024081455A1
WO2024081455A1 · PCT/US2023/069357
Authority
WO
WIPO (PCT)
Prior art keywords
feature
wise
response
loss
map
Prior art date
Application number
PCT/US2023/069357
Other languages
French (fr)
Inventor
Zhiqi ZHANG
Pan JI
Nitin Bansal
Changjiang Cai
Qingan Yan
Xiangyu Xu
Huangying ZHAN
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2024081455A1 publication Critical patent/WO2024081455A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method for a computing system includes determining a first features map in response to a first image and a second features map in response to a second image, implementing a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map, determining a warped feature map in response to the second image map, implementing a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped features map, determining a pixel-wise flow loss in response to the pixel-wise flow prediction and to pixel-wise ground truth data, and modifying parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.

Description

METHODS AND APPARATUS FOR OPTICAL FLOW ESTIMATION WITH
CONTRASTIVE LEARNING
CROSS-REFERENCE TO RELATED CASES
[0001] The present application is a non-provisional of and claims priority to U.S. App. No. 63/415,358, filed October 12, 2022. That application is herein incorporated by reference for all purposes.
BACKGROUND
[0002] The present invention relates to object movement prediction. More specifically, the invention relates to methods and apparatus for more accurate optical flow analysis.
[0003] Optical flow analysis / estimation is a critical component in several high-level vision problems such as action recognition, video segmentation, autonomous driving, editing, and the like. Traditional methods for optical flow estimation involve formulating the problem as an optimization problem using hand-crafted features, which can be empirical. Optical flow estimation is typically achieved by attempting to maximize the visual similarity between adjacent frames through energy minimization. Deep neural networks have been shown to be effective in optical flow estimation in terms of accuracy. These methods, however, still have limitations, such as difficulty handling occlusions and small, fast-moving objects, capturing global motion, and rectifying and recovering from early errors.
[0004] To improve the accuracy of end-to-end optical flow networks, synthetic datasets, not real world datasets, are often used for pre-training. Synthetic datasets are typically easier to generate and include labels or identifiers for all features appearing on the images, e.g. vehicle 1, pedestrian 2, building 5, etc., i.e. ground truth data. In contrast, real world data typically include very few labels, as such labeling has to be performed manually. Because large-scale data are necessary to train deep learning networks, synthetic datasets, i.e. computer-generated scenes and images, are often used for model training. A drawback to this strategy is that the resulting deep learning models have a tendency to overfit to the synthetic training datasets, which subsequently show performance drops on real-world data.
[0005] In light of the above, what are desired are solutions that address the above challenges, with reduced drawbacks.
SUMMARY
[0006] The present invention relates to object movement prediction. More specifically, the invention relates to methods and apparatus for more accurate optical flow analysis. [0007] Embodiments disclose a semi-supervised framework for improving the determination of optical flow. More specifically, real world datasets are used for optical flow estimation using automatically labeled features on the ground truth, real world data. Features may span contiguous groups or separate groups of pixels on an image. In operation, features within ground truth data (real world data) are automatically labeled with pseudo feature labels to form pseudo ground truth data. Subsequently, pseudo labels for features are typically maintained when the pseudo feature labels help reduce the optical flow loss, and pseudo labels for features may be removed or deleted, typically when the pseudo feature labels increase the optical flow loss. More specifically, pseudo labels may be assigned to features to form pseudo ground truth data. This pseudo ground truth data is then used to predict feature-wise flow for features in a first temporal feature map. The predicted feature-wise flow is termed a warped feature map, herein. Embodiments then use contrastive flow loss determinations upon the warped feature map and a second temporal feature map to determine feature-to-feature correspondence, or lack thereof. In some cases, the feature-wise contrastive flow loss may then be fed back to help determine whether the pseudo labels should be modified or maintained.
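For illustration only, the maintain-or-remove rule for pseudo labels described above can be sketched as follows. The function name, the dictionary representation of labels, and the greedy acceptance test are assumptions made for this sketch and are not recited by the embodiments.

```python
# Illustrative sketch of the keep-or-drop rule for pseudo feature labels.
# `feature_wise_loss` is assumed to be a callable that returns the contrastive
# flow loss for a given set of pseudo labels; the greedy test is a simplification.

def update_pseudo_labels(pseudo_labels, feature_wise_loss, baseline_loss):
    """Keep pseudo labels that reduce the contrastive flow loss, drop the rest."""
    kept = {}
    for label_id, label in pseudo_labels.items():
        candidate = dict(kept)
        candidate[label_id] = label
        loss_with_label = feature_wise_loss(candidate)
        if loss_with_label <= baseline_loss:
            kept[label_id] = label           # label helped reduce the loss: maintain it
            baseline_loss = loss_with_label
        # otherwise the label increased the loss and is removed (not kept)
    return kept, baseline_loss
```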
[0008] In some embodiments, the semi-supervised framework described herein provides optical flow feedback based upon feature-to-feature comparison of a feature map. This optical flow feedback may be combined with other optical flow systems, such as the recurrent all-pairs field transforms (RAFT) architecture (Teed 2020). Conventional optical flow systems such as RAFT typically perform a pixel-by-pixel optical flow analysis. By combining this contrastive loss function with the RAFT optical flow analysis, the optical flow predictions are improved.
[0009] According to one aspect, a method for a computing system for estimating optical flow is disclosed. A technique may include determining, in a computing system, a first features map in response to a first image, and a second features map in response to a second image map, and implementing, in the computing system, a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map. A process may include determining, in the computing system, a warped feature map in response to the second image map, and implementing, in the computing system, a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped features map. A method may include determining, in the computing system, a pixel-wise flow loss in response to the pixel-wise flow prediction and in response to pixel-wise ground truth data, and modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
[0010] According to another aspect, a computing system for estimating optical flow is disclosed. One device may include a pixel-based analysis system configured to determine a first features map in response to a first image, and a second features map in response to a second image map, wherein the pixel-based analysis system comprises a gated recurrent unit (GRU) configured to determine a pixel-wise flow prediction in response to the first features map and the second features map. An apparatus may include a feature-based analysis system coupled to the pixel-based optical flow analysis system, wherein the feature-based analysis system is configured to determine a warped feature map in response to the second image map, wherein the feature-based analysis system comprises a contrastive loss unit configured to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map. In some systems, a pixel-based analysis system is configured to modify parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
[0011] According to yet another aspect, a method is disclosed. One process may include operating an optical flow prediction system comprising contrastive loss functionality in response to a synthetic dataset to determine a first optical flow loss, and adjusting parameters of the optical flow prediction system in response to the first optical flow loss. A technique may include operating the optical flow prediction system in response to a real world dataset to determine a second optical flow loss, and adjusting parameters of the optical flow prediction system in response to the second optical flow loss.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] In order to more fully understand the present invention, reference is made to the accompanying drawings. Understanding that these drawings are not to be considered limitations in the scope of the invention, the presently described embodiments and the presently understood best mode of the invention are described with additional detail through use of the accompanying drawings in which:
[0013] Fig. 1 illustrates a functional block diagram of some embodiments of the present invention;
[0014] Figs. 2A-C illustrate a process diagram according to various embodiments of the present invention; and
[0015] Fig. 3 illustrates a system diagram according to various embodiments of the present invention.
DETAILED DESCRIPTION
[0016] Fig. 1 illustrates a logical block diagram of an embodiment of the present invention. More specifically, a system 100 includes a contrastive loss portion 102 and a recurrent all-pairs field transforms (RAFT) portion 104. Inputs into system 100 include a series of images 106. In some embodiments, input images 106 can include synthetic images, e.g. datasets including labeled features, where ground truth flow data 108 is known. In other embodiments, input images 106 can include real world images, e.g. datasets where some features are manually labeled and many features are unlabeled. In such cases, ground truth flow data 108 is not known for unlabeled features. System 100 may be termed RAFT-CF, herein.
[0017] In various embodiments, RAFT portion 104 includes three major components: a feature encoder 110 that determines feature maps 112 that store feature vectors for each pixel in input images 106; a correlation portion 114 that determines a four-dimensional (4D) correlation volume 116 that includes displacements based upon feature maps 112; and a gated recurrent unit (GRU) 118 that determines an optical flow prediction 120 for feature map 122. Optical flow prediction 120 is compared to ground truth flow data 108 to determine flow loss 124.
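As a concrete illustration of the correlation portion, an all-pairs 4D correlation volume of the general kind used by RAFT-style systems can be computed as below (PyTorch). The normalization by the square root of the channel count is an assumption borrowed from common open implementations, not a limitation of the embodiments.

```python
import torch

def correlation_volume(fmap1, fmap2):
    """Build an all-pairs correlation volume from two feature maps.

    fmap1, fmap2: tensors of shape (B, C, H, W).  Returns a (B, H, W, H, W)
    volume of dot products, i.e. a 4D correlation volume per batch item.
    This is a generic sketch, not the specific patented computation.
    """
    b, c, h, w = fmap1.shape
    f1 = fmap1.view(b, c, h * w)
    f2 = fmap2.view(b, c, h * w)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / c ** 0.5  # normalized dot products
    return corr.view(b, h, w, h, w)
```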
[0018] In operation, feature encoder 110 receives images 106 having a height, width, and color depth (e.g. HxWx3). Next, the positions of pixels (e.g. x and y positions) are determined and stored in the form of coordinate frames 126 (e.g. HxWx5). Next, encoders 128 are used to identify features, e.g. specific colored pixels, from coordinate frames 126. In various embodiments, the features may be stored as feature vectors in feature maps 112 at different resolutions.
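For illustration, HxWx5 coordinate frames of the kind described above can be formed by appending pixel-position channels to the RGB input; normalizing the coordinates to [0, 1] is an assumption of this sketch.

```python
import torch

def add_coordinate_channels(images):
    """Append normalized x / y position channels to an RGB batch.

    images: (B, 3, H, W).  Returns (B, 5, H, W), matching the HxWx5
    coordinate frames described above.
    """
    b, _, h, w = images.shape
    ys = torch.linspace(0, 1, h, device=images.device)
    xs = torch.linspace(0, 1, w, device=images.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing='ij')
    coords = torch.stack([grid_x, grid_y]).expand(b, -1, -1, -1)  # (B, 2, H, W)
    return torch.cat([images, coords], dim=1)
```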
[0019] In various embodiments, correlation layer 114 receives feature maps 112 and determines a multi-dimensional (e.g. four-dimensional, 4D) correlation volume 116. This 4D correlation volume 116 is typically determined by comparing visual similarity between pixels in feature maps 112 and then determining displacements of pixels in feature maps 112. The correlation volume 116 is then inputted to gated recurrent unit (GRU) 118, which warps an input feature map 146 and outputs optical flow prediction 120. Optical flow prediction 120 thus represents the predicted optical flow. [0020] In various embodiments, as illustrated, optical flow prediction 120 is then compared to ground truth flow data 108 to determine flow loss 124. Flow loss 124 typically indicates how well GRU 118 can predict the pixel-by-pixel optical flow. Flow loss 124 can be fed back 140 into GRU 118 to change its parameters. In some cases, if flow loss 124 is small, the change in parameters may be smaller than if flow loss 124 is large.
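For illustration, a minimal convolutional GRU cell of the general kind used for iterative flow refinement is sketched below; the channel sizes, gating layout, and flow head are assumptions of this sketch and are not the claimed GRU 118.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell that refines a hidden state and emits a flow delta.

    h: (B, hidden_dim, H, W) hidden state; x: (B, input_dim, H, W) correlation /
    context input.  Channel sizes are illustrative assumptions.
    """
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.flow_head = nn.Conv2d(hidden_dim, 2, 3, padding=1)  # per-pixel (dx, dy)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))             # update gate
        r = torch.sigmoid(self.convr(hx))             # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        h = (1 - z) * h + z * q                       # new hidden state
        return h, self.flow_head(h)                   # hidden state and flow update
```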
[0021] RAFT portion 104 is typically limited to determining predicted optical flow for situations where ground truth flow is known, i.e. with synthetic images and synthetic datasets. Additionally, it is typically limited to determining optical flow vectors on a pixel-by-pixel basis. As mentioned above, RAFT portion 104 may sometimes overfit synthetic datasets; thus, when trained, RAFT portion 104 may have trouble determining accurate optical flow vectors when provided with real world datasets.
[0022] In various embodiments, contrastive loss portion 102 is used to supplement RAFT portion 104. Contrastive loss portion 102 includes a feature warp process 144 that receives a features map 128 and a ground truth flow 130 and outputs a warped feature map 142. A contrastive loss process 132 receives warped feature map 142 and feature map 134 and provides contrastive loss feedback 136. In some cases, feature map 128 may be based upon synthetic datasets, and in other cases, feature map 128 may be based upon real world data. In the case of real world data, contrastive loss feedback 136 may be used to fine tune ground truth flow 130, as discussed below.
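The feature warp process 144 can be illustrated with standard bilinear backward warping; the use of grid_sample and a flow field in pixel units is an assumption of this sketch rather than the claimed procedure.

```python
import torch
import torch.nn.functional as F

def warp_feature_map(fmap, flow):
    """Warp a feature map with a dense flow field (backward warping).

    fmap: (B, C, H, W); flow: (B, 2, H, W) in pixels.  Uses bilinear sampling.
    """
    b, _, h, w = fmap.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=fmap.device, dtype=fmap.dtype),
        torch.arange(w, device=fmap.device, dtype=fmap.dtype),
        indexing='ij')
    grid_x = xs + flow[:, 0]                      # displaced x coordinates
    grid_y = ys + flow[:, 1]                      # displaced y coordinates
    # normalize to [-1, 1] as required by grid_sample
    grid = torch.stack([2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(fmap, grid, mode='bilinear', align_corners=True)
```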
[0023] In operation, ground truth flow 130 may specify an optical flow for manually labeled features from real world data, e.g. a house, a car, a bicycle, or the like. Additionally, in some cases, ground truth flow 130 may specify a flow for predicted features or features guessed from the images, e.g. a first blob of pixels may be guessed and labeled to be a pedestrian, a second blob of similar colored pixels may be guessed and labeled as a ball, and the like. Such features are called pseudo labels, and ground truth flow 108 with pseudo labels is called pseudo ground truth flow.
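For illustration, blobs of pixels can be grouped into candidate pseudo features with a connected-components pass; the scoring source, threshold, and minimum blob size below are assumptions of this sketch, not a required labeling method.

```python
import numpy as np
from scipy import ndimage

def guess_pseudo_labels(segmentation_scores, threshold=0.5, min_pixels=50):
    """Group high-scoring pixels into blobs and treat each blob as a pseudo feature.

    segmentation_scores: (H, W) array of per-pixel scores from any off-the-shelf
    detector or color/texture heuristic (an assumption of this sketch).
    """
    mask = segmentation_scores > threshold
    labeled, num_blobs = ndimage.label(mask)          # connected-component blobs
    pseudo_labels = {}
    for blob_id in range(1, num_blobs + 1):
        pixels = np.argwhere(labeled == blob_id)      # (row, col) pixel coordinates
        if len(pixels) >= min_pixels:                 # ignore tiny, noisy blobs
            pseudo_labels[blob_id] = pixels
    return pseudo_labels
```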
[0024] In various embodiments, the optical flows of the labeled features and the pseudo labeled features in ground truth flow 130 are used to process features map 128. More specifically, within feature warp process 144, ground truth flow 130 warps features map 128 and outputs warped feature map 142. Next, a contrastive loss process 132 is performed by contrasting warped feature map 142 to feature map 134. In various embodiments, the contrasting is based upon features (e.g. groups of pixels), not individual pixels. Contrastive loss process 132 provides feedback for the pseudo ground truth flow 130. For example, if a feature of warped feature map 142 (that was pseudo labeled) aligns or matches a feature of feature map 134, the pseudo label may be maintained as being correct. Further, if a feature of warped feature map 142 (that was pseudo labeled) does not align with or does not match a feature of feature map 134, the pseudo label may be removed as being an incorrect labeling. Stated again, optical flow for pseudo labeled features that substantially matches optical flow for features in features map 134 can be maintained in pseudo ground truth 130, and optical flow for pseudo labeled features that does not match optical flow for features in features map 134 is removed from pseudo ground truth 130. In this way, unlabeled features in the real world data can be labeled in this iterative process. In some embodiments, the process may then be repeated (136) with different pseudo labels for features.
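A feature-wise contrastive loss of the kind performed by contrastive loss process 132 can be illustrated with an InfoNCE-style objective over per-feature descriptors; InfoNCE, the temperature value, and mean-pooled descriptors are illustrative choices and are not recited as the claimed loss.

```python
import torch
import torch.nn.functional as F

def feature_wise_contrastive_loss(warped_feats, target_feats, temperature=0.07):
    """InfoNCE-style contrastive loss between per-feature descriptors.

    warped_feats, target_feats: (N, D) descriptors, one row per labeled or
    pseudo-labeled feature (e.g. mean-pooled over the feature's pixels).
    Matching rows are positives; all other rows serve as negatives.
    """
    warped = F.normalize(warped_feats, dim=1)
    target = F.normalize(target_feats, dim=1)
    logits = warped @ target.t() / temperature          # (N, N) similarity matrix
    labels = torch.arange(len(warped), device=warped.device)
    return F.cross_entropy(logits, labels)              # positives on the diagonal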
[0025] In some embodiments, after certain conditions, feedback data from contrastive loss portion 102 may be used as feedback 148 to adjust parameters of GRU 118. More specifically, as discussed above, GRU 118 may predict optical flow on a pixel-by-pixel basis, but contrastive loss portion 102 predicts optical flow on a feature-by-feature basis.
Accordingly, by combining these optical flow predictions, the optical flow predictions by GRU 118 are typically improved. Experimental results gathered by the inventors have confirmed this improvement.
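Combining feedback 140 and feedback 148 into a single parameter update can be sketched as a weighted sum of the two loss terms; the weighting factor and the choice of optimizer are assumptions of this sketch.

```python
def combined_update(optimizer, pixel_wise_flow_loss, feature_wise_loss, cf_weight=0.1):
    """One parameter update driven by both the pixel-wise and feature-wise losses.

    `optimizer` is assumed to be, e.g., torch.optim.Adam over the GRU (and other
    trainable) parameters; both losses are assumed to be scalar torch tensors
    attached to the computation graph.  `cf_weight` is an illustrative value.
    """
    total_loss = pixel_wise_flow_loss + cf_weight * feature_wise_loss
    optimizer.zero_grad()
    total_loss.backward()    # gradients from both terms reach the GRU parameters
    optimizer.step()
    return total_loss.detach()
```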
[0026] Figs. 2A-C illustrate a more complete flow process according to some embodiments. As illustrated in Figs. 2A-C, the process includes three phases: phase 202 is a synthetic dataset training phase, phase 204 is a real world dataset training phase, and phase 206 is a real world use phase.
[0027] Initially in Fig. 2A, synthetic data images 208 are provided to a system 210 similar to that disclosed above. As illustrated, a predicted flow 212 is determined and a flow loss 214 is determined. Feedback 216 is then provided to train system 210. The process continues until the flow loss is reduced, in which case system 210’ is formed.
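A flow loss such as flow loss 214 (or flow loss 124 of Fig. 1) can be illustrated with a simple average end-point error; the exact loss formula is an assumption of this sketch, as the embodiments do not commit to one.

```python
import torch

def pixel_wise_flow_loss(flow_pred, flow_gt, valid=None):
    """Average end-point error between predicted and ground-truth flow.

    flow_pred, flow_gt: (B, 2, H, W); valid: optional (B, H, W) float mask of
    pixels with known ground truth.
    """
    epe = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # per-pixel end-point error
    if valid is not None:
        return (epe * valid).sum() / valid.sum().clamp(min=1)
    return epe.mean()
```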
[0028] Next, in Fig. 2B, real world images 218 are provided to the trained system 210’. In this example, N^2 images are used, logically arranged in an NxN grid 216. Next, a K-Fold cross validation process is performed, selecting training subsets of real world images 218 as input into system 210’. As illustrated in Fig. 1, above, system 210’ uses contrastive loss functionality on features and pseudo features to train system 210’ based upon the training subsets of real world images. The trained system 210’ is then tested with real world images 214 that are not in the training subset (testing subset), to determine predicted flows 218. The predicted flows 218 are compared to the pseudo ground truth flows 220 for the testing subset to determine an error 222, as shown. The next training subset and testing subset from images 218 are used to determine errors 224, and the like. In various embodiments, after all K-Fold cross validations, the errors may be combined or averaged to determine an average error or feedback. The feedback may be used as feedback 226 to change one or more parameters of the GRU within system 210’, discussed above. The process may be repeated until the error feedback is reduced, in which case system 210” is formed.
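The K-Fold selection of training and testing subsets can be sketched as follows; the fold count, the example N = 10, and the use of scikit-learn are assumptions of this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative K-Fold split of the N^2 real-world images described above.
image_indices = np.arange(100)            # e.g. N = 10, so N^2 = 100 images
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(image_indices)):
    # train_idx selects the training subset fed to system 210';
    # test_idx selects the held-out images whose predicted flows are compared
    # against the pseudo ground truth flows to produce this fold's error.
    print(f"fold {fold}: {len(train_idx)} training images, {len(test_idx)} test images")
```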
[0029] In Fig. 2C, system 210” has been trained using synthetic datasets in Fig. 2A and then trained using real world data in Fig. 2B. Accordingly, new images 228 can be provided to system 210”, and system 210” can predict the optical flow 230. In some cases, if ground truth optical flow 232 is known, a flow loss 234 can again be determined, and fed back 236 into system 210”.
[0030] In some embodiments, predicted optical flow 230 may be used as input into other processes, such as a driver assist system, an autonomous driving system, an area mapping system, and the like.
[0031] Fig. 3 illustrates a functional block diagram of various embodiments of the present invention. More specifically, it is contemplated that computers (e.g. servers, laptops, streaming servers, virtual machines, etc.) may be implemented with a subset or superset of the below-illustrated components.
[0032] In Fig. 3, a computing device 300 may include some, but not necessarily all, of the following components: an applications processor / microprocessor 302, memory 304, a display 306, an image acquisition device 310, audio input / output devices 312, and the like. Data and communications from and to computing device 300 can be provided via a wired interface 314 (e.g. Ethernet, dock, plug, controller interface to peripheral devices); miscellaneous RF receivers, e.g. a GPS / Wi-Fi / Bluetooth interface / UWB 316; an NFC interface (e.g. antenna or coil) and driver 318; RF interfaces and drivers 320, and the like. Also included in some embodiments are physical sensors 322 (e.g. (MEMS-based) accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, bioimaging sensors, etc.).
[0033] In various embodiments, computing device 300 may be a computing device (e.g. Apple iPad, Microsoft Surface, Samsung Galaxy Note, an Android Tablet); a smartphone (e.g. Apple iPhone, Google Pixel, Samsung Galaxy S); a computer (e.g. netbook, laptop, convertible); a media player (e.g. Apple iPod); or the like. Typically, computing device 300 may include one or more processors 302. Such processors 302 may also be termed application processors, and may include a processor core, a video/graphics core, and other cores. Processors 302 may include processors from Apple (A14 Bionic, A15 Bionic), NVidia (Tegra), Intel (Core), Qualcomm (Snapdragon), Samsung (Exynos), ARM (Cortex), MIPS technology, a microcontroller, and the like. In some embodiments, processing accelerators may also be included, e.g. an AI accelerator, Google (Tensor processing unit), a GPU, or the like. It is contemplated that other existing and / or later-developed processors / microcontrollers may be used in various embodiments of the present invention.
[0034] In various embodiments, memory 304 may include different types of memory (including memory controllers), such as flash memory (e.g. NOR, NAND), SRAM, DDR SDRAM, or the like. Memory 304 may be fixed within computing device 300 and may also include removable memory (e.g. SD, SDHC, MMC, MINI SD, MICRO SD, SIM). The above are examples of computer readable tangible media that may be used to store embodiments of the present invention, such as computer-executable software code (e.g. firmware, application programs), security applications, application data, operating system data, databases, or the like. Additionally, in some embodiments, a secure device including secure memory and / or a secure processor is provided. It is contemplated that other existing and / or later-developed memory and memory technology may be used in various embodiments of the present invention.
[0035] In various embodiments, display 306 may be based upon a variety of later-developed or current display technology, including LED or OLED displays and / or status lights; touch screen technology (e.g. resistive displays, capacitive displays, optical sensor displays, electromagnetic resonance, or the like); and the like. Additionally, display 306 may include single touch or multiple-touch sensing capability. Any later-developed or conventional output display technology may be used for embodiments of the output display, such as LED, IPS, OLED, Plasma, electronic ink (e.g. electrophoretic, electrowetting, interferometric modulating), or the like. In various embodiments, the resolution of such displays and the resolution of such touch sensors may be set based upon engineering or non-engineering factors (e.g. sales, marketing). In some embodiments, display 306 may be integrated into computing device 300 or may be separate. In some embodiments, display 306 may be in virtually any size or resolution, such as a 3K resolution display, a microdisplay, one or more individual status or communication lights, e.g. LEDs, or the like.
[0036] In some embodiments of the present invention, acquisition device 310 may include one or more sensors, drivers, lenses, and the like. The sensors may be visible light, infrared, and / or UV sensitive sensors, ultrasonic sensors, or the like, that are based upon any later-developed or conventional sensor technology, such as CMOS, CCD, or the like. In some embodiments of the present invention, image recognition algorithms, image processing algorithms, or other software programs may be provided for operation upon processor 302 to process the acquired data. For example, such software may pair with enabled hardware to provide functionality such as: facial recognition (e.g. Face ID, head tracking, camera parameter control, or the like); fingerprint capture / analysis; blood vessel capture / analysis; iris scanning capture / analysis; otoacoustic emission (OAE) profiling and matching; and the like. [0037] In various embodiments, audio input / output 312 may include a microphone(s) / speakers. In various embodiments, voice processing and / or recognition software may be provided to applications processor 302 to enable the user to operate computing device 300 by stating voice commands. In various embodiments of the present invention, audio input 312 may provide user input data in the form of a spoken word or phrase, or the like, as described above. In some embodiments, audio input / output 312 may be integrated into computing device 300 or may be separate.
[0038] In various embodiments, wired interface 314 may be used to provide data or instruction transfers between computing device 300 and an external source, such as a computer, a remote server, a POS server, a local security server, a storage network, another computing device 300, an IMU, a video camera, or the like. Embodiments may include any later-developed or conventional physical interface / protocol, such as: USB, micro USB, mini USB, USB-C, Firewire, Apple Lightning connector, Ethernet, POTS, custom interface or dock, or the like. In some embodiments, wired interface 314 may also provide electrical power, or the like, to power source 324, or the like. In other embodiments, interface 314 may utilize close physical contact of device 300 to a dock for transfer of data, magnetic power, heat energy, light energy, laser energy, or the like. Additionally, software that enables communications over such networks is typically provided.
[0039] In various embodiments, a wireless interface 316 may also be provided to provide wireless data transfers between computing device 300 and external sources, such as computers, storage networks, headphones, microphones, cameras, IMUs, or the like. As illustrated in Fig. 3, wireless protocols may include Wi-Fi (e.g. IEEE 802.11 a/b/g/n, WiMAX), Bluetooth, Bluetooth Low Energy (BLE), IR, near field communication (NFC), ZigBee, Ultra-Wide Band (UWB), mesh communications, and the like.
[0040] GNSS (e.g. GPS) receiving capability may also be included in various embodiments of the present invention. As illustrated in Fig. 3, GPS functionality is included as part of wireless interface 316 merely for the sake of convenience, although in implementation, such functionality may be performed by circuitry that is distinct from the Wi-Fi circuitry, the Bluetooth circuitry, and the like. In various embodiments of the present invention, GPS receiving hardware may provide user input data in the form of current GPS coordinates, or the like, as described above.
[0041] Additional wireless communications may be provided via RF interfaces in various embodiments. In various embodiments, RF interfaces 320 may support any future-developed or conventional radio frequency communications protocol, such as CDMA-based protocols (e.g. WCDMA), GSM-based protocols, HSUPA-based protocols, 4G, 5G, or the like. In some embodiments, various functionality is provided upon a single IC package, for example, the Marvell PXA330 processor, and the like. As described above, data transmissions between a smart device and the services may occur via Wi-Fi, a mesh network, 4G, 5G, or the like. [0042] Although the functional blocks in Fig. 3 are shown as being separate, it should be understood that the various functionality may be regrouped into different physical devices. For example, some processors 302 may include Bluetooth functionality. Additionally, some functionality need not be included in some blocks; for example, GPS functionality need not be provided in a provider server.
[0043] In various embodiments, any number of future-developed, current, or custom operating systems may be supported, such as iPhone OS (e.g. iOS), Google Android, Linux, Windows, MacOS, or the like. In various embodiments of the present invention, the operating system may be a multi-threaded multi-tasking operating system. Accordingly, inputs and / or outputs from and to display 306 and inputs and / or outputs to physical sensors 322 may be processed in parallel processing threads. In other embodiments, such events or outputs may be processed serially, or the like. Inputs and outputs from other functional blocks, such as acquisition device 310 and physical sensors 322, may also be processed in parallel or serially in other embodiments of the present invention.
[0044] In some embodiments of the present invention, physical sensors 322 (e.g. MEMS-based) may include accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, imaging sensors (e.g. blood oxygen, heartbeat, blood vessel, iris data, etc.), thermometers, otoacoustic emission (OAE) testing hardware, and the like. The data from such sensors may be used to capture data associated with device 300 and a user of device 300. Such data may include physical motion data, pressure data, orientation data, or the like. Data captured by sensors 322 may be processed by software running upon processor 302 to determine characteristics of the user, e.g. gait, gesture performance data, or the like, and used for user authentication purposes. In some embodiments, sensors 322 may also provide physical outputs, e.g. vibrations, pressures, and the like. [0045] In some embodiments, a power supply 324 may be implemented with a battery (e.g. LiPo), an ultracapacitor, or the like, that provides operating electrical power to device 300. In various embodiments, any number of power generation techniques may be utilized to supplement or even replace power supply 324, such as solar power, liquid metal power generation, thermoelectric engines, RF harvesting (e.g. NFC), or the like.
[0046] Fig. 3 is representative of the components possible for a processing device. It will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention. Embodiments of the present invention may include at least some, but need not include all, of the functional blocks illustrated in Fig. 3. For example, a processing unit may include some of the functional blocks in Fig. 3, but it need not include an accelerometer or other physical sensor 322, an acquisition device 310, an internal power source 324, or the like.
[0047] In light of the above, other variations and adaptations will be apparent to one of ordinary skill in the art. For example, outputs from embodiments may be provided to an autonomous driving system which may steer a vehicle (e.g. car, drone) based upon the predicted flow data; outputs may be used to provide audible, visual, or haptic feedback to a user, for example, if a feature in a field of view is identified or labeled as a pedestrian; a product being manufactured may be identified for further inspection, for example, if the determined optical flow of the product does not match predefined criteria; a robot may be identified as requiring servicing, for example, if the determined optical flow does not match predefined criteria; and the like. Further, in other embodiments, other methods for segmenting real world data may be used besides K-fold cross-validation.
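As a purely illustrative sketch of the K-fold segmentation of real world data mentioned above, the following Python snippet splits a real world dataset into training and held-out folds; the file names, fold count, and training call are assumptions for illustration only and are not part of the specification.

```python
# Minimal K-fold sketch; "image_pairs" and the fold count are hypothetical.
from sklearn.model_selection import KFold

# Hypothetical list of real world image-pair files used for fine-tuning.
image_pairs = [f"real_pair_{i:04d}.npz" for i in range(100)]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold_index, (train_idx, val_idx) in enumerate(kfold.split(image_pairs)):
    train_subset = [image_pairs[i] for i in train_idx]
    val_subset = [image_pairs[i] for i in val_idx]
    # Fine-tune the flow network on train_subset, then measure the optical flow
    # loss on val_subset; the per-fold losses may be combined into a single loss.
    print(f"fold {fold_index}: {len(train_subset)} train / {len(val_subset)} val pairs")
```

Any other splitting strategy, as noted above, could be substituted for the K-fold split shown here.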
[0048] The block diagrams of the architecture and flow charts are grouped for ease of understanding. However, it should be understood that combinations of blocks, additions of new blocks, re-arrangement of blocks, and the like are contemplated in alternative embodiments of the present invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.
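By way of a non-limiting illustration of the training procedure summarized above and recited in the claims below, the following PyTorch sketch combines a pixel-wise flow loss with a feature-wise contrastive loss computed between a first feature map and a warped second feature map. The encoder, flow predictor, warping convention, temperature, and loss weighting shown here are assumptions made for illustration and are not taken from the specification.

```python
# Minimal training-step sketch; encoder, flow_predictor, and all hyperparameters
# are placeholders, not the actual networks of this application.
import torch
import torch.nn.functional as F


def warp_features(feat2, flow):
    """Backward-warp the second feature map using the predicted flow (bilinear)."""
    n, c, h, w = feat2.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat2.device),
        torch.arange(w, device=feat2.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()               # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow                         # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(feat2, grid, align_corners=True)


def feature_contrastive_loss(feat1, warped_feat2, temperature=0.07):
    """InfoNCE-style loss: a location in feat1 should match the same location in
    the warped second feature map (positive) and differ from other locations."""
    n, c, h, w = feat1.shape
    a = F.normalize(feat1.flatten(2), dim=1)                  # (N, C, H*W)
    b = F.normalize(warped_feat2.flatten(2), dim=1)           # (N, C, H*W)
    logits = torch.bmm(a.transpose(1, 2), b) / temperature    # (N, H*W, H*W)
    target = torch.arange(h * w, device=feat1.device).repeat(n)
    return F.cross_entropy(logits.reshape(n * h * w, h * w), target)


def training_step(encoder, flow_predictor, optimizer, img1, img2, flow_gt,
                  contrastive_weight=0.1):
    feat1 = encoder(img1)                                     # first feature map
    feat2 = encoder(img2)                                     # second feature map
    flow_pred = flow_predictor(feat1, feat2)                  # e.g. GRU-based updates
    pixel_loss = (flow_pred - flow_gt).abs().mean()           # pixel-wise flow loss
    warped_feat2 = warp_features(feat2, flow_pred)
    feature_loss = feature_contrastive_loss(feat1, warped_feat2)
    total = pixel_loss + contrastive_weight * feature_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return pixel_loss.item(), feature_loss.item()
```

Under the two-stage schedule recited in claims 18 through 20, such a step could first be run over a synthetic dataset and then over K-fold subsets of a real world dataset; the pseudo-feature refinement of claims 3 through 5 could further wrap the feature-wise loss, keeping a newly labeled pseudo feature only when the revised feature-wise loss decreases.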

Claims

CLAIMS
We claim:
1. A method for a computing system for estimating optical flow comprising: determining, in a computing system, a first features map in response to a first image, and a second features map in response to a second image map; implementing, in the computing system, a gated recurrent unit (GRU) to determine a pixel-wise flow prediction in response to the first features map and the second features map; determining, in the computing system, a warped feature map in response to the second image map; implementing, in the computing system, a feature-wise contrastive loss function to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map; determining, in the computing system, a pixel-wise flow loss in response to the pixel-wise flow prediction and in response to pixel-wise ground truth data; and modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the feature-wise loss.
2. The method of claim 1 wherein the determining, in the computing system, the warped feature map comprises determining, in the computing system, the warped feature map also in response to a feature-wise ground truth data; and wherein the feature-wise ground truth data comprises pre-identified features.
3. The method of claim 2 further comprising: labeling, in the computing system, a pseudo feature in the feature-wise ground truth data; determining, in the computing system, feature-wise pseudo ground truth data in response to labeling of the pseudo feature; determining, in the computing system, a revised warped feature map in response to the feature-wise pseudo ground truth data; and implementing, in the computing system, the feature-wise contrastive loss function to determine a revised feature-wise loss in response to first features in the first image map and third features in the revised warped feature map.
4. The method of claim 3 further comprising: determining, in the computing system, whether the revised feature-wise loss is less than the feature-wise loss; and wherein the modifying, in the computing system, parameters of the GRU comprises: modifying, in the computing system, parameters of the GRU in response to the pixel-wise flow loss and to the revised feature-wise loss, and in response to the revised feature-wise loss being determined to be less than the feature-wise loss.
5. The method of claim 3 further comprising: determining, in the computing system, whether the feature-wise loss is less than the revised feature-wise loss; and removing, in the computing system, labeling of the pseudo feature in the feature-wise ground truth data, and in response to the feature-wise loss being determined to be less than the revised feature-wise loss.
6. The method of claim 1 further comprising: determining a correlation volume in response to the first features map and the second features map; and wherein the implementing, in the computing system, the gated recurrent unit (GRU) comprises implementing, in the computing system, the gated recurrent unit (GRU) to determine the pixel-wise flow prediction in response to the first features map and to the correlation volume.
7. The method of claim 6 wherein the correlation volume comprises parameters selected from a group consisting of: location of a region on an image, a size of a region within an image, a correlation parameter between images, and temporal data.
8. The method of claim 1 wherein the feature-wise loss is associated with a plurality of pixels; and wherein the pixel-wise flow loss is associated with a pixel from the plurality of pixels.
9. A computing system for estimating optical flow comprising: a pixel-based analysis system configured to determine a first features map in response to a first image, and a second features map in response to a second image map, wherein the pixel-based analysis system comprises a gated recurrent unit (GRU) configured to determine a pixel-wise flow prediction in response to the first features map and the second features map; and a feature-based analysis system coupled to the pixel-based analysis system, wherein the feature-based analysis system is configured to determine a warped feature map in response to the second image map, wherein the feature-based analysis system comprises a contrastive loss unit configured to determine a feature-wise loss in response to first features in the first image map and second features in the warped feature map; wherein the pixel-based analysis system is configured to modify parameters of the GRU in response to a pixel-wise flow loss and to the feature-wise loss.
10. The computing system of claim 9 wherein the feature-based analysis system is configured to determine the warped feature map in response to the second feature map and a feature-wise ground truth data; and wherein the feature-wise ground truth data comprises pre-identified features.
11. The computing system of claim 10 wherein the feature-based analysis system is configured to label a pseudo feature in the feature-wise ground truth data; wherein the feature-based analysis system is configured to determine feature-wise pseudo ground truth data in response to labeling of the pseudo feature; wherein the feature-based analysis system is configured to determine a revised warped feature map in response to the feature-wise pseudo ground truth data; and wherein the contrastive loss unit is configured to determine a revised feature-wise loss in response to first features in the first image map and third features in the revised warped feature map.
12. The computing system of claim 11 further comprising: wherein the feature-based analysis system is configured to determine whether the revised feature-wise loss is less than the feature-wise loss; and wherein the pixel-based analysis system is configured to modify parameters of the GRU in response to the pixel-wise flow loss and to the revised feature-wise loss, and in response to the revised feature-wise loss being determined to be less than the feature-wise loss.
13. The computing system of claim 11 further comprising: wherein the feature-based analysis system is configured to determine whether the feature-wise loss is less than the revised feature-wise loss; and wherein the feature-based analysis system is configured to remove labeling of the pseudo feature in the feature-wise ground truth data, and in response to the feature-wise loss being determined to be less than the revised feature-wise loss.
14. The computing system of claim 9 wherein the pixel-based analysis system is configured to determine a correlation volume in response to the first features map and the second features map; and wherein the gated recurrent unit (GRU) is configured to determine the pixel-wise flow prediction in response to the first features map, the second features map and the correlation volume.
15. The computing system of claim 14 wherein the correlation volume comprises parameters selected from a group consisting of: location of a region on an image, a size of a region within an image, a correlation parameter between images, and temporal data.
16. The computing system of claim 9 wherein the feature-wise loss is associated with a plurality of pixels; and wherein the pixel-wise flow loss is associated with a pixel from the plurality of pixels.
17. The computing system of claim 9 wherein the pixel-based analysis system comprises a recurrent all-pairs field transforms (RAFT) system.
18. A method comprising: operating an optical flow prediction system comprising contrastive loss functionality in response to a synthetic dataset to determine a first optical flow loss; adjusting parameters of the optical flow prediction system in response to the first optical flow loss; thereafter operating the optical flow prediction system in response to a real world dataset to determine a second optical flow loss; and adjusting parameters of the optical flow prediction system in response to the second optical flow loss.
19. The method of claim 18 wherein the operating the optical flow prediction system in response to the real world dataset comprises: determining a first subset of real world data from the real world dataset; operating the optical flow prediction system in response to the first subset of real world data to determine a third optical flow loss; determining a second subset of real world data from the real world dataset; operating the optical flow prediction system in response to the second subset of real world data to determine a fourth optical flow loss; and determining the second optical flow loss in response to the third optical flow loss and the fourth optical flow loss.
20. The method of claim 19 wherein the first subset of real world data comprises a plurality of real world images including a first real world image; wherein the first real world image comprises a plurality of manually labeled features; and wherein the operating the optical flow prediction system in response to the first subset of real world data to determine the third optical flow loss comprises automatically labeling a pseudo feature in the first real world image.
PCT/US2023/069357 2022-10-12 2023-06-29 Methods and apparatus for optical flow estimation with contrastive learning WO2024081455A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263415358P 2022-10-12 2022-10-12
US63/415,358 2022-10-12

Publications (1)

Publication Number Publication Date
WO2024081455A1 true WO2024081455A1 (en) 2024-04-18

Family

ID=90670160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/069357 WO2024081455A1 (en) 2022-10-12 2023-06-29 Methods and apparatus for optical flow estimation with contrastive learning

Country Status (1)

Country Link
WO (1) WO2024081455A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226892A1 (en) * 2012-02-29 2013-08-29 Fluential, Llc Multimodal natural language interface for faceted search
US20190294970A1 (en) * 2018-03-23 2019-09-26 The Governing Council Of The University Of Toronto Systems and methods for polygon object annotation and a method of training an object annotation system
US20210240761A1 (en) * 2019-01-31 2021-08-05 Shenzhen Sensetime Technology Co., Ltd. Method and device for cross-modal information retrieval, and storage medium
US20200327377A1 (en) * 2019-03-21 2020-10-15 Illumina, Inc. Artificial Intelligence-Based Quality Scoring

Similar Documents

Publication Publication Date Title
US11798271B2 (en) Depth and motion estimations in machine learning environments
US10198823B1 (en) Segmentation of object image data from background image data
US11526713B2 (en) Embedding human labeler influences in machine learning interfaces in computing environments
US9965865B1 (en) Image data segmentation using depth data
US9442564B1 (en) Motion sensor-based head location estimation and updating
US9911395B1 (en) Glare correction via pixel processing
US9652031B1 (en) Trust shifting for user position detection
US9354709B1 (en) Tilt gesture detection
US9832452B1 (en) Robust user detection and tracking
US10055013B2 (en) Dynamic object tracking for user interfaces
CN116912514A (en) Neural network for detecting objects in images
US20120026335A1 (en) Attribute-Based Person Tracking Across Multiple Cameras
US11727576B2 (en) Object segmentation and feature tracking
US11600039B2 (en) Mechanism for improved light estimation
JP2021061573A (en) Imaging system, method for imaging, imaging system for imaging target, and method for processing intensity image of dynamic scene acquired using template, and event data acquired asynchronously
US11381743B1 (en) Region of interest capture for electronic devices
US11598976B1 (en) Object recognition for improving interfaces on an eyewear device and other wearable and mobile devices
US20220075994A1 (en) Real-time facial landmark detection
KR20240024277A (en) Gaze classification
Nafea et al. A Review of Lightweight Object Detection Algorithms for Mobile Augmented Reality
US20190096073A1 (en) Histogram and entropy-based texture detection
WO2024081455A1 (en) Methods and apparatus for optical flow estimation with contrastive learning
US20230115371A1 (en) Efficient vision perception
WO2024049513A1 (en) Methods and apparatus for forecasting collisions using egocentric video data
WO2023044661A1 (en) Learning reliable keypoints in situ with introspective self-supervision

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878062

Country of ref document: EP

Kind code of ref document: A1