US20230343062A1 - Tracking users across image frames using fingerprints obtained from image analysis - Google Patents
Tracking users across image frames using fingerprints obtained from image analysis
- Publication number
- US20230343062A1 (application US 18/215,075)
- Authority
- US
- United States
- Prior art keywords
- images
- human
- series
- bounding box
- embeddings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
- G06T2207/30261—Obstacle
Abstract
Systems and methods are disclosed herein for tracking a vulnerable road user (VRU) regardless of occlusion. In an embodiment, the system captures a series of images including the VRU, and inputs each of the images into a detection model. The system receives a bounding box for each of the series of images of the VRU as output from the detection model. The system inputs each bounding box into a multi-task model, and receives as output from the multi-task model an embedding for each bounding box. The system determines, using the embeddings for each bounding box across the series of images, an indication of which of the embeddings correspond to the VRU.
Description
- This application is a continuation of pending U.S. patent application Ser. No. 16/857,645, filed on Apr. 24, 2020, entitled “Tracking Vulnerable Road Users Across Image Frames Using Fingerprints Obtained from Image Analysis”, the contents of which are hereby incorporated in their entirety by reference.
- This disclosure relates generally to intelligent road vehicle applications, and more specifically relates to tracking vulnerable road users from frame-to-frame.
- Autonomous vehicles attempt to track pedestrians over time; however, when a pedestrian is occluded in a set of intermediate frames of a video (e.g., because the pedestrian walks behind an object, a car drives past the pedestrian, or another pedestrian passes in front), tracking the pedestrian across frames may be difficult or impossible. When tracking fails, existing systems may assume that a pedestrian who re-enters the frame following occlusion is a new person who was not observed in a prior frame. When this happens, previously observed information and predictions made for the pedestrian are lost. This results in suboptimal predictions of how the pedestrian may act, which may in turn result in inaccurate, inefficient, or unsafe activity of an autonomous vehicle.
- FIG. 1 depicts one embodiment of a vehicle including a fingerprint tool for tracking vulnerable road users before and after occlusion.
- FIG. 2 depicts one embodiment of an image that may be preprocessed prior to being used by the fingerprint tool.
- FIG. 3 depicts one embodiment of exemplary modules executed by the fingerprint tool and a data flow therethrough.
- FIG. 4 depicts one embodiment of a neural network model for extracting a fingerprint from an image.
- FIG. 5 depicts an exemplary flowchart of a process for using a fingerprint tool.
- The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
- (a) Overview
- Systems and methods are disclosed herein for tracking a vulnerable road user (“VRU”) over time regardless of whether occlusion occurs. In order to overcome the occlusion problem, a multi-task machine learning model is trained to combine multiple encoders, each of which is trained to predict activity of a VRU. Cropped bounding box images are input into the machine learning model, and a fingerprint of the VRU is output, optionally along with additional information. By combining the multiple encoders into the machine learning model, a high degree of confidence is achieved across frames that a same individual is tracked, even where partial or complete occlusion occurs.
- (b) Capturing Images of VRUs
- FIG. 1 depicts one embodiment of a vehicle including a fingerprint tool for tracking vulnerable road users before and after occlusion. Environment 100 includes fingerprint tool 110, which is depicted as part of vehicle 120. Vehicle 120 may be an autonomous, semi-autonomous, or non-autonomous vehicle. VRUs 130 are depicted throughout environment 100. The term VRU, as used herein, refers not only to pedestrians, but also to human beings on micro-mobility vehicles such as bicycles, scooters, or wheelchairs (and similar), and to companions of human beings, such as dogs and other pets. Vehicle 120 is equipped with one or more camera sensors. Camera sensors, as used herein, capture images over time. The images may be picture images, and may include, or separately produce, images showing other sensory input such as depth and/or thermal information. The images may be captured periodically, based on a certain amount of time passing, or based on some other event, such as vehicle 120 having traveled a certain amount of distance since the last time an image was taken. The images, taken together, may form a video (e.g., a high-resolution video); in this context, the images are occasionally referred to as “frames” herein. The images from multiple camera sensors taken at the same time may be stitched together (e.g., to form panoramic or 360-degree images), or may be used as stand-alone images. Vehicle 120 may also be equipped with other sensors that provide auxiliary information in connection with the images from the camera sensor(s). Exemplary sensors include a global positioning system (GPS) sensor and an inertial measurement unit (IMU) sensor, though any sensor for measuring any type of information may be included.
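- Purely as an illustrative sketch of periodic capture (one of the capture triggers mentioned above), the loop below pulls frames from a single camera at a fixed interval; the camera index, interval, and frame cap are assumptions made for illustration and are not part of this disclosure.

```python
import time
import cv2  # OpenCV, used here only for illustration

CAPTURE_INTERVAL_S = 0.1  # hypothetical capture period (ten frames per second)

def capture_frames(camera_index=0, max_frames=100):
    """Capture a series of frames from one camera sensor at a fixed interval."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    try:
        while cap.isOpened() and len(frames) < max_frames:
            ok, frame = cap.read()          # BGR pixel array for one frame
            if not ok:
                break
            frames.append(frame)
            time.sleep(CAPTURE_INTERVAL_S)  # periodic capture based on time passing
    finally:
        cap.release()
    return frames
```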
- Fingerprint tool 110 receives the images, generates a fingerprint of each VRU 130, and tracks VRUs 130 over time. The term fingerprint, as used herein, refers to an anonymous representation of a VRU that is used to track the VRU's actions until such a time that the VRU leaves the periphery of the camera sensors of vehicle 120. Structural details of what a fingerprint includes are discussed in further detail with reference to FIGS. 3 and 4. Fingerprint tool 110 tracks each of VRUs 130 regardless of whether the VRUs are occluded, in part or in full, by another object or another VRU. The details of how fingerprint tool 110 functions are described in further detail with reference to FIG. 3 below. Fingerprint tool 110 may feed output to other tools of vehicle 120, which may result in certain functionality, such as modified autonomous driving behavior, an alert to a driver of vehicle 120, data for processing by a dash cam, and so on. Fingerprint tool 110 is depicted as being installed in vehicle 120, but may be installed, in whole or in part, on a remote server, where vehicle 120 communicates with the remote server to pass along images, and to receive the outputs of fingerprint tool 110. In an embodiment, fingerprint tool 110 may be entirely incorporated into a cloud context, to perform post-event evaluation of VRU behavior and accident detection (e.g., for insurance or fleet management purposes). In such an embodiment, sensor data is sent to the cloud context for processing in manners similar to those described herein.
- (c) Exemplary Pre-Processing of Images
- FIG. 2 depicts one embodiment of an image that may be preprocessed prior to being used by the fingerprint tool. Image 200 represents a frame, such as a full frame of a video captured by image sensors of vehicle 120. In an embodiment, image 200 is fed in its entirety as input into fingerprint tool 110 for processing. In an embodiment, fingerprint tool 110 preprocesses image 200 by executing a detection model, which detects one or more VRUs 230 in image 200, and responsively applies a bounding box. To preprocess image 200, fingerprint tool 110 may crop the one or more VRUs 230 into cropped portion(s) 250, and use the one or more cropped portion(s) 250 as input into an encoder, such as one or more shared layers of a fingerprint extraction module described below with reference to FIG. 3. Additionally, fingerprint tool 110 may determine the coordinates of each bounding box for input into a tracking module. In an embodiment, as will be described below with respect to FIG. 3, both image 200 and cropped portions 250 may be used as input into a tracking module. Using cropped portion(s) 250 as input to the encoder offers an advantage over using full images, as non-cropped parts of the full images may include noise that is not informative as to activity of the VRU. Moreover, running the encoder, which is processor-intensive, on relevant bounding box data to the exclusion of the full image increases accuracy, and reduces latency and graphics processing unit (GPU) footprint. Processing by the encoder and tracking module, as well as outputs of the tracking module, are discussed in further detail below with reference to FIGS. 3-4.
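- The following is a minimal illustrative sketch of the cropping step described above; the (x1, y1, x2, y2) box format and the function name are assumptions for illustration and are not prescribed by this disclosure.

```python
import numpy as np

def crop_bounding_boxes(image: np.ndarray, boxes):
    """Crop each (x1, y1, x2, y2) bounding box out of a full frame.

    `image` is an H x W x C pixel array and `boxes` is a list of pixel
    coordinates. Only the cropped regions are passed on to the encoder,
    which avoids spending compute on uninformative background pixels.
    """
    h, w = image.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp to image bounds so a box at the frame edge still crops cleanly.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        crops.append(image[y1:y2, x1:x2].copy())
    return crops
```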
- (d) Fingerprint Tool Functionality
- FIG. 3 depicts one embodiment of exemplary modules executed by the fingerprint tool and a data flow therethrough. Sensor module 305 receives sensor data captured by vehicle 120, such as images and auxiliary data like GPS and IMU data. Sensor module 305 passes the images to VRU detection machine learning model 310, and passes the auxiliary data (e.g., GPS and IMU data, depth data, thermal data, etc.) to tracking module 330. VRU detection machine learning model 310 pre-processes the images and outputs cropped bounding box images 315 and/or bounding box coordinates 320, as was discussed with reference to FIG. 2. To achieve this end, VRU detection machine learning model 310 is trained to detect one or more VRUs in the image, and to output a cropped bounding box of each VRU detected in the image and/or the coordinates of where bounding boxes would be applied in the context of the image. Images may be pre-processed in real time or near-real time as live video is received and frames are taken therefrom. In an embodiment where video is processed after an event has occurred, such as post-processing on a cloud platform, images may be pre-processed faster than real time (e.g., a video can be analyzed in less time than the length of the video itself).
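- As a hedged sketch only, an off-the-shelf person detector can stand in for VRU detection machine learning model 310 when experimenting with this pipeline; the disclosure does not prescribe this architecture, and the score threshold and person-class filter below are illustrative assumptions.

```python
import torch
import torchvision

# A stock detector used purely as a stand-in for VRU detection machine learning
# model 310; any model that emits per-VRU bounding boxes would serve.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

PERSON_CLASS_ID = 1  # COCO class index for "person"

@torch.no_grad()
def detect_vrus(frame_tensor, score_threshold=0.7):
    """Return bounding boxes (x1, y1, x2, y2) for likely VRUs in one frame.

    `frame_tensor` is a float tensor of shape (3, H, W) scaled to [0, 1].
    """
    output = detector([frame_tensor])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = (output["labels"] == PERSON_CLASS_ID) & (output["scores"] >= score_threshold)
    return output["boxes"][keep].tolist()
```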
- Fingerprint extraction module 325 generates a fingerprint of each VRU depicted in each bounding box of the image. To do this, fingerprint extraction module 325 feeds, as input to a multi-task model, each cropped bounding box 315 that is received as output from VRU detection machine learning model 310. The multi-task model is a multi-task neural network with branches that are each trained to determine different parameters about a VRU, and its particulars are discussed in further detail with reference to FIG. 4. The multi-task model outputs a fingerprint for each bounding box in the image. In an embodiment, each fingerprint comprises an embedding (i.e., a vector in latent space containing data that is characteristic of the bounding box), and is output or derived from an intermediate layer of the multi-task neural network, which acts as an encoder. Further details of this embedding are described below with respect to FIG. 4. Fingerprint extraction module 325 may store each of the fingerprints to memory.
- Tracking module 330 obtains the fingerprints generated by fingerprint extraction module 325 from the memory (or receives them directly as output of fingerprint extraction module 325), and determines therefrom which fingerprints represent the same entity (that is, which bounding boxes contain the same person across different ones of the images). Tracking module 330 may also receive auxiliary data, such as GPS data, IMU data, image depth data and/or thermal data (or partial image depth data and/or thermal data indicating depth/thermal data for each bounding box), and any other auxiliary data, from sensor module 305. In an embodiment, tracking module 330 performs this determination by using an unsupervised learning model, such as a clustering algorithm. Exemplary clustering algorithms include a nearest neighbor clustering algorithm on a set of data as well as centroid, density, and hierarchical algorithms. Soft clustering may be used as well, for example, using an expectation-maximization algorithm to compute maximum likelihood or maximum a posteriori estimates to obtain the parameters of the model (e.g., a Gaussian mixture model). A Bayesian inference may be used to perform soft clustering in order to receive as output an uncertainty value (described further below). An expectation-maximization algorithm may also, or alternatively, use Bayesian inference to obtain the maximum likelihood estimate (MLE) and the maximum a posteriori (MAP) estimate. A full Bayesian inference may be used to obtain a distribution over the parameters and latent variables of the Gaussian mixture model to obtain a better understanding of model fit and uncertainty. The set of data may include each of the fingerprints, across each of the images and each of the bounding boxes obtained therefrom, in a given session, a session being an amount of time of operation (e.g., the last ten minutes, the last ten seconds, since a car was turned on, since a given VRU was detected, etc.). In another embodiment, the set of data includes fingerprints obtained across sessions, where those sessions may be from a same vehicle 120, or from images derived from different vehicles 120. In either case, the set of data may also include the auxiliary data. The memory to which fingerprint extraction module 325 stores the fingerprints may be purged at the end of a session, or may be maintained for any prescribed amount of time, and may optionally be stored with auxiliary data, such as GPS data (which will improve tracking across sessions).
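- A minimal sketch of one of the clustering options named above (a hierarchical algorithm) follows; the distance threshold is an illustrative value that would have to be tuned, and the function name is an assumption rather than part of the disclosure.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_fingerprints(embeddings: np.ndarray, distance_threshold: float = 0.5):
    """Group fingerprint embeddings so that, ideally, each cluster holds one VRU.

    `embeddings` is an (N, D) array of fingerprints pooled over a session.
    Returns an array of N cluster labels, one per bounding box, so boxes with
    the same label are treated as the same person across frames.
    """
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    return clustering.fit_predict(embeddings)
```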
- Tracking module 330 identifies, as a result of the clustering algorithm, clusters of embeddings, where each cluster holds the embeddings corresponding to the bounding boxes of a particular person through the frames. Thus, tracking module 330 determines whether a user is the same user across various frames, despite occlusions, based on a fingerprint having an embedding within a given cluster. Tracking module 330 may apply a set of logical rules to the clustering. For example, tracking module 330 may prevent the clustering of two vectors corresponding to bounding boxes that appear in the same frame, as the same person cannot exist twice in a single frame.
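- A hedged sketch of how such a rule might be checked is shown below; the data layout (parallel lists of frame identifiers and cluster labels) is an assumption made for illustration.

```python
from collections import defaultdict

def same_frame_conflicts(frame_ids, labels):
    """Flag bounding boxes whose cluster already holds another box from the same frame.

    `frame_ids[i]` is the frame a bounding box came from and `labels[i]` is its
    cluster label. Because the same person cannot appear twice in one frame,
    any flagged index is a candidate for re-assignment to a different cluster.
    """
    seen = defaultdict(set)   # cluster label -> frames that already have a member
    conflicts = []
    for i, (frame, label) in enumerate(zip(frame_ids, labels)):
        if frame in seen[label]:
            conflicts.append(i)
        else:
            seen[label].add(frame)
    return conflicts
```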
- Tracking module 330 may output, in addition to a determination that a bounding box includes the same person across frames, a certainty score associated with its determination, that is, a probability that its determination that this is the same person across frames is true. To determine the certainty score, tracking module 330 may apply a probabilistic clustering algorithm (e.g., a soft clustering) to each cluster to determine the probability that the person belongs to a given cluster. For example, a Gaussian mixture model (GMM) may be used as the soft clustering algorithm, where a latent variable is assigned to each data point. The latent variables each represent the probability that a given observed data point belongs to a given cluster component (that is, the particular component of the GMM). By fitting a GMM to the embeddings, tracking module 330 may obtain, for each new observation, the probability of it belonging to a given cluster. Tracking module 330 may assign, or derive, the certainty score using the probability.
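- The sketch below shows one way such a certainty score could be derived with a Gaussian mixture; treating the number of components (distinct VRUs) as known is a simplification, and the diagonal covariance choice is an illustrative assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_cluster_certainty(embeddings: np.ndarray, n_components: int):
    """Fit a Gaussian mixture to fingerprints and derive per-box certainty scores.

    Each row of `responsibilities` gives the probability that a bounding box
    belongs to each mixture component; the row maximum is used here as the
    certainty that the box belongs to its assigned cluster.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(embeddings)
    responsibilities = gmm.predict_proba(embeddings)   # shape (N, n_components)
    assignments = responsibilities.argmax(axis=1)
    certainty = responsibilities.max(axis=1)
    return assignments, certainty
```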
- (e) Exemplary Multi-Task Model for Fingerprint Tool
- FIG. 4 depicts one embodiment of a neural network model for extracting a fingerprint from an image. Multi-task model 400 is an exemplary representation of multi-task model 312. Multi-task model 400 may be initialized based on a trigger. Exemplary triggers include ignition of vehicle 120, a VRU being detected in an image after a threshold amount of time has passed since the last time a VRU was detected, and the like. Following initialization, model inputs 410 are received by model 400. The multi-task model has a set of shared layers 420 and a plurality of branching task-specific layers 430, each branch of the branching task-specific layers 430 corresponding to a task 450. The tasks 450 are related within the domain, meaning that each of the tasks 450 predicts activities that are predictable based on a highly overlapping information space. For example, in predicting attributes of a pedestrian for use by an autonomous transportation system, the different tasks 450 may predict activities such as whether the pedestrian is aware of the vehicle, whether the pedestrian is looking at a phone, whether the pedestrian is an adult, and so on. As such, when trained, the shared layers 420 produce information that is useful for performing each of tasks 450 and outputting each of these predictions.
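- A minimal PyTorch sketch of this shared-layers-plus-branches structure is given below; the backbone, layer sizes, and task names are assumptions for illustration only, since the disclosure requires only shared layers 420 feeding a plurality of task-specific branches 430.

```python
import torch
import torch.nn as nn

class MultiTaskFingerprintNet(nn.Module):
    """Illustrative multi-task network: shared layers plus per-task branches."""

    def __init__(self, embedding_dim: int = 128,
                 tasks=("aware_of_vehicle", "looking_at_phone", "is_adult")):
        super().__init__()
        # Shared layers (playing the role of shared layers 420, i.e., the encoder).
        self.shared = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embedding_dim), nn.ReLU(),
        )
        # Task-specific branches (playing the role of layers 430): one binary head per task.
        self.heads = nn.ModuleDict({t: nn.Linear(embedding_dim, 1) for t in tasks})

    def forward(self, crops):
        embedding = self.shared(crops)                 # last shared layer -> fingerprint
        predictions = {t: head(embedding) for t, head in self.heads.items()}
        return embedding, predictions
```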
- To train multi-task model 400, training examples include labels associated with the tasks 450. Each training example includes a label for a branching task. In the context of a neural network, during training, the training examples are sampled, and for each sample a backpropagation algorithm may be performed to update 440 the task-specific layers 430 corresponding to the sampled example and the shared layers 420, though in non-neural-network implementations, back-propagation need not be used. Where back-propagation is used, the shared layers may be updated through backpropagation by a sample having a label for each of tasks 450. However, in use by fingerprint extraction module 325, model 400 is not used to perform the tasks that it has been trained to perform (that is, predict what it was trained to predict). Rather, the shared layers 420 have become good at converting the input (that is, the pixel information from the bounding boxes) into a set of information useful for the task-specific layers 430 to make the task-specific predictions. As such, the last of the shared layers 420 (or one of the later layers), when a particular bounding box is input, produces a set of information at the neurons of that layer that contains information that is highly relevant in the domain of the set of tasks. This does not preclude the branches or the encoder (that is, the shared layers) from performing predictions as they are trained to do (for example, the encoder/shared layers themselves may be used to detect whether a person is looking at a vehicle or not, and a VRU's intention to perform an activity may be determined by a trained branch corresponding to that activity).
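- One hedged way to express such a training step, building on the illustrative MultiTaskFingerprintNet sketch above, is shown below; the loss function, optimizer, and input shapes are assumptions rather than requirements of the disclosure.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, crops, task_name, labels):
    """Run one training step on examples labeled for a single branching task.

    The loss from that task's head backpropagates through the head and the
    shared layers, so every task contributes to shaping the shared encoder.
    """
    model.train()
    optimizer.zero_grad()
    _, predictions = model(crops)                       # crops: (B, 3, H, W) float tensor
    logits = predictions[task_name].squeeze(1)          # head for the labeled task
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    loss.backward()                                     # updates head + shared layers
    optimizer.step()
    return loss.item()

# Example usage with placeholder data:
# model = MultiTaskFingerprintNet()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, torch.rand(8, 3, 64, 64),
#                   "looking_at_phone", torch.randint(0, 2, (8,)))
```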
- Given the relevance of the last of the shared layers 420, rather than use multi-task model 400 to make a prediction as it was trained to do, fingerprint extraction module 325 feeds the pixel data from the bounding box into model 400 and then takes the values of that last shared layer (or one of the later layers) and stores them as an embedding for the bounding box. This embedding is a fingerprint, containing characteristic information about the contents of the bounding box that is relevant in the context of the tasks for which the model was trained. The individual dimensions of each embedding have no meaning on their own, but the general direction of the embedding in latent space has meaning. Thus, as described with reference to FIG. 3, the embeddings are compared in a clustering algorithm to see how similar they are within the context of the set of tasks that multi-task model 400 was trained to perform.
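- A short sketch of taking the shared-layer activations as the fingerprint, again building on the hypothetical MultiTaskFingerprintNet above, is given below.

```python
import torch

@torch.no_grad()
def extract_fingerprint(model, crop):
    """Return the shared-layer activations for one cropped bounding box.

    The prediction heads are ignored at inference time; only the embedding
    (the fingerprint) is kept, so the representation stays anonymous.
    """
    model.eval()
    embedding, _ = model(crop.unsqueeze(0))   # crop: (3, H, W) float tensor in [0, 1]
    return embedding.squeeze(0)               # fingerprint vector of length embedding_dim
```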
- (f) Exemplary Process of Using Fingerprint Tool
- FIG. 5 depicts an exemplary flowchart of a process for using a fingerprint tool. Process 500 begins with a fingerprint tool (e.g., fingerprint tool 110) capturing 502 images including a VRU (e.g., VRU 130). The images may be frames of a video. Capturing may be performed by cameras of fingerprint tool 110, or may be performed by receiving images from camera sensors installed on vehicle 120 (e.g., where fingerprint tool 110 is decoupled from the camera sensors, or where fingerprint tool 110 is, in part or in whole, operating in a cloud environment), or a combination of both. Sensor module 305 may capture the images along with auxiliary data.
- Fingerprint tool 110 inputs 504 each of the images into a detection model (e.g., VRU detection machine learning model 310). Fingerprint tool 110 receives 506 a bounding box for each of the series of images of the VRU as output from the detection model. Where multiple VRUs are present in the image, multiple bounding boxes may be received for that image. The detection model may also output bounding box coordinates (e.g., bounding box coordinates 320 for use by tracking module 330). Fingerprint tool 110 inputs 508 each bounding box into a multi-task model (e.g., multi-task model 325), and receives 510 as output from the multi-task model an embedding for each bounding box.
- Fingerprint tool 110 determines 512, using the embeddings for each bounding box across the series of images, an indication of which of the embeddings correspond to the VRU. For example, a clustering algorithm may be used, as discussed above, to determine to which cluster each given embedding corresponds, each cluster being representative of a different VRU. Thus, despite occlusion, fingerprint tool 110 may resolve whether a detected VRU is the same VRU or a different VRU from one captured in a prior frame. Advantageously, in a scenario where embeddings are tracked over time, this enables knowledge of whether a VRU that is encountered by vehicle 120 is the same VRU that has been encountered before, all the while maintaining that VRU's anonymity. This may feed back to other systems, such as a reporting or alert system, an autonomous vehicle drive mechanism, and others, where predictions for that particular VRU are influenced by that individual's past behavior. Other data may play into VRU identification by fingerprint tool 110, such as auxiliary data received that corresponds to the frame. Confidence scores that a given embedding corresponds to its indicated cluster may be output by the clustering algorithm as well, indicating how confident fingerprint tool 110 is that the VRU detected in a bounding box is a given known VRU.
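- Tying the steps of process 500 together, the sketch below composes the hypothetical helpers introduced earlier (detect_vrus, extract_fingerprint, and group_fingerprints); it is an illustrative outline under those assumptions, not the claimed implementation.

```python
import numpy as np
import torch

def track_vrus(frames, detector_fn, model, cluster_fn):
    """End-to-end sketch of process 500.

    `detector_fn` maps a frame tensor to bounding boxes (step 506), `model` is a
    multi-task network whose shared layers yield fingerprints (steps 508-510),
    and `cluster_fn` groups fingerprints into per-VRU clusters (step 512).
    Returns [((frame number, box), cluster label), ...] for every detection.
    """
    fingerprints, index = [], []
    for frame_no, frame in enumerate(frames):                     # steps 502-504
        tensor = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
        for box in detector_fn(tensor):                           # step 506
            x1, y1, x2, y2 = (int(v) for v in box)
            crop = tensor[:, y1:y2, x1:x2]
            fingerprints.append(extract_fingerprint(model, crop).numpy())  # steps 508-510
            index.append((frame_no, box))
    labels = cluster_fn(np.stack(fingerprints))                   # step 512
    return list(zip(index, labels))
```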
- (g) Summary
- Model 400 may sit on a graphics processing unit (GPU) used for a perception stack of an autonomous vehicle, or the GPU of an ADAS system, either integrated into the vehicle or added as a retrofit solution. Model 400 may also be integrated into telematics and fleet management solutions such as mobile phones or purpose-built dashcams. Alternatively, model 400 may run on a cloud-based server where model 400 analyzes data post-event, rather than in real time.
- Advantageously, VRUs are tracked using the techniques discussed herein without assigning an identity to the tracked VRUs. Thus, the automotive, robotics, infrastructure, logistics, security, and retail industries, and the like, can interface with model 400 to enable individual interactions with people based on their fingerprint. Model 400 may also allow for a more accurate high-level understanding of behavior, as model 400 can observe VRUs' behavior change over time with more confidence. Thus, explainability and reliability of predictive models are achieved, which results in safer and more efficient driving and operation of vehicles.
- Additionally, the outputting of a level of confidence from the models allows for an understanding of uncertainty of the perception. In practice, this allows a perception stack of a vehicle to understand whether it can rely on the incoming information. This aids both transparency in understanding of decisions and the creation of fail-safe processes, such as deciding to rely only on depth sensor data when confidence in a camera of vehicle 120 is very low due to the presence of many occlusions (e.g., a train is passing by that obscures many VRUs).
- Additional applications where the disclosed systems and methods may be incorporated include infrastructure-based analysis applications (e.g., on roads or train platforms), in-vehicle movement of people (e.g., where persons are at risk based on their movement within a bus or a train), and cloud-based analysis of data that has been captured in these and similar situations. Intelligent road vehicles, as used herein, refer to robotics and delivery robots as well, even where those robots are not specifically on a road (e.g., where robots in a packaging warehouse are operating near vulnerable people).
- The disclosed embodiments use one or more sub-layers of a multi-task model (e.g., a multi-headed encoder) for storing meaningful information about a VRU (that is, a fingerprint) in latent space. However, alternative implementations are within the scope of this disclosure as well. For example, auto-encoders, such as variational auto-encoders, may be used to store the fingerprint of a VRU. A variational auto-encoder may act as the backbone of the multi-task models described herein, having different branches with different task heads. In an embodiment, an autoencoder compresses data by representing it on a latent space with fewer dimensions than the original data had. The learned latent variables of the auto-encoder may be used as a fingerprint. Where a variational auto-encoder is used, latent variables may be represented as probability distributions with particular parameters.
Tracking module 330 may compare different images to identify whether the same VRU is within them by comparing their latent probability distributions and identifying the same VRU based on how well those distributions match.
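- Under the variational auto-encoder alternative just described, one hedged way to compare two detections is by a divergence between their latent distributions; the diagonal-Gaussian form and the threshold below are illustrative assumptions, not the claimed matching rule.

```python
import numpy as np

def gaussian_kl(mu_a, var_a, mu_b, var_b):
    """KL divergence between two diagonal Gaussians given as mean and variance vectors."""
    mu_a, var_a, mu_b, var_b = map(np.asarray, (mu_a, var_a, mu_b, var_b))
    return 0.5 * np.sum(np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def same_vru(latent_a, latent_b, threshold=1.0):
    """Treat two detections as the same VRU when the symmetric KL divergence
    between their latent distributions is small. `latent_*` are (mu, var) pairs."""
    forward = gaussian_kl(*latent_a, *latent_b)
    backward = gaussian_kl(*latent_b, *latent_a)
    return 0.5 * (forward + backward) < threshold
```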
- The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
- Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
- Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
- Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
- Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
- Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
Claims (20)
1. A method comprising:
capturing a series of images comprising a plurality of humans, a human of the plurality of humans at least partially occluded in at least some images of the series of images;
determining one or more respective bounding boxes for each respective image of the series of images, each respective bounding box for a respective human within the respective image;
inputting each bounding box into a multi-task model;
receiving as output from the multi-task model an embedding for each bounding box, the embedding produced from a shared layer of the multi-task model, the multi-task model comprising the shared layer and a plurality of branches each trained to predict a different activity, wherein the shared layer is trained using backpropagation from the plurality of branches; and
determining, using the embeddings for each bounding box across the series of images, an indication of which of the embeddings correspond to the human as opposed to a different human of the plurality of humans despite partial occlusion of the human.
2. The method of claim 1, wherein capturing the series of images comprises receiving images captured by a camera installed on a vehicle, and wherein auxiliary data captured by sensors installed on the vehicle is received with the images.
3. The method of claim 2, further comprising:
receiving coordinates in the context of each of the series of images of each bounding box, wherein determining the indication of which of the embeddings correspond to the human comprises using the coordinates in addition to the embeddings.
4. The method of claim 3, wherein determining the indication of which of the embeddings correspond to the human further comprises using the auxiliary data.
5. The method of claim 1, wherein each respective embedding acts as a fingerprint that tracks its respective human without assigning an identity to the respective human.
6. The method of claim 1, wherein each human of the plurality of humans is a vulnerable road user (VRU).
7. The method of claim 1, wherein determining the indication of which of the embeddings correspond to the human further comprises receiving, as part of the output, a confidence score corresponding to a confidence that each given embedding corresponds to its indicated cluster.
8. A non-transitory computer-readable medium comprising memory with instructions encoded thereon, the instructions causing one or more processors to perform operations when executed, the instructions comprising instructions to:
capture a series of images comprising a plurality of humans, a human of the plurality of humans at least partially occluded in at least some images of the series of images;
determine one or more respective bounding boxes for each respective image of the series of images, each respective bounding box for a respective human within the respective image;
input each bounding box into a multi-task model;
receive as output from the multi-task model an embedding for each bounding box, the embedding produced from a shared layer of the multi-task model, the multi-task model comprising the shared layer and a plurality of branches each trained to predict a different activity, wherein the shared layer is trained using backpropagation from the plurality of branches; and
determine, using the embeddings for each bounding box across the series of images, an indication of which of the embeddings correspond to the human as opposed to a different human of the plurality of humans despite partial occlusion of the human.
9. The non-transitory computer-readable medium of claim 8, wherein the instructions to capture the series of images comprises receiving images captured by a camera installed on a vehicle, and wherein auxiliary data captured by sensors installed on the vehicle is received with the images.
10. The non-transitory computer-readable medium of claim 9, the instructions further comprising instructions to:
receive coordinates in the context of each of the series of images of each bounding box, wherein determining the indication of which of the embeddings correspond to the human comprises using the coordinates in addition to the embeddings.
11. The non-transitory computer-readable medium of claim 10, wherein the instructions to determine the indication of which of the embeddings correspond to the human further comprise instructions to use the auxiliary data.
12. The non-transitory computer-readable medium of claim 8, wherein each respective embedding acts as a fingerprint that tracks its respective human without assigning an identity to the respective human.
13. The non-transitory computer-readable medium of claim 8, wherein each human of the plurality of humans is a vulnerable road user (VRU).
14. The non-transitory computer-readable medium of claim 8, wherein the instructions to determine the indication of which of the embeddings correspond to the human further comprise instructions to receive, as part of the output, a confidence score corresponding to a confidence that each given embedding corresponds to its indicated cluster.
15. A system comprising:
a non-transitory computer-readable medium comprising memory with instructions encoded thereon; and
one or more processors that, when executing the instructions, are caused to perform operations, the operations comprising:
capturing a series of images comprising a plurality of humans, a human of the plurality of humans at least partially occluded in at least some images of the series of images;
determining one or more respective bounding boxes for each respective image of the series of images, each respective bounding box for a respective human within the respective image;
inputting each bounding box into a multi-task model;
receiving as output from the multi-task model an embedding for each bounding box, the embedding produced from a shared layer of the multi-task model, the multi-task model comprising the shared layer and a plurality of branches each trained to predict a different activity, wherein the shared layer is trained using backpropagation from the plurality of branches; and
determining, using the embeddings for each bounding box across the series of images, an indication of which of the embeddings correspond to the human as opposed to a different human of the plurality of humans despite partial occlusion of the human.
16. The system of claim 15, wherein capturing the series of images comprises receiving images captured by a camera installed on a vehicle, and wherein auxiliary data captured by sensors installed on the vehicle is received with the images.
17. The system of claim 16, the operations further comprising:
receiving coordinates in the context of each of the series of images of each bounding box, wherein determining the indication of which of the embeddings correspond to the human comprises using the coordinates in addition to the embeddings.
18. The system of claim 17, wherein determining the indication of which of the embeddings correspond to the human further comprises using the auxiliary data.
19. The system of claim 15, wherein each respective embedding acts as a fingerprint that tracks its respective human without assigning an identity to the respective human.
20. The system of claim 15, wherein each human of the plurality of humans is a vulnerable road user (VRU).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/215,075 US20230343062A1 (en) | 2020-04-24 | 2023-06-27 | Tracking users across image frames using fingerprints obtained from image analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/857,645 US11734907B2 (en) | 2020-04-24 | 2020-04-24 | Tracking vulnerable road users across image frames using fingerprints obtained from image analysis |
US18/215,075 US20230343062A1 (en) | 2020-04-24 | 2023-06-27 | Tracking users across image frames using fingerprints obtained from image analysis |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/857,645 Continuation US11734907B2 (en) | 2020-04-24 | 2020-04-24 | Tracking vulnerable road users across image frames using fingerprints obtained from image analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230343062A1 true US20230343062A1 (en) | 2023-10-26 |
Family
ID=75497968
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/857,645 Active 2040-06-20 US11734907B2 (en) | 2020-04-24 | 2020-04-24 | Tracking vulnerable road users across image frames using fingerprints obtained from image analysis |
US18/215,075 Pending US20230343062A1 (en) | 2020-04-24 | 2023-06-27 | Tracking users across image frames using fingerprints obtained from image analysis |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/857,645 Active 2040-06-20 US11734907B2 (en) | 2020-04-24 | 2020-04-24 | Tracking vulnerable road users across image frames using fingerprints obtained from image analysis |
Country Status (4)
Country | Link |
---|---|
US (2) | US11734907B2 (en) |
EP (1) | EP4139832A1 (en) |
JP (1) | JP7450754B2 (en) |
WO (1) | WO2021214542A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3706034A1 (en) * | 2019-03-06 | 2020-09-09 | Robert Bosch GmbH | Movement prediction of pedestrians useful for autonomous driving |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210201052A1 (en) * | 2019-12-27 | 2021-07-01 | Valeo North America, Inc. | Method and apparatus for predicting intent of vulnerable road users |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8315965B2 (en) * | 2008-04-22 | 2012-11-20 | Siemens Corporation | Method for object detection |
US8811670B2 (en) | 2012-09-28 | 2014-08-19 | The Boeing Company | Method and system for using fingerprints to track moving objects in video |
JP6833617B2 (en) | 2017-05-29 | 2021-02-24 | 株式会社東芝 | Mobile tracking equipment, mobile tracking methods and programs |
CN111133447B (en) | 2018-02-18 | 2024-03-19 | 辉达公司 | Method and system for object detection and detection confidence for autonomous driving |
US11132512B2 (en) * | 2019-11-08 | 2021-09-28 | International Business Machines Corporation | Multi-perspective, multi-task neural network model for matching text to program code |
US11531088B2 (en) * | 2019-11-21 | 2022-12-20 | Nvidia Corporation | Deep neural network for detecting obstacle instances using radar sensors in autonomous machine applications |
US20210286989A1 (en) * | 2020-03-11 | 2021-09-16 | International Business Machines Corporation | Multi-model, multi-task trained neural network for analyzing unstructured and semi-structured electronic documents |
-
2020
- 2020-04-24 US US16/857,645 patent/US11734907B2/en active Active
-
2021
- 2021-03-12 JP JP2022564115A patent/JP7450754B2/en active Active
- 2021-03-12 WO PCT/IB2021/000139 patent/WO2021214542A1/en unknown
- 2021-03-12 EP EP21718643.6A patent/EP4139832A1/en active Pending
-
2023
- 2023-06-27 US US18/215,075 patent/US20230343062A1/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210201052A1 (en) * | 2019-12-27 | 2021-07-01 | Valeo North America, Inc. | Method and apparatus for predicting intent of vulnerable road users |
Also Published As
Publication number | Publication date |
---|---|
JP7450754B2 (en) | 2024-03-15 |
JP2023522390A (en) | 2023-05-30 |
US20210334982A1 (en) | 2021-10-28 |
US11734907B2 (en) | 2023-08-22 |
WO2021214542A1 (en) | 2021-10-28 |
EP4139832A1 (en) | 2023-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11928965B2 (en) | Lane line reconstruction using future scenes and trajectory | |
JP7052663B2 (en) | Object detection device, object detection method and computer program for object detection | |
US10970871B2 (en) | Estimating two-dimensional object bounding box information based on bird's-eye view point cloud | |
Geiger et al. | 3d traffic scene understanding from movable platforms | |
Keller et al. | The benefits of dense stereo for pedestrian detection | |
US11308717B2 (en) | Object detection device and object detection method | |
CN108334081A (en) | Depth of round convolutional neural networks for object detection | |
US10776642B2 (en) | Sampling training data for in-cabin human detection from raw video | |
Ciberlin et al. | Object detection and object tracking in front of the vehicle using front view camera | |
US20230343062A1 (en) | Tracking users across image frames using fingerprints obtained from image analysis | |
Palazzo et al. | Domain adaptation for outdoor robot traversability estimation from RGB data with safety-preserving loss | |
US20120155711A1 (en) | Apparatus and method for analyzing video | |
Getahun et al. | A deep learning approach for lane detection | |
Valdenegro-Toro | I find your lack of uncertainty in computer vision disturbing | |
Santos et al. | Car recognition based on back lights and rear view features | |
CN115699103A (en) | Method and device for predicting behaviors by using interpretable self-focusing attention | |
Wachs et al. | Human posture recognition for intelligent vehicles | |
Riera et al. | Detecting and tracking unsafe lane departure events for predicting driver safety in challenging naturalistic driving data | |
Shirpour et al. | Driver's Eye Fixation Prediction by Deep Neural Network. | |
CN114445787A (en) | Non-motor vehicle weight recognition method and related equipment | |
Pan et al. | CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow | |
US20220381566A1 (en) | Techniques for detecting a tracking vehicle | |
CN116935074B (en) | Multi-target tracking method and device based on adaptive association of depth affinity network | |
US20230033243A1 (en) | Systems and methods for object proximity monitoring around a vehicle | |
US20240153107A1 (en) | System and method for three-dimensional multi-object tracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: HUMANISING AUTONOMY LIMITED, GREAT BRITAIN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PRADEEP, YAZHINI CHITRA;EL YOUSSOUFI, WASSIM;NOY, DOMINIC;AND OTHERS;SIGNING DATES FROM 20200423 TO 20200424;REEL/FRAME:065209/0627 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |