CA2934102A1 - A system and a method for tracking mobile objects using cameras and tag devices - Google Patents


Info

Publication number
CA2934102A1
Authority
CA
Canada
Prior art keywords
tag
mobile object
tracking
ffc
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA2934102A
Other languages
French (fr)
Inventor
Jorgen Nielsen
Phillip Richard GEE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Appropolis Inc
Original Assignee
Appropolis Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/866,499 (US10088549B2)
Priority claimed from US14/997,977 (US20160379074A1)
Application filed by Appropolis Inc
Publication of CA2934102A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/292 Multi-camera tracking
    • G06T 7/254 Analysis of motion involving subtraction of images
    • G06T 7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 19/00 Satellite radio beacon positioning systems; Determining position, velocity or attitude using signals transmitted by such systems
    • G01S 19/38 Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system
    • G01S 19/39 Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system the satellite radio beacon positioning system transmitting time-stamped messages, e.g. GPS [Global Positioning System], GLONASS [Global Orbiting Navigation Satellite System] or GALILEO
    • G01S 19/42 Determining position
    • G01S 19/48 Determining position by combining or switching between position solutions derived from the satellite radio beacon positioning system and position solutions derived from a further system
    • G01S 19/485 Determining position by combining or switching between position solutions derived from the satellite radio beacon positioning system and position solutions derived from a further system whereby the further system is an optical system or imaging system
    • G01S 5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/02 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using radio waves
    • G01S 5/0257 Hybrid positioning
    • G01S 5/0263 Hybrid positioning by combining or switching between positions derived from two or more separate positioning systems
    • G01S 5/0264 Hybrid positioning by combining or switching between positions derived from two or more separate positioning systems at least one of the systems being a non-radio wave positioning system

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for tracking mobile objects in a site are disclosed. The system comprises a computer cloud communicating with one or more imaging devices and one or more tag devices. Each tag device is attached to a mobile object, and has one or more sensors for sensing the motion of the mobile object. The computer cloud visually tracks mobile objects in the site using image streams captured by the imaging devices, and uses measurements obtained from tag devices to resolve ambiguity that occurs in mobile object tracking. The computer cloud uses an optimization method to reduce power consumption of tag devices.

Description

A SYSTEM AND A METHOD FOR TRACKING MOBILE OBJECTS USING CAMERAS AND TAG DEVICES

FIELD OF THE DISCLOSURE
The present invention relates generally to a system and a method for tracking mobile objects, and in particular, to a system and a method for tracking mobile objects using cameras and tag devices.

BACKGROUND
Outdoor mobile object tracking such as the Global Positioning System (GPS) is known. In the GPS system of the U.S.A. or similar systems such as the GLONASS system of Russia, the Doppler Orbitography and Radio-positioning Integrated by Satellite (DORIS) of France, the Galileo system of the European Union and the BeiDou system of China, a plurality of satellites in Earth orbit communicate with a mobile device in an outdoor environment to determine the location thereof. However, a drawback of these systems is that the satellite communication generally requires line-of-sight communication between the satellites and the mobile device, and thus they are generally unusable in indoor environments, except in restricted areas adjacent to windows and open doors.
Some indoor mobile object tracking methods and systems are also known.
For example, in the Bluetooth Low Energy (BLE) technology, such as the iBeacon™ technology specified by Apple Inc. of Cupertino, CA, U.S.A. or Samsung's Proximity™, a plurality of BLE access points are deployed in a site and communicate with nearby mobile BLE devices such as smartphones for locating the mobile BLE devices using triangulation. Indoor WiFi signals are also becoming ubiquitous and are commonly used for object tracking based on radio signal strength (RSS) observables. However, the mobile object tracking accuracy of these systems still needs improvement. Moreover, these systems can only track the location of a mobile object; other information, such as gestures of a person being tracked, cannot be determined by these systems.
It is therefore an object to provide a novel mobile object tracking system and method with higher accuracy and robustness, and that provides more information about the mobile objects being tracked.

SUMMARY
There is a plethora of applications that require estimating the location of a mobile device or a person in an indoor environment or in a dense urban outdoor environment. According to one aspect of this disclosure, an object tracking system and a method are disclosed for tracking mobile objects in a site, such as a campus, a building, a shopping center or the like.
Herein, mobile objects are moveable objects in the site, such as human beings, animals, carts, wheelchairs, robots and the like, and may be moving or stationary from time to time, usually in a random fashion from a statistical point of view.

According to another aspect of this disclosure, visual tracking in combination with tag devices is used for tracking mobile objects in the site. One or more imaging devices, such as one or more cameras, are used for intermittently or continuously visually tracking the locations of one or more mobile objects using suitable image processing technologies. One or more tag devices attached to mobile objects may also be used for refining object tracking and for resolving ambiguity that occurs in visual tracking of mobile objects.
As will be described in more detail later, ambiguity in visual object tracking herein includes a variety of situations that render visual object tracking less reliable or even unreliable.
Each tag device is a uniquely identifiable, small electronic device attached to a mobile object of interest and moving therewith, undergoing the same physical motion. However, some mobile objects may not have any tag device attached thereto.
Each tag device comprises one or more sensors, and is battery powered and operable for an extended period of time, e.g., several weeks, between battery charges or replacements. The tag devices communicate with one or more processing structures, such as one or more processing structures of one or more server computers, e.g., a so-called computer cloud, using suitable wireless communication methods. Upon receiving a request signal from the computer cloud, a tag device uses its sensors to make measurements or observations of the mobile object associated therewith, and transmits these measurements wirelessly to the system. For example, a tag device may make measurements of the characteristics of its own physical motion. As the tag devices undergo the same physical motion as their associated mobile objects, the measurements made by the tag devices represent the motion measurements of their associated mobile objects.
According to another aspect of this disclosure, the object tracking system comprises a computer cloud having one or more servers, communicating with one or more imaging devices deployed in a site for visually detecting and tracking moving and stationary mobile objects in the site.
The computer cloud accesses suitable image processing technologies to detect foreground objects, denoted as foreground feature clusters (FFCs), from images or image frames captured by the imaging devices, each FFC representing a candidate mobile object in the field of view (FOV) of the imaging device. The computer cloud then identifies and tracks the FFCs.
When ambiguity occurs in identifying and tracking FFCs, the computer cloud requests one or more candidate tag devices to make necessary tag measurements. The computer cloud uses the tag measurements to resolve any ambiguity and associates FFCs with tag devices for tracking.
According to another aspect of this disclosure, when associating FFCs with tag devices, the computer cloud calculates an FFC-tag association probability, indicating the correctness, reliability or belief in the determined association. In this embodiment, the FFC-tag association probability is numerically calculated, e.g., by using a suitable numerical method to find a numerical approximation of the FFC-tag association probability. The FFC-tag association probability is constantly updated as new images and/or tag measurements are made available to the system. The computer cloud attempts to maintain the FFC-tag association probability at or above a predefined probability threshold. If the FFC-tag association probability falls below the probability threshold, more tag measurements are requested. The tag devices, upon request, make the requested measurements and send them to the computer cloud for establishing the FFC-tag association.
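By way of illustration only, the following Python sketch shows one plausible way, not necessarily the disclosed implementation, of maintaining such an association probability recursively and flagging when further tag measurements should be requested; the function names, the per-frame likelihood inputs and the threshold value are assumptions introduced for the example.

```python
# Illustrative sketch only: recursively update an FFC-tag association
# probability from per-frame likelihoods and flag when more tag
# measurements should be requested. Names, inputs and the threshold are
# assumptions, not the patented implementation.

THRESHOLD = 0.9  # predefined probability threshold (assumed value)

def update_association(prior: dict, likelihoods: dict) -> dict:
    """Bayes-style update of P(tag is associated with FFC_i).

    prior       -- {ffc_id: probability} carried over from the previous frame
    likelihoods -- {ffc_id: P(new image/tag evidence | association)}
    """
    posterior = {f: prior.get(f, 0.0) * likelihoods.get(f, 1e-9)
                 for f in set(prior) | set(likelihoods)}
    norm = sum(posterior.values()) or 1.0
    return {f: p / norm for f, p in posterior.items()}

def needs_tag_measurement(posterior: dict) -> bool:
    """Request more tag measurements if no pairing is confident enough."""
    return max(posterior.values(), default=0.0) < THRESHOLD

# Example: two candidate FFCs, new evidence slightly favouring FFC "A".
belief = {"A": 0.5, "B": 0.5}
belief = update_association(belief, {"A": 0.7, "B": 0.3})
print(belief, needs_tag_measurement(belief))
```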
Like any other system, the system disclosed herein operates under constraints such as power consumption. Generally, the overall power consumption of the system comprises the power consumed by the tag devices in making tag measurements and the power consumed by other components of the system, including the computer cloud and the imaging devices. While the computer cloud and the imaging devices are usually powered by relatively unlimited sources of power, tag devices are usually powered by batteries having limited stored energy. Therefore, it is desirable, although optional in some embodiments, to manage the power consumption of tag devices during mobile object tracking by using low power consumption components known in the art, and by only triggering tag devices to conduct measurements when actually needed.
Therefore, according to another aspect of this disclosure, at least in some embodiments, the system is designed using a constrained optimization algorithm with an objective of minimizing tag device energy consumption subject to a constraint on the probability of correctly associating the tag device with an FFC. The system achieves this objective by requesting tag measurements only when necessary, and by determining the candidate tag devices for providing the required tag measurements.
When requesting tag measurements, the computer cloud first determines a group of candidate tag devices based on the analysis of captured images, and determines the required tag measurements based on the analysis of captured images and the knowledge of the power consumption for making the tag measurements. The computer cloud then only requests the required tag measurements from the candidate tag devices.
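Purely as a hedged sketch of how such an energy-aware selection could be organized, the following greedy procedure keeps adding the cheapest candidate measurement until a target association probability is predicted to be met; the candidate names, energy costs and probability gains are hypothetical.

```python
# Hedged sketch of energy-aware measurement selection: greedily pick the
# cheapest tag measurements until the predicted association probability
# meets the target. Costs and gains below are hypothetical illustrations.

def select_measurements(candidates, current_prob, target_prob):
    """candidates: list of (name, energy_cost_mJ, expected_prob_gain)."""
    chosen, prob = [], current_prob
    for name, cost, gain in sorted(candidates, key=lambda c: c[1]):  # cheapest first
        if prob >= target_prob:
            break
        chosen.append((name, cost))
        prob = min(1.0, prob + gain)
    return chosen, prob

candidates = [
    ("accelerometer_burst", 2.0, 0.15),   # hypothetical figures
    ("gyroscope_burst",     3.5, 0.10),
    ("rss_scan",            8.0, 0.25),
]
print(select_measurements(candidates, current_prob=0.6, target_prob=0.9))
```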
One objective of the object tracking system is to visually track mobile objects and to use measurements from tag devices attached to mobile objects to resolve ambiguity that occurs in visual object tracking. The system tracks the locations of mobile objects having tag devices attached thereto, and optionally and if possible, tracks mobile objects having no tag devices attached thereto. The object tracking system is the combination of:
1) Computer vision processing to visually track the mobile objects as they move throughout the site;
2) Wireless messaging between the tag device and the computer cloud to establish the unique identity of each tag device; herein, wireless messaging refers to any suitable wireless messaging means such as messaging via electromagnetic wave, optical means, acoustic telemetry, and the like;
3) Motion related observations or measurements registered by various sensors in tag devices, communicated wirelessly to the computer cloud; and
4) Cloud or network based processing to correlate the measurements of motion and actions of the tag devices with the computer vision based motion estimation and characterization of mobile objects, such that the association of the tag devices and the mobile objects observed by the imaging devices can be quantified through a computed probability of such association.
The object tracking system combines the tracking ability of imaging devices with that of tag devices for associating a unique identity to the mobile object being tracked. Thereby, the system can also distinguish between objects that appear similar, being differentiated by the tag. In another aspect, if some tag devices are associated with the identities of the mobile objects to which they are attached, the object tracking system can further identify the identities of the mobile objects and track them.
In contradistinction, known visual object tracking technologies using imaging devices can associate a unique identity to the mobile object being tracked only if the image of the mobile object has at least one unique visual feature such as an identification mark, e.g., an artificial mark, or a biometric mark, e.g., a facial feature, which may be identified by computer vision processing methods such as face recognition. Such detailed visual identity recognition is not always available or economically feasible.

According to one aspect of this disclosure, there is provided a system for tracking at least one mobile object in a site. The system comprises: one or more imaging devices capturing images of at least a portion of the site; one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices obtaining one or more tag measurements related to the mobile object associated therewith; and at least one processing structure combining the captured images with at least one of the one or more tag measurements for tracking the at least one mobile object.
In some embodiments, each of the one or more tag devices comprises one or more sensors for obtaining the one or more tag measurements.
In some embodiments, the one or more sensors comprise at least one of an Inertial Measurement Unit (IMU), a barometer, a thermometer, a magnetometer, a global navigation satellite system (GNSS) sensor, an audio frequency microphone, a light sensor, a camera, and a receiver signal strength (RSS) measurement sensor.
In some embodiments, the RSS measurement sensor is a sensor for measuring the signal strength of a wireless signal received from a transmitter, for estimating the distance from the transmitter.
In some embodiments, the wireless signal is at least one of a Bluetooth signal and a WiFi signal.
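For illustration, a commonly used (though not necessarily the disclosed) way to convert such an RSS reading into a coarse distance estimate is the log-distance path-loss model sketched below; the reference power at 1 m and the path-loss exponent are assumed placeholder values.

```python
# Illustrative log-distance path-loss model for turning a BLE/WiFi RSS
# reading into a coarse range estimate. Parameter values are assumptions.
import math

def rss_to_distance(rss_dbm: float,
                    tx_power_dbm: float = -59.0,   # RSS at 1 m (assumed)
                    path_loss_exponent: float = 2.0) -> float:
    """Estimate distance (metres) from received signal strength."""
    return 10.0 ** ((tx_power_dbm - rss_dbm) / (10.0 * path_loss_exponent))

print(round(rss_to_distance(-71.0), 2))  # roughly 4 m under these assumptions
```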
In some embodiments, the at least one processing structure analyzes images captured by the one or more imaging devices for determining a set of candidate tag devices for providing said at least one of the one or more tag measurements.
In some embodiments, the at least one processing structure analyzes images captured by the one or more imaging devices for selecting said at least one of the one or more tag measurements.
In some embodiments, each of the tag devices provides the at least one of the one or more tag measurements to the at least one processing structure only when said tag device receives from the at least one processing structure a request for providing the at least one of the one or more tag measurements.
In some embodiments, each of the tag devices, when receiving from the at least one processing structure a request for providing the at least one of the one or more tag measurements, only provides the requested at least one of the one or more tag measurements to the at least one processing structure.
In some embodiments, the at least one processing structure identifies from the captured images one or more foreground feature clusters (FFCs) for tracking the at least one mobile object.
In some embodiments, the at least one processing structure determines a bounding box for each FFC.
In some embodiments, the at least one processing structure determines a tracking point for each FFC.
In some embodiments, for each FFC, the at least one processing structure determines a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
In some embodiments, the at least one processing structure associates each tag device with one of the FFCs.
In some embodiments, when associating a tag device with an FFC, the at least one processing structure calculates an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
In some embodiments, said FFC-tag association probability is calculated based on a set of consecutively captured images.

In some embodiments, said FFC-tag association probability is calculated by finding a numerical approximation thereof.
In some embodiments, when associating a tag device with an FFC, the at least one processing structure executes a constrained optimization algorithm for minimizing the energy consumption of the one or more tag devices while maintaining the FFC-tag association probability above a target value.
In some embodiments, when associating a tag device with an FFC, the at least one processing structure calculates a tag-image correlation between the tag measurements and the analysis results of the captured images.
In some embodiments, the tag measurements for calculating said tag-image correlation comprise measurements obtained from an IMU.
In some embodiments, the tag measurements for calculating said tag-image correlation comprise measurements obtained from at least one of an accelerometer, a gyroscope and a magnetometer, for calculating a correlation between the tag measurements and the analysis results of the captured images to determine whether a mobile object is changing its moving direction.
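As an illustrative sketch only, such a tag-image correlation could compare the turn rate reported by the tag's gyroscope with the heading change of the corresponding FFC derived from its image-plane track; all function names and the synthetic data below are assumptions, not the disclosed implementation.

```python
# Hedged sketch: correlate a tag's gyroscope yaw rate with the heading
# change of an FFC computed from its image-plane track, to test whether
# the tagged object is the one changing direction. Data are synthetic.
import numpy as np

def heading_rate_from_track(xy: np.ndarray, dt: float) -> np.ndarray:
    """Per-frame heading change rate (rad/s) from an (N, 2) pixel track."""
    headings = np.arctan2(np.diff(xy[:, 1]), np.diff(xy[:, 0]))
    return np.diff(np.unwrap(headings)) / dt

def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
    n = min(len(a), len(b))
    a, b = a[:n] - a[:n].mean(), b[:n] - b[:n].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) or 1.0
    return float(a @ b / denom)

dt = 1.0 / 30.0                              # assumed 30 fps
t = np.arange(0, 4, dt)
yaw_rate = np.where(t < 2, 0.0, 0.8)         # straight, then turning at 0.8 rad/s
headings = np.cumsum(yaw_rate) * dt
track = np.c_[np.cumsum(np.cos(headings)), np.cumsum(np.sin(headings))]
gyro_yaw_rate = yaw_rate[2:]                 # what the tag's gyroscope would report
print(normalized_correlation(heading_rate_from_track(track, dt), gyro_yaw_rate))
```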
In some embodiments, the at least one processing structure maintains a background image for each of the one or more imaging devices.
In some embodiments, when detecting FFCs from each of the captured images, the at least one processing structure generates a difference image by calculating the difference between the captured image and the corresponding background image, and detects one or more FFCs from the difference image.
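A minimal sketch of this kind of difference-image FFC detection, written here with OpenCV (the disclosure does not mandate any particular library), might look as follows; the threshold, minimum area and morphological clean-up are illustrative choices.

```python
# Hedged sketch of FFC detection via background subtraction: compute a
# difference image against a maintained background, threshold it, and
# return bounding boxes with a bottom-edge tracking point (BBTP).
# Uses OpenCV 4; threshold and area values are illustrative assumptions.
import cv2
import numpy as np

def detect_ffcs(frame_gray: np.ndarray, background_gray: np.ndarray,
                min_area: int = 500):
    diff = cv2.absdiff(frame_gray, background_gray)           # difference image
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    ffcs = []
    for c in contours:
        if cv2.contourArea(c) < min_area:                     # discard small clusters
            continue
        x, y, w, h = cv2.boundingRect(c)
        bbtp = (x + w // 2, y + h)            # tracking point on the bottom edge
        ffcs.append({"bbox": (x, y, w, h), "bbtp": bbtp})
    return ffcs
```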

In some embodiments, when detecting one or more FFCs from the difference image, the at least one processing structure mitigates shadow from each of the one or more FFCs.
In some embodiments, after detecting the one or more FFCs, the at least one processing structure determines the location of each of the one or more FFCs in the captured image, and maps each of the one or more FFCs to a three-dimensional (3D) coordinate system of the site by using perspective mapping.
In some embodiments, the at least one processing structure stores a 3D map of the site for mapping each of the one or more FFCs to the 3D coordinate system of the site, and wherein, in said map, the site includes one or more areas, and each of the one or more areas has a horizontal, planar floor.
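As a hedged illustration of such a planar-floor perspective mapping, a homography computed from four calibration correspondences can project a tracking point from image pixels onto the floor plane of the site; the calibration points below are made-up values.

```python
# Hedged sketch: map a bounding-box tracking point from image pixels to
# site floor coordinates with a homography, assuming a planar floor.
# Calibration correspondences are made-up example values.
import numpy as np
import cv2

# Four image points (pixels) and their known floor coordinates (metres).
img_pts = np.float32([[100, 700], [1180, 690], [900, 300], [350, 310]])
floor_pts = np.float32([[0.0, 0.0], [8.0, 0.0], [8.0, 12.0], [0.0, 12.0]])
H = cv2.getPerspectiveTransform(img_pts, floor_pts)

def bbtp_to_floor(bbtp_xy):
    """Project a BBTP (u, v) in the image onto the floor plane (x, y)."""
    p = np.array([bbtp_xy[0], bbtp_xy[1], 1.0])
    x, y, w = H @ p
    return x / w, y / w

print(bbtp_to_floor((640, 500)))
```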
In some embodiments, the at least one processing structure tracks at least one of the one or more FFCs based on the velocity thereof determined from the captured images.
In some embodiments, each FFC corresponds to a mobile object, and wherein the at least one processing structure tracks the FFCs using a first order Markov process.
In some embodiments, the at least one processing structure tracks the FFCs using a Kalman filter with a first order Markov Gaussian process.
In some embodiments, when tracking each of the FFCs, the at least one processing structure uses the coordinates of the corresponding mobile object in a 3D coordinate system of the site as state variables, and the coordinates of the FFC in a two-dimensional (2D) coordinate system of the captured images as observations for the state variables, and wherein the at least one processing structure maps the coordinates of the corresponding mobile object in the 3D coordinate system of the site to the 2D coordinate system of the captured images.
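A minimal sketch of this formulation is given below: the state is the mobile object's floor-plane position and velocity (a 2D simplification of the site coordinates described above) under a first order, constant-velocity Gaussian motion model, and the measurement is the FFC's pixel coordinates, related to the state through a planar homography and linearized numerically in an extended Kalman filter. The homography and noise levels are made-up placeholders, not values from the disclosure.

```python
# Hedged EKF sketch for one mobile object: state = floor-plane position and
# velocity in site coordinates, motion = constant-velocity Gaussian model,
# measurement = the FFC's pixel coordinates via a planar homography.
# H_cam and all noise levels are made-up placeholders.
import numpy as np

dt = 1.0 / 30.0                                # assumed 30 fps
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)      # constant-velocity model
Q = np.eye(4) * 0.01                           # process noise (placeholder)
R = np.eye(2) * 4.0                            # pixel noise (placeholder)
H_cam = np.array([[120.0, -5.0, 300.0],
                  [3.0, -80.0, 700.0],
                  [0.0, -0.01, 1.0]])          # floor-to-pixel homography (made up)

def h(x):
    """Project the floor position (x, y) into pixel coordinates."""
    p = H_cam @ np.array([x[0], x[1], 1.0])
    return p[:2] / p[2]

def jacobian(x, eps=1e-5):
    J = np.zeros((2, 4))
    for i in range(2):                         # only position affects pixels
        d = np.zeros(4); d[i] = eps
        J[:, i] = (h(x + d) - h(x - d)) / (2 * eps)
    return J

def ekf_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q              # predict
    J = jacobian(x)
    S = J @ P @ J.T + R
    K = P @ J.T @ np.linalg.inv(S)             # Kalman gain
    x = x + K @ (z - h(x))                     # update with pixel measurement
    P = (np.eye(4) - K @ J) @ P
    return x, P

x, P = np.array([1.0, 2.0, 0.0, 0.0]), np.eye(4)
x, P = ekf_step(x, P, z=np.array([418.0, 545.0]))
print(x[:2])
```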
In some embodiments, the at least one processing structure discretizes at least a portion of the site into a plurality of grid points, and wherein, when tracking a mobile object in said discretized portion of the site, the at least one processing structure uses said grid points for approximating the location of the mobile object.
In some embodiments, when tracking a mobile object in said discretized portion of the site, the at least one processing structure calculates a posterior position probability of the mobile object.
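A hedged sketch of such a grid-based posterior update, implemented as a simple histogram Bayes filter (not necessarily the exact disclosed algorithm), is shown below; the grid resolution, motion spread and Gaussian measurement likelihood are illustrative choices.

```python
# Hedged sketch of a grid (histogram) Bayes filter: the site portion is
# discretized into grid points, the position belief is diffused by a
# simple motion model, then re-weighted by a measurement likelihood.
# Grid resolution, spread and likelihood shape are illustrative choices.
import numpy as np
from scipy.ndimage import gaussian_filter

GRID = (40, 60)                     # grid points covering the site portion

def predict(belief, motion_spread=1.0):
    """Diffuse the belief according to a random-walk motion model."""
    belief = gaussian_filter(belief, sigma=motion_spread)
    return belief / belief.sum()

def update(belief, meas_rc, meas_sigma=2.0):
    """Re-weight by a Gaussian likelihood centred on the measured cell."""
    rows, cols = np.indices(GRID)
    d2 = (rows - meas_rc[0]) ** 2 + (cols - meas_rc[1]) ** 2
    likelihood = np.exp(-0.5 * d2 / meas_sigma ** 2)
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.full(GRID, 1.0 / (GRID[0] * GRID[1]))    # uniform prior
belief = update(predict(belief), meas_rc=(10, 25))
print(np.unravel_index(belief.argmax(), GRID))        # most likely grid point
```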
In some embodiments, the at least one processing structure identifies at least one mobile object from the captured images using biometric observations made from the captured images.
In some embodiments, the biometric observations comprise at least one of face characteristics and gait, and wherein the at least one processing structure makes the biometric observations using at least one of face recognition and gait recognition.
In some embodiments, at least a portion of the tag devices store a first ID for identifying the type of the associated mobile object.
In some embodiments, at least one of said tag devices is a smart phone.

In some embodiments, at least one of said tag devices comprises a microphone, and wherein the at least one processing structure uses tag measurements obtained from the microphone to detect at least one of room reverberation, background noise level and spectrum of noise, for establishing the FFC-tag association.
In some embodiments, at least one of said tag devices comprises a microphone, and wherein the at least one processing structure uses tag measurements obtained from the microphone to detect motion related sound, for establishing the FFC-tag association.
In some embodiments, said motion related sound comprises at least one of brushing of clothes against the microphone, the sound of a wheeled object wheeling over a floor surface, and the sound of an object sliding on a floor surface.
In some embodiments, one or more first tag devices broadcast an ultrasonic sound signature, and wherein at least a second tag device comprises a microphone for receiving and detecting the ultrasonic sound signature broadcast from said one or more first tag devices, for establishing the FFC-tag association.
In some embodiments, the one or more processing structures are processing structures of one or more computer servers.

According to another aspect of this disclosure, there is provided a method of tracking at least one mobile object in at least one visual field of view. The method comprises: capturing at least one image of the at least one visual field of view; identifying at least one candidate mobile object in the at least one image; obtaining one or more tag measurements from at least one tag device, each of said at least one tag device being associated with a mobile object and moveable therewith; and tracking at least one mobile object using the at least one image and the one or more tag measurements.
In some embodiments, the method further comprises: analyzing the at least one image for determining a set of candidate tag devices for providing said one or more tag measurements.
In some embodiments, the method further comprises: analyzing the at least one image for selecting said at least one of the one or more tag measurements.
In some embodiments, the method further comprises: identifying, from the at least one image, one or more foreground feature clusters (FFCs) for tracking the at least one mobile object, and determining a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
In some embodiments, the method further comprises: associating each tag device with one of the FFCs.
In some embodiments, the method further comprises: calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
In some embodiments, the method further comprises: tracking the FFCs using a first order Markov process.
In some embodiments, the method further comprises: discretizing at least a portion of the site into a plurality of grid points; and tracking a mobile object in said discretized portion of the site by using said grid points for approximating the location of the mobile object.

According to another aspect of this disclosure, there is provided a non-transitory, computer readable storage device comprising computer-executable instructions for tracking at least one mobile object in a site, wherein the instructions, when executed, cause a first processor to perform actions comprising: capturing at least one image of the at least one visual field of view; identifying at least one candidate mobile object in the at least one image; obtaining one or more tag measurements from at least one tag device, each of said at least one tag device being associated with a mobile object and moveable therewith; and tracking at least one mobile object using the at least one image and the one or more tag measurements.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: analyzing the at least one image for selecting said at least one of the one or more tag measurements.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: identifying, from the at least one image, one or more foreground feature clusters (FFCs) for tracking the at least one mobile object, and determining a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: associating each tag device with one of the FFCs.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
In some embodiments, the storage device further comprises computer-executable instructions, when executed, causing the one or more processing structures to perform actions comprising: discretizing at least a portion of the site into a plurality of grid points; and tracking a mobile object in said discretized portion of the site by using said grid points for approximating the location of the mobile object.
According to another aspect of this disclosure, there is provided a system for tracking at least one mobile object in a site. The system comprises: at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site and capturing images of at least a portion of the first subarea, the first subarea having at least a first entrance; one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith; and at least one processing structure for: determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance; and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object.
In some embodiments, the at least one processing structure builds a birds-eye view based on a map of the site, for mapping the at least one mobile object therein.
In some embodiments, said one or more initial conditions comprise data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
In some embodiments, the system further comprises: at least a second imaging device having an FOV overlapping a second subarea of the site and capturing images of at least a portion of the second subarea, the first and second subareas sharing the at least first entrance; and wherein the one or more initial conditions comprise data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.

In some embodiments, the first subarea comprises at least one obstruction in the FOV of the at least first imaging device; and wherein the at least one processing structure uses a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.

According to another aspect of this disclosure, there is provided a method for tracking at least one mobile object in a site. The method comprises: obtaining a plurality of images captured by at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site, the first subarea having at least a first entrance; obtaining tag measurements from one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith; determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance; and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object.
In some embodiments, the method further comprises: building a birds-eye view based on a map of the site, for mapping the at least one mobile object therein.
In some embodiments, the method further comprises: assembling said one or more initial conditions using data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
In some embodiments, the method further comprises: obtaining images captured by at least a second imaging device having an FOV overlapping a second subarea of the site, the first and second subareas sharing the at least first entrance; and assembling the one or more initial conditions using data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
In some embodiments, the first subarea comprises at least one obstruction in the FOV of the at least first imaging device; and the method further comprises: using a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.
According to another aspect of this disclosure, there is provided one or more non-transitory, computer readable media storing computer executable code for tracking at least one mobile object in a site. The computer executable code comprises computer executable instructions for: obtaining a plurality of images captured by at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site, the first subarea having walls and at least a first entrance; obtaining tag measurements from one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith; determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance; and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object.
In some embodiments, the computer executable code further comprises computer executable instructions for: building a birds-eye view based on a map of the site, for mapping the at least one mobile object therein.
In some embodiments, the computer executable code further comprises computer executable instructions for: assembling said one or more initial conditions using data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
In some embodiments, the computer executable code further comprises computer executable instructions for: obtaining images captured by at least a second imaging device having an FOV overlapping a second subarea of the site, the first and second subareas sharing the at least first entrance; and assembling the one or more initial conditions using data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
In some embodiments, the first subarea comprises at least one obstruction in the FOV of the at least first imaging device; and wherein the computer executable code further comprises computer executable instructions for: using a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.

BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic diagram showing an object tracking system deployed in a site, according to one embodiment;
Figure 2 is a schematic diagram showing the functional structure of the object tracking system of Fig. 1;
Figure 3 shows a foreground feature cluster (FFC) detected in a captured image;
Figure 4 is a schematic diagram showing the main function blocks of the system of Fig. 1 and the data flow therebetween;
Figures 5A and 5B illustrate connected flowcharts showing steps of a process of tracking mobile objects using a vision assisted hybrid location algorithm;
Figures 6A to 6D show steps of an example of establishing and tracking an FFC-tag association following the process of Figs. 5A and 5B;
Figure 7 is a schematic diagram showing the main function blocks of the system of Fig. 1 and the data flows therebetween, according to an alternative embodiment;
Figure 8 is a flowchart showing the detail of FFC detection, according to one embodiment;
Figures 9A to 9F show a visual representation of steps in an example of FFC detection;

Figure 10 shows a visual representation of an example of a difference image wherein the mobile object captured therein has a shadow;
Figure 11A is a three-dimensional (3D) perspective view of a portion of a site;
Figure 11B is a plan view of the site portion of Fig. 11A;
Figures 11C and 11D show the partition of the site portion of Figs. 11B and 11A, respectively;
Figures 11E and 11F show the calibration processing for establishing perspective mapping between the site portion of Fig. 11A and captured images;
Figure 12A shows a captured image of the site portion of Fig. 11A, the captured image having an FFC of a person detected therein;
Figure 12B is a plan view of the site portion of Fig. 11A with the FFC of Fig. 12A mapped thereto;
Figure 12C shows a sitemap having the site portion of Fig. 11A and the FFC of Fig. 12A mapped thereto;
Figure 13 shows a plot of the x-axis position of a bounding box tracking point (BBTP) of an FFC in captured images, wherein the vertical axis represents the BBTP's x-axis position (in pixels) in captured images, and the horizontal axis represents the image frame index;
Figure 14 is a flowchart showing the detail of mobile object tracking using an extended Kalman filter (EKF);
Figure 15A shows an example of two imaging devices CA and CB with overlapping fields of view (FOV) covering an L-shaped room;

Figure 15B shows a grid partitioning of the room of Fig. 15A;
Figure 16A shows an imaginary, one-dimensional room partitioned into six grid points;
Figure 16B is a state diagram for the imaginary room of Fig. 16A;
Figures 17A and 17B are graphs for a deterministic example, where a mobile object is moving left to right along the x-axis in the FOV of an imaging device, wherein Fig. 17A is a state transition diagram, and Fig. 17B shows a graph of simulation results;
Figures 18A to 18C show another example, where a mobile object is slewing to the right hand side along the x-axis in the FOV of an imaging device, wherein Fig. 18A is a state transition diagram, and Figs. 18B and 18C are graphs of simulation results of the mean and the standard deviation (STD) of the x- and y-coordinates of the mobile object, respectively;
Figure 19 is a schematic diagram showing the data flow for determining a state transition matrix;
Figures 20A to 20E show a visual representation of an example of merging/occlusion of two mobile objects;
Figures 21A to 21E show a visual representation of an example in which a mobile object is occluded by a background object;
Figure 22 shows a portion of the functional structure of a Visual Assisted Indoor Location System (VAILS), according to an alternative embodiment, the portion shown in Fig. 22 corresponding to the computer cloud of Fig. 2;

Figure 23 is a schematic diagram showing the association of a blob in a camera view, a BV object in a birds-eye view of the site and a tag device;
Figure 24 is a schematic illustration of an example site, which is divided into a number of rooms, with entrances/exits connecting the rooms;
Figure 25 is a schematic illustration showing a mobile object entering a room and moving therein;
Figure 26 is a schematic diagram showing data flow between the imaging device, camera view processing submodule, internal blob track file (IBTF), birds-eye view processing submodule, network arbitrator, external blob track file (EBTF) and object track file (OTF);
Figures 27A to 27D are schematic illustrations showing possibilities that may cause ambiguity;
Figure 28 is a schematic illustration showing an example in which a tagged mobile object moves in a room from a first entrance on the left-hand side of the room to the right-hand side thereof towards a second entrance, and an untagged object moves in the room from the second entrance on the right-hand side of the room to the left-hand side thereof towards the first entrance;
Figure 29 is a schematic diagram showing the relationship between the IBTF, EBTF, OTF, Tag Observable File (TOF) for storing tag observations, network arbitrator and tag devices;
Figure 30 is a schematic diagram showing information flow between camera views, birds-eye view and tag devices;

Figure 31 is a more detailed version of Fig. 30, showing information flow between camera views, birds-eye view and tag devices, and the function of the network arbitrator in the information flow;
Figure 32A shows an example of a type 3 blob having a plurality of sub-blobs;
Figure 32B is a diagram showing the relationship of the type 3 blob and its sub-blobs of Fig. 32A;
Figure 33 shows a timeline history diagram of a life span of a blob from its creation event to its annihilation event;
Figure 34 shows a timeline history diagram of the blobs of Fig. 28;
Figure 35A shows an example of a type 6 blob merged from two blobs;
Figure 35B is a diagram showing the relationship of the type 6 blob and its sub-blobs of Fig. 35A;
Figure 36A is a schematic illustration showing two tagged objects simultaneously entering a room from a same entrance and moving therein;
Figure 36B shows a timeline history diagram of a life span of a blob from its creation event to its annihilation event, for tracking two tagged objects simultaneously entering a room from a same entrance and moving therein with different speeds;
Figure 37A is a schematic illustration showing an example wherein a blob is split into two sub-blobs;

Figure 37B is a schematic illustration showing an example wherein a person enters a room, moves therein, and later pushes a cart to exit the room;
Figure 37C is a schematic illustration showing an example wherein a person enters a room, moves therein, sits down for a while, and then moves out of the room;
Figure 37D is a schematic illustration showing an example wherein a person enters a room, moves therein, sits down for a while at a location already having two persons sitting, and then moves out of the room;
Figure 38 is a table listing the object activities and the performances of the network arbitrator, camera view processing and tag devices that may be triggered by the corresponding object activities;
Figures 39A and 39B show two consecutive image frames, each having detected blobs;
Figure 39C shows the maximum correlation of the image frames of Figs. 39A and 39B;
Figure 40 shows an image frame having two blobs;
Figure 41A is a schematic illustration showing an example wherein a mobile object is moving in a room and is occluded by an obstruction therein;
Figure 41B is a schematic diagram showing data flow in tracking the mobile object of Fig. 41A;
Figure 42 shows a timeline history diagram of the blobs of Fig. 41A;
Figure 43 shows an alternative possibility that may give rise to the same camera view observations as Fig. 41A;

Figure 44 shows an example of a blob with a BBTP ambiguity region determined by the system;
Figures 45A and 45B show a BBTP in the camera view and mapped into the birds-eye view, respectively;
Figures 46A and 46B show an example of an ambiguity region of a BBTP (not shown) in the camera view and mapped into the birds-eye view, respectively;
Figure 47 shows a simulation configuration having an imaging device and an obstruction in the FOV of the imaging device;
Figure 48 shows the results of the DBN prediction of Fig. 47 without velocity feedback;
Figure 49 shows the prediction likelihood over time in tracking the mobile object of Fig. 47 without velocity feedback;
Figure 50 shows the results of the DBN prediction in tracking the mobile object of Fig. 47 with velocity feedback;
Figure 51 shows the prediction likelihood over time in tracking the mobile object of Fig. 47 with velocity feedback;
Figures 52A to 52C show another example of a simulation configuration, the simulated prediction likelihood without velocity feedback, and the simulated prediction likelihood with velocity feedback, respectively;
Figure 53A shows a simulation configuration for simulating the tracking of a first mobile object (not shown) with an interference object near the trajectory of the first mobile object and an obstruction between the imaging device and the trajectory;
Figure 53B shows the prediction likelihood of Fig. 53A;
Figures 54A and 54B show another simulation example of tracking a first mobile object (not shown) with an interference object near the trajectory of the first mobile object and an obstruction between the imaging device and the trajectory;
Figure 55 shows the initial condition flow and the output of the network arbitrator;
Figure 56 is a schematic illustration showing an example wherein two mobile objects move across a room but the imaging device therein reports only one mobile object exiting from an entrance on the right-hand side of the room;
Figure 57 shows another example, wherein the network arbitrator may delay the choice among candidate routes if the likelihoods of the candidate routes are still high, and make a choice when one candidate route exhibits sufficiently high likelihood;
Figure 58A is a schematic illustration showing an example wherein a mobile object moves across a room;
Figure 58B is a schematic diagram showing the initial condition flow and the output of the network arbitrator in the mobile object tracking example of Fig. 58A;
Figure 59 is a schematic illustration showing an example wherein a tagged object is occluded by an untagged object;

Figure 60 shows the relationship between the camera view processing submodule, the birds-eye view processing submodule, and the network arbitrator/tag devices;
Figure 61 shows a 3D simulation of a room having an indentation representing a portion of the room that is inaccessible to any mobile objects;
Figure 62 shows the prediction probability based on the arbitrary building wall constraints of Fig. 61, after a sufficient number of iterations to approximate a steady state;
Figures 63A and 63B show a portion of the MATLAB® code used in a simulation;
Figure 64 shows a portion of the MATLAB® code for generating a Gaussian shaped likelihood kernel;
Figures 65A to 65C show the plotting of the initial probability subject to the site map wall regions, the measurement probability kernel, and the probability after the measurement likelihood has been applied, respectively;
Figure 66 shows a steady state distribution reached in a simulation;
Figures 67A to 67D show the mapping between a world coordinate system and a camera coordinate system;
Figure 68A is an original picture used in a simulation;
Figure 68B is an image of the picture of Fig. 68A captured by an imaging device;
Figure 69 shows a portion of the MATLAB® code for correcting the distortion in Fig. 68B; and
Figure 70 shows the distortion-corrected image of Fig. 68B.

DETAILED DESCRIPTION
Glossary:
Global Positioning System (GPS)
Doppler Orbitography and Radio-positioning Integrated by Satellite (DORIS)
Bluetooth® Low Energy (BLE)
foreground feature clusters (FFCs)
field of view (FOV)
Inertial Measurement Unit (IMU)
global navigation satellite system (GNSS)
receiver signal strength (RSS)
two-dimensional (2D)
three-dimensional (3D)
bounding box tracking point (BBTP)
extended Kalman filter (EKF)
standard deviation (STD)
Visual Assisted Indoor Location System (VAILS)
internal blob track file (IBTF)
external blob track file (EBTF)
object track file (OTF)
Tag Observable File (TOF)
central processing units (CPUs)
input/output (I/O)
frames per second (fps)
personal data assistant (PDA)
universally unique identifier (UUID)
security camera system (SCS)
Radio-frequency identification (RFID)
probability density function (PDF)
mixture of Gaussians (MoG) model
singular value decomposition (SVD)
access point (AP)
standard deviation (STD) of the x- and y-coordinates of the mobile object, denoted as STDx and STDy
birds-eye view (BV)
camera view processing and birds-eye view processing (CV/BV)
camera view (CV) objects
birds-eye view (BV) objects

In the following, a method and system for tracking mobile objects in a site are disclosed. The system comprises one or more computer servers, e.g., a so-called computer cloud, communicating with one or more imaging devices and one or more tag devices. Each tag device is attached to a mobile object, and has one or more sensors for sensing the motion of the mobile object. The computer cloud visually tracks mobile objects in the site using image streams captured by the imaging devices, and uses measurements obtained from tag devices to resolve ambiguity that occurs in mobile object tracking. The computer cloud uses an optimization method to reduce power consumption of tag devices.

System Overview
Turning to Fig. 1, an object tracking system is shown, and is generally identified using numeral 100. The object tracking system 100 comprises one or more imaging devices 104, e.g., security cameras or other camera devices, deployed in a site 102, such as a campus, a building, a shopping center or the like.
Each imaging device 104 communicates with a computer network or cloud via suitable wired communication means 106, such as Ethernet, serial cable, parallel cable, USB cable, HDMI cable or the like, and/or via suitable wireless communication means such as Wi-Fi®, Bluetooth®, ZigBee®, 3G or 4G wireless telecommunications or the like. In this embodiment, the computer cloud 108 is also deployed in the site 102, and comprises one or more server computers 110 interconnected via the necessary communication infrastructure.
One or more mobile objects 112, e.g., one or more persons, enter the site 102, and may move to different locations therein. From time to time, some mobile objects 112 may be moving, and some other mobile objects 112 may be stationary. Each mobile object 112 is associated with a tag device 114 movable therewith. Each tag device 114 communicates with the computer cloud 108 via suitable wireless communication means 116, such as Wi-Fi®, Bluetooth®, ZigBee®, 3G or 4G wireless telecommunications, or the like. The tag devices 114 may also communicate with other nearby tag devices using suitable peer-to-peer wireless communication means 118. Some mobile objects may not have a tag device associated therewith, and such objects cannot benefit fully from the embodiments disclosed herein.
The computer cloud 108 comprises one or more server computers 110 connected via suitable wired communication means 106. As those skilled in the art understand, the server computers 110 may be any computing devices suitable for acting as servers. Typically, a server computer may comprise one or more processing structures such as one or more single-core or multiple-core central processing units (CPUs), memory, input/output (I/O) interfaces including suitable wired or wireless networking interfaces, and control circuits connecting various computer components. The CPUs may be, e.g., Intel® microprocessors offered by Intel Corporation of Santa Clara, CA, USA, AMD® microprocessors offered by Advanced Micro Devices of Sunnyvale, CA, USA, ARM microprocessors manufactured by a variety of manufacturers under the ARM architecture developed by ARM Ltd. of Cambridge, UK, or the like. The memory may be volatile and/or non-volatile, non-removable or removable memory such as RAM, ROM, EEPROM, solid-state memory, hard disks, CD, DVD, flash memory, or the like. The networking interfaces may be wired networking interfaces such as Ethernet interfaces, or wireless networking interfaces such as WiFi®, Bluetooth®, 3G or 4G mobile telecommunications, ZigBee®, or the like. In some embodiments, parallel ports, serial ports and USB connections may also be used for networking, although they are usually considered input/output interfaces for connecting input/output devices. The I/O interfaces may also comprise keyboards, computer mice, monitors, speakers and the like.
The imaging devices 104 are usually deployed in the site 102 covering most or all of the common traffic areas thereof, and/or other areas of interest. The imaging devices 104 capture images of the site 102 in their respective fields of view (FOVs). Images captured by each imaging device 104 may comprise the images of one or more mobile objects 112 within the FOV thereof.
Each captured image is sometimes called an image frame. Each 11 imaging device 104 captures images or image frames at a designated frame rate, 12 e.g., in some embodiments, 30 frames per second (fps), i.e., capturing 30 images 13 per second. Of course, those skilled in the art understand that the imaging devices 14 may capture image streams at other frame rates. The frame rate of an imaging device may be a predefined frame rate, or a frame rate adaptively designated by the computer cloud 108. In some embodiments, all imaging devices have the same 17 frame rate. In some other embodiments, imaging devices may have different frame 18 rate.
19 As the frame rate of each imaging device is known, each image frame is thus captured at a known time instant, and the time interval between each pair of consecutively captured image frames is also known. As will be described in more 22 detail later, the computer cloud 108 analyses captured image frames to detect and 23 track mobile objects. In some embodiments, the computer cloud 108 detects and =

=
1 tracks mobile objects in the FOV of each imaging device by individual analyzing 2 each image frame captured therefrom (i.e., without using historical image frames).
3 In some alternative embodiments, the computer cloud .108 detects and tracks 4 mobile objects in the FOV of each imaging device by analyzing a set of consecutively captured images, including the most recently captured image and a 6 plurality of previously consecutively captured images. In some other embodiments, 7 the computer cloud 108 may combine image frames captured by a plurality of 8 imaging devices for detecting and tracking mobile objects.
Ambiguity may occur during visual tracking of mobile objects. Ambiguity is a well-known issue in visual object tracking, and includes a variety of situations that make visual object tracking less reliable or even unreliable.
Ambiguity may occur in a single imaging device capturing images of a single mobile object. For example, in a series of images captured by an imaging device, a mobile object is detected moving towards a bush, disappearing and then appearing from the opposite side of the bush. Ambiguity may occur as it may be uncertain whether the images captured a mobile object passing behind the bush, or the images captured a first mobile object that moved behind the bush and stayed therebehind, and then a second mobile object previously staying behind the bush that moved out thereof.
Ambiguity may occur in a single imaging device capturing images of multiple mobile objects. For example, in a series of image frames captured by an imaging device, two mobile objects are detected moving towards each other, merging into one object, and then separating into two objects again and moving apart from each other. Ambiguity occurs in this situation as it may be uncertain whether the two mobile objects are crossing each other or the two mobile objects are moving towards each other to a meeting point (appearing in the captured images as one object), and then turning back to their respective coming directions.
Ambiguity may occur across multiple imaging devices. For example, in images captured by a first imaging device, a mobile object moves and disappears from the field of view (FOV) of the first imaging device. Then, in images captured by a second, neighboring imaging device, a mobile object appears in the FOV thereof. Ambiguity may occur in this situation as it may be uncertain whether it was the same mobile object moving from the FOV of the first imaging device into that of the second imaging device, or a first mobile object moved out of the FOV of the first imaging device and a second mobile object moved into the FOV of the second imaging device.
Other types of ambiguity in visual object tracking are also possible. For example, when determining the location of a mobile object in the site 102 based on the location of the mobile object in a captured image, ambiguity may occur as the determined location may not have the precision required by the system.
In embodiments disclosed herein, when ambiguity occurs, the system uses tag measurements obtained from tag devices to associate objects detected in captured images with the tag devices for resolving the ambiguity.
Each tag device 114 is a small, battery-operated electronic device, which in some embodiments, may be a device designed specifically for mobile object tracking, or alternatively may be a multi-purpose mobile device suitable for mobile object tracking, e.g., a smartphone, a tablet, a smart watch and the like. Moreover, in some alternative embodiments, some tag devices may be integrated with the corresponding mobile objects such as carts, wheelchairs, robots and the like.
Each tag device comprises a processing structure, one or more sensors and the necessary circuits connecting the sensors to the processing structure. The processing structure controls the sensors to collect data, also called tag measurements or tag observations, and establishes communication with the computer cloud 108. In some embodiments, the processing structure may also establish peer-to-peer communication with other tag devices 114. Each tag device also comprises a unique identification code, which is used by the computer cloud 108 for uniquely identifying the tag devices 114 in the site 102.
In different embodiments, the tag device 114 may comprise one or more sensors for collecting tag measurements regarding the mobile object 112. The number and types of sensors used in each embodiment depend on the design target thereof, and may be selected by the system designer as needed and/or desired. The sensors may include, but are not limited to, an inertial measurement unit (IMU) having accelerometers and/or gyroscopes (e.g., rate gyros) for motion detection, a barometer for measuring atmospheric pressure, a thermometer for measuring temperature external to the tag 114, a magnetometer, a global navigation satellite system (GNSS) sensor, e.g., a Global Positioning System (GPS) receiver, an audio frequency microphone, a light sensor, a camera, and an RSS measurement sensor for measuring the signal strength of a received wireless signal.
An RSS measurement sensor is a sensor for measuring the signal strength of a wireless signal received from a transmitter, for estimating the distance from the transmitter. The RSS measurement may be useful for estimating the location of a tag device 114. As described above, a tag device 114 may communicate with other nearby tag devices 114 using peer-to-peer communications 118. For example, some tag devices 114 may comprise a short-distance communication device such as a Bluetooth® Low Energy (BLE) device. Examples of BLE devices include transceivers using the iBeacon™ technology specified by Apple Inc. of Cupertino, CA, U.S.A. or using Samsung's Proximity™ technology. As those skilled in the art understand, a BLE device broadcasts a BLE signal (a so-called beacon), and/or receives BLE beacons transmitted from nearby BLE devices.
A BLE device may be a mobile device such as a tag device 114, a smartphone, a tablet, a laptop, a personal data assistant (PDA) or the like that uses a BLE technology. A BLE device may also be a stationary device such as a BLE transmitter deployed in the site 102.
A BLE device may detect BLE beacons transmitted from nearby BLE devices, determine their identities using the information embedded in the BLE beacons, and establish peer-to-peer links therewith. A BLE beacon usually includes a universally unique identifier (UUID), a Major ID and a Minor ID. The UUID generally represents a group, e.g., an organization, a firm, a company or the like, and is the same for all BLE devices in a same group. The Major ID represents a subgroup, e.g., a store of a retail company, and is the same for all BLE devices in a same subgroup. The Minor ID represents the BLE device in a subgroup. The combination of the UUID, Major ID and Minor ID, i.e., (UUID, Major ID, Minor ID), then uniquely determines the identity of the BLE device.
The short-distance communication device may comprise sensors for wireless received signal strength (RSS) measurement, e.g., Bluetooth® RSS measurement. As those skilled in the art appreciate, a BLE beacon may further include a reference transmit signal power indicator. Therefore, a tag device 114, when it detects a BLE beacon broadcast from a nearby transmitter BLE device (which may be a nearby tag device 114 or a different BLE device such as a BLE transmitter deployed in the site 102), may measure the received signal power of the BLE beacon to obtain an RSS measurement, and compare the RSS measurement with the reference transmit signal power embedded in the BLE beacon to estimate the distance from the transmitter BLE device.
The system 100 therefore may use the RSS measurement obtained by a target tag device regarding the BLE beacon of a transmitter BLE device to determine that two mobile objects 112 are in close proximity, such as two persons in contact, conversing, or the like (if the transmitter BLE device is another tag device 114), or to estimate the location of the mobile object 112 associated with the target tag device (if the transmitter BLE device is a BLE transmitter deployed at a known location), which may be used to facilitate the detection and tracking of the mobile object 112.
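As an illustrative sketch only, the distance estimate from an RSS measurement and the reference transmit power may be computed along the following lines, assuming a log-distance path-loss model; the function name, path-loss exponent and example values are assumptions for illustration rather than part of the disclosed system.

```python
def estimate_distance_from_rss(rss_dbm, ref_power_dbm, path_loss_exponent=2.0):
    """Estimate transmitter-receiver distance (metres) from an RSS measurement.

    rss_dbm: received signal strength measured by the tag device (dBm).
    ref_power_dbm: reference transmit power embedded in the BLE beacon,
        conventionally the expected RSS at 1 m from the transmitter (dBm).
    path_loss_exponent: environment-dependent constant (~2 in free space,
        higher indoors); assumed known from site calibration.
    """
    return 10 ** ((ref_power_dbm - rss_dbm) / (10.0 * path_loss_exponent))

# Example: a beacon advertising -59 dBm at 1 m, measured at -75 dBm,
# is roughly 6 m away under a path-loss exponent of 2.
print(round(estimate_distance_from_rss(-75, -59), 1))
```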

Alternatively, in some embodiments, the system may comprise a map of the site 102 indicative of the transmitter signal strength of a plurality of wireless signal transmitters, e.g., Bluetooth® and/or WiFi® access points, deployed at known locations of the site 102. The system 100 may use this wireless signal strength map and compare it with the RSS measurements of a tag device 114 to estimate the location of the tag device 114. In these embodiments, the wireless signal transmitters do not need to include a reference transmit signal power indicator in the beacon.
The computer cloud 108 tracks the mobile objects 112 using information obtained from images captured by the one or more imaging devices 104 and from the above-mentioned sensor data of the tag devices 114. In particular, the computer cloud 108 detects foreground objects or foreground feature clusters (FFCs) from images captured by the imaging devices 104 using image processing technologies.
Herein, the imaging devices 104 are located at fixed locations in the site 102, generally oriented toward a fixed direction (except that in some embodiments an imaging device may occasionally pan to a different direction), and focused, to provide a reasonably static background. Moreover, the lighting in the FOV of each imaging device is generally unchanged for the time intervals of interest, or changes slowly enough that it may be considered unchanged among a finite number of consecutively captured images. Generally, the computer cloud 108 maintains a background image for each imaging device 104, which typically comprises images of permanent features of the site such as the floor, ceiling, walls and the like, and semi-permanent structures such as furniture, plants, trees and the like. The computer cloud 108 periodically updates the background images.
Mobile objects, whether moving or stationary, generally appear in the captured images as foreground objects or FFCs that occlude the background. Each FFC is an identified area in the captured images corresponding to a moving object that may be associated with a tag device 114. Each FFC is bounded by a bounding box. A mobile object that is stationary for an extended period of time, however, may become a part of the background and undetectable from the captured images.
The computer cloud 108 associates detected FFCs with tag devices 114 using the information of the captured images and information received from the tag devices 114, for example, both evidencing motion of 1 meter per second. As each tag device 114 is associated with a mobile object 112, an FFC successfully associated with a tag device 114 is then considered an identified mobile object 112, and is tracked in the site 102.
Obviously, there may exist mobile objects in the site 102 that are not associated with any tag device 114, and which therefore cannot be identified. Such unidentified mobile objects may be robots, animals, or people without a tag device. In this embodiment, unidentified mobile objects are ignored by the computer cloud 108. However, those skilled in the art appreciate that, alternatively, the unidentified mobile objects may also be tracked, to some extent, solely by using images captured by the one or more imaging devices 104.
Fig. 2 is a schematic diagram showing the functional structure 140 of the object tracking system 100. As shown, the computer cloud 108 functionally comprises a computer vision processing structure 146 and a network arbitrator component 148. Each tag device 114 functionally comprises one or more sensors 150 and a tag arbitrator component 152.
The network arbitrator component 148 and the tag arbitrator component 152 are the central components of the system 100 as they "arbitrate" the observations to be made by the tag device 114. The network arbitrator component 148 is a master component and the tag arbitrator components 152 are slave components. Multiple tag arbitrator components 152 may communicate with the network arbitrator component 148 at the same time and observations therefrom may be jointly processed by the network arbitrator component 148.
The network arbitrator component 148 manages all tag devices 114 in the site 102. When a mobile object 112 having a tag device 114 enters the site 102, the tag arbitrator component 152 of the tag device 114 automatically establishes communication with the network arbitrator component 148 of the computer cloud 108, via a so-called "handshaking" process. With handshaking, the tag arbitrator component 152 communicates its unique identification code to the network arbitrator component 148. The network arbitrator component 148 registers the tag device 114 in a tag device registration table (e.g., a table in a database), and communicates with the tag arbitrator component 152 of the tag device 114 to understand what types of tag measurements can be provided by the tag device and how much energy each tag measurement will consume.
During mobile object tracking, the network arbitrator component 148 maintains communication with the tag arbitrator components 152 of all tag devices 114, and may request one or more tag arbitrator components 152 to provide one or more tag measurements. The tag measurements that a tag device 114 can provide depend on the sensors installed in the tag device. For example, accelerometers have an output triggered by the magnitude of change of acceleration, which can be used for sensing the moving of the tag device 114. The accelerometer and rate gyro can provide motion measurements of the tag device 114 or the mobile object associated therewith. The barometer may provide an air pressure measurement indicative of the elevation of the tag device 114.
With the information of each tag device 114 obtained during handshaking, the network arbitrator component 148 can dynamically determine which tag devices and what tag measurements therefrom are needed to facilitate mobile object tracking with minimum power consumption incurred by the tag devices (described in more detail later).
When the network arbitrator component 148 is no longer able to communicate with the tag arbitrator component 152 of a tag device 114 for a predefined period of time, the network arbitrator component 148 considers that the tag device 114 has left the site 102 or has been deactivated or turned off. The network arbitrator component 148 then deletes the tag device 114 from the tag device registration table.
As shown in Fig. 2, a camera system 142 such as a security camera system (SCS) controls the one or more imaging devices 104, collects images captured by the imaging devices 104, and sends captured images to the computer vision processing structure 146.
The computer vision processing structure 146 processes the received images for detecting FFCs therein. Generally, the computer vision processing structure 146 maintains a background image for each imaging device 104. When an image captured by an imaging device 104 is sent to the computer vision processing structure 146, the computer vision processing structure 146 calculates the difference between the received image and the stored background image to obtain a difference image. With suitable image processing technology, the computer vision processing structure 146 detects the FFCs from the difference image. In this embodiment, the computer vision processing structure 146 periodically updates the background image to adapt to changes in the background environment, e.g., illumination changes over time.
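For illustration only, the background-differencing step might be sketched as follows using the OpenCV library, assuming a simple absolute-difference threshold followed by contour extraction; the threshold values, minimum blob area and function name are illustrative assumptions rather than the disclosed implementation.

```python
import cv2
import numpy as np

def detect_ffcs(frame_bgr, background_bgr, diff_threshold=30, min_area=500):
    """Detect foreground feature clusters (FFCs) by background differencing.

    Returns a list of bounding boxes (x, y, w, h) of blobs in the difference
    image whose area exceeds min_area. Threshold values are illustrative.
    """
    frame = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    background = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(frame, background)                       # difference image
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    # OpenCV 4.x signature: returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```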
Fig. 3 shows an FFC 160 detected in a captured image. As shown, a bounding box 162 is created around the extremes of the blob of the FFC 160. In this embodiment, the bounding box is a rectangular bounding box, and is used in image analysis unless detail, e.g., color, pose and other features, of the FFC is required.
A centroid 164 of FFC 160 is determined. Here, the centroid 164 is not necessarily the center of the bounding box 162.
A bounding box tracking point (BBTP) 166 is determined at a location on the lower edge of the bounding box 162 such that a virtual line between the centroid 164 and the BBTP 166 is perpendicular to the lower edge of the bounding box 162. The BBTP 166 is used for determining the location of the FFC 160 (more precisely, the mobile object represented by FFC 160) in the site 102. In some alternative embodiments, both the centroid 164 and the BBTP 166 are used for determining the location of the FFC 160 in the site 102.
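As a minimal sketch, the bounding box, centroid and BBTP of a single FFC blob might be computed as below, assuming the blob is available as a binary mask; the function name and the use of OpenCV moments are illustrative assumptions.

```python
import cv2

def ffc_tracking_points(blob_mask):
    """Compute the bounding box, centroid and BBTP of a single FFC blob.

    blob_mask: binary image (uint8) containing one non-empty foreground blob.
    The BBTP lies on the lower edge of the bounding box, directly below the
    centroid, so the centroid-to-BBTP line is perpendicular to that edge.
    """
    x, y, w, h = cv2.boundingRect(blob_mask)           # bounding box of the blob
    m = cv2.moments(blob_mask, binaryImage=True)       # image moments
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid (not box centre)
    bbtp = (cx, float(y + h))                          # lower edge, below centroid
    return (x, y, w, h), (cx, cy), bbtp
```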
In some embodiments, the outline of the FFC 160 may be reduced to a small set of features based on posture to determine, e.g., if the corresponding mobile object 112 is standing or walking. Moreover, analysis of the FFC 160 detected over a group of sequentially captured images may show that the corresponding mobile object is walking and may further provide an estimate of the gait frequency. As will be described in more detail later, a tag-image correlation between the tag measurements, e.g., gait frequency obtained by tag devices, and the analysis results of the captured images may be calculated for establishing FFC-tag association.
The computer vision processing structure 146 sends detected FFCs to the network arbitrator component 148. The network arbitrator component 148 associates the detected FFCs with tag devices 114, and, if needed, communicates with the tag arbitrator components 152 of the tag devices 114 to obtain tag measurements therefrom for facilitating FFC-tag association.
The tag arbitrator component 152 of a tag device 114 may communicate with the tag arbitrator components 152 of other nearby tag devices 114 using peer-to-peer communications 118.
Fig. 4 is a schematic diagram showing the main function blocks of the 21 system 100 and the data flows therebetween. As shown, the camera system 22 feeds images captured by the cameras 104 in the site 102 into the computer vision 23 processing block 146. The computer vision processing block 146 processes the 1 images received from the camera system 142 such as necessary filtering, image 2 corrections and the like, and isolates or detects a set of FFCs in the images that 3 may be associated with tag devices 114.
4 The set of FFCs and their associated bounding boxes are then sent to the network arbitrator component 148. The network arbitrator component 148 6 analyzes the FFCs and may request the tag arbitrator components 152 of one or 7 more tag devices 114 to report tag measurements for facilitating FFC-tag 8 association.
9 Upon receiving a request from the network arbitrator component 148, the tag arbitrator component 152 in response makes necessary tag measurements 11 from the sensors 150 of the tag device 114, and sends tag measurements to the 12 network arbitrator component 148. The network arbitrator component 148 uses 13 received tag measurements to establish the association between the FFCs and the 14 tag devices 114. Each FFC associated with a tag device 114 is considered as an identified mobile object 112 and is tracked by the system 100.
The network arbitrator component 148 stores each FFC-tag association and an association probability thereof (FFC-tag association probability, described later) in a tracking table 182 (e.g., a table in a database). The tracking table 182 is updated every frame as required.
Data of FFC-tag associations in the tracking table 182, such as the 21 height, color, speed and other feasible characteristics of the FFCs, is fed back to 22 the computer vision processing block 146 for facilitating the computer vision 23 processing block 146 to better detect the FFC in subsequent images.
Figs. 5A and 5B illustrate a flowchart 200, in two sheets, showing steps of a process of tracking mobile objects 112 using a vision-assisted hybrid location algorithm. As described before, a mobile object 112 is considered by the system 100 as an FFC associated with a tag device 114, or an "FFC-tag association" for simplicity of description.
The process starts when the system is started (step 202). After start, the system first goes through an initialization step 204 to ensure that all function blocks are ready for tracking mobile objects. For ease of illustration, this step also includes tag device initialization that will be executed whenever a tag device enters the site 102.
11 As described above, when a tag device 114 is activated, e.g., entering 12 the site 102, or upon turning on, it automatically establishes communication with the 13 computer cloud 108, via the "handshaking" process, to register itself in the computer 14 cloud 108 and to report to the computer cloud regarding what types of tag measurements can be provided by the tag device 114 and how much energy each 16 tag measurement will consume.
As the newly activated tag device 114 does not have any prior association with an FFC, the computer cloud 108, during handshaking, requests the tag device 114 to conduct a set of observations or measurements to facilitate the subsequent FFC-tag association with a sufficient FFC-tag association probability. For example, in an embodiment, the site 102 is a building, with a radio-frequency identification (RFID) reader and an imaging device 104 installed at the entrance thereof. A mobile object 112 is equipped with a tag device 114 having an RFID tag. When the mobile object 112 enters the site 102 through the entrance thereof, the system detects the tag device 114 via the RFID reader. The detection of the tag device 114 is then used for associating the tag device with the FFC detected in the images captured by the imaging device at the entrance of the site 102.
Alternatively, facial recognition using images captured by the imaging device at the entrance of the site 102 may be used to establish the initial FFC-tag association. In some alternative embodiments, other biometric sensors coupled to the computer cloud 108, e.g., iris or fingerprint scanners, may be used to establish the initial FFC-tag association.
After initialization, each imaging device 104 of the camera system 142 captures images of the site 102, and sends a stream of captured images to the computer vision processing block 146 (step 206).
13 The computer vision processing block 146 detects FFCs from the 14 received image streams (step 208). As described before, the computer vision processing structure 146 maintains a background image for each imaging device 16 104. When a captured image is received, the computer vision processing structure 17 146 calculates the difference between the received image and the stored 18 background image to obtain a difference image, and detects FFCs from the 19 difference image.
The computer vision processing block 146 then maps the detected FFCs into a three-dimensional (3D), physical-world coordinate system of the site by using, e.g., a perspective mapping or perspective transform technology (step 210). With the perspective mapping technology, the computer vision processing block 146 maps points in a two-dimensional (2D) image coordinate system (i.e., a camera coordinate system) to points in the 3D, physical-world coordinate system of the site using a 3D model of the site. The 3D model of the site is generally a description of the site and comprises a plurality of localized planes connected by stairs and ramps. The computer vision processing block 146 determines the location of the corresponding mobile object in the site by mapping the BBTP and/or the centroid of the FFC to the 3D coordinate system of the site.
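For illustration only, mapping a BBTP from the 2D image coordinate system onto one localized floor plane of the site may be sketched with a planar homography, assuming four calibrated reference points per imaging device; the coordinate values below are illustrative, and the elevation of the mapped point would come from the corresponding plane of the 3D site model.

```python
import cv2
import numpy as np

# Four reference points on a locally planar floor area, expressed both in image
# pixels and in the site's physical-world coordinates (metres). Values are
# illustrative; in practice they come from calibrating each imaging device
# against the 3D site model.
image_pts = np.array([[100, 700], [1200, 690], [1100, 300], [200, 310]], dtype=np.float32)
world_pts = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 12.0], [0.0, 12.0]], dtype=np.float32)

H = cv2.getPerspectiveTransform(image_pts, world_pts)    # image -> floor plane

def bbtp_to_site(bbtp_xy):
    """Map a BBTP from image coordinates to floor-plane coordinates of the site."""
    p = np.array([[bbtp_xy]], dtype=np.float32)          # shape (1, 1, 2)
    return cv2.perspectiveTransform(p, H)[0, 0]

print(bbtp_to_site((650.0, 500.0)))                      # approximate position in metres
```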
8 The computer vision processing block 146 sends detected FFCs, 9 including their bounding boxes, BBTPs, their locations in the site and other relevant information, to the network arbitrator component 148 (step 212). The network 11 arbitrator component 148 then collaborates with the tag arbitrator components 152 12 to associate each FFC with a tag device 114 and track the FFC-tag association, or, 13 if an FFC cannot be associated with any tag device 114, mark it as unknown 14 (steps 214 to 240).
In particular, the network arbitrator component 148 selects an FFC, and analyzes the image streams regarding the selected FFC (step 214). Depending on the implementation, in some embodiments, the image stream from the imaging device that captures the selected FFC is analyzed. In some other embodiments, other image streams, such as image streams from neighboring imaging devices, are also used in the analysis.
In this embodiment, the network arbitrator component 148 uses a position estimation method based on a suitable statistical model, such as a first order Markov process, and in particular, uses a Kalman filter with a first order Markov Gaussian process, to analyze the FFCs in the current images and historical images captured by the same imaging device to associate the FFCs with tag devices 114 for tracking. Motion activities of the FFCs are estimated, which may be compared with tag measurements for facilitating the FFC-tag association.
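As a minimal sketch only, a first order (constant-velocity) Kalman filter over the mapped BBTP positions might look like the following; the noise parameters, frame interval and class name are illustrative assumptions rather than the particular filter used by the system.

```python
import numpy as np

class ConstantVelocityKalman:
    """Constant-velocity Kalman filter for one FFC tracking point.

    State is [x, y, vx, vy] in site coordinates; measurements are the mapped
    BBTP positions [x, y] from consecutive image frames. Noise levels are
    illustrative and would be tuned per deployment.
    """

    def __init__(self, dt=1.0 / 30.0, process_noise=1.0, meas_noise=0.25):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)      # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)       # measurement model
        self.Q = process_noise * np.eye(4)
        self.R = meas_noise * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with measurement z = [x, y]
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                    # filtered position
```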
Various types of image analysis may be used for estimating motion 6 activity and modes of the FFCs.
For example, analyzing the BBTP of an FFC and the background may determine whether the FFC is stationary or moving in the foreground. Usually, a slight movement is detectable. However, as the computer vision processing structure 146 periodically updates the background image, a long-term stationary object 112 may become indistinguishable from the background, and no FFC corresponding to such an object 112 would be reliably detected from captured images. In some embodiments, if an FFC that has been associated with a tag device disappears at a location, i.e., the FFC is no longer detectable in the current image, but has been detected as stationary in historical images, the computer cloud 108 then assumes that a "hidden" FFC is still at the last known location, and maintains the association of the tag device with the "hidden" FFC.
18 By analyzing the BBTP of an FFC and background, it may be detected 19 that an FFC spontaneously appears from the background, if the FFC is detected in the current image but not in historical images previously captured by the same 21 imaging device. Such a spontaneous appearance of FFC may indicate that a long-22 term stationary mobile object starts to move, that a mobile object enters the FOV of 23 the imaging device from a location undetectable by the imaging device (e.g., behind 1 a door) if the FFC appears at an entrance location such as a door, or that a mobile 2 object enters the FOV of the imaging device from the FOV of a neighboring imaging 3 device if the FFC appears at about the edge of the captured image. In some 4 embodiments, the computer cloud 108 jointly processes the image streams from all imaging devices. If an FFC FA associated with a tag device TA disappears from the 6 edge of the FOV of a first imaging device, and a new FFC FB spontaneously 7 appears in the FOV of a second, neighboring imaging device at a corresponding 8 edge, the computer cloud 108 may determine that the mobile object previously 9 associated with FFC FA has moved from the FOV of the first imaging device into the FOV of the second imaging device, and associates the FFC FB with the tag 11 device TA.
By determining the BBTP in a captured image and mapping it into the 3D coordinate system of the site using perspective mapping, the location of the corresponding mobile object in the site, or its coordinates in the 3D coordinate system of the site, may be determined.
16 A BBTP may be mapped from a 2D image coordinate system into 3D, 17 physical-world coordinate system of the site using perspective mapping, and 18 various inferences can then be extracted therefrom.
19 For example, as will be described in more detail later, a BBTP may appear to suddenly "jump", i.e., quickly move upward, if the mobile object moves 21 partially behind a background object and is partially occluded, or may appear to 22 quickly move downwardly if the mobile object is moving out of the occlusion. Such a 23 quick upward/downward movement is unrealistic from a Bayesian estimation. As will 1 be described in more detail later, the system 100 can detect such unrealistic 2 upward/downward movement of the BBTP and correctly identify occlusion.
3 Identifying occlusion may be further facilitated by a 3D site map with 4 identified background structures, such as trees, statues, posts and the like, that may cause occlusion. By combining the site map and the tracking information mapped 6 thereinto, a trajectory of the mobile object passing possible background occlusion 7 objects may be derived with a high reliability.
If it is detected that the height of the bounding box of the FFC is shrinking or increasing, it may be determined that the mobile object corresponding to the FFC is moving away from or moving towards the imaging device, respectively. The change of scale of the FFC bounding box may be combined with the position change of the FFC in the captured images to determine the moving direction of the corresponding mobile object. For example, if the FFC is stationary or slightly moving, but the height of the bounding box of the FFC is shrinking, it may be determined that the mobile object corresponding to the FFC is moving radially away from the imaging device.
17 The biometrics of the FFC, such as height, width, face, stride length of 18 walking, length of arms and/or legs, and the like may be detected using suitable 19 algorithms for identification of the mobile object. For example, an Eigenface algorithm may be used for detecting face features of an FFC. The detected face 21 features may be compared with those registered in a database to determine the 22 identity of the corresponding mobile object, or be used to compare with suitable tag 23 measurements to identify the mobile object.

The angles and motion of joints, e.g., elbows and knees, of the FFC may be detected using segmentation methods, and correlated with plausible motion as mapped into the 3D coordinate system of the site. The detected angles and motion of joints may be used for sensing the activity of the corresponding mobile object such as walking, standing, dancing or the like. For example, in Fig. 3, it may be detected that the mobile object corresponding to FFC 160 is running by analyzing the angles of the legs with respect to the body. Generally, this analysis requires that at least some of the joints of the FFC are unobstructed in the captured images.
Two mobile objects may merge into one FFC in captured images. By using a Bayesian model, it may be detected that an FFC corresponds to two or more occluding objects. As will be described in more detail later, when establishing FFC-tag association, such an FFC is associated with the tag devices of the occluding mobile objects.
Similarly, two or more FFCs may emerge from a previously single FFC, 16 which may be detected by using the Bayesian model. As will be described in more 17 detail later, when establishing FFC-tag association, each of these FFCs is 18 associated with a tag device with an FFC-tag association probability.
As described above, based on the perspective mapping, the direction of the movement of an FFC may be detected. With the assumption that the corresponding mobile object is always facing the direction of the movement, the heading of the mobile object may be detected by tracking the change of direction of the FFC in the 3D coordinate system. If the movement trajectory of the FFC changes direction, the direction change of the FFC would be highly correlated with the change of direction sensed by the IMU of the corresponding tag device.
Therefore, tag measurements comprising data obtained from the IMU (comprising the accelerometer and/or gyroscope) may be used for calculating a tag-image correlation between the IMU data, or data obtained from the accelerometer and/or gyroscope, and the FFC analysis of captured images to determine whether the mobile object corresponding to the FFC is changing its moving direction. In an alternative embodiment, data obtained from a magnetometer may be used and correlated with the FFC analysis of captured images to determine whether the mobile object corresponding to the FFC is changing its moving direction.
The colors of the pixels of the FFC may also be tracked for determining the location and environment of the corresponding mobile object. Color change of the FFC may be due to lighting, the pose of the mobile object, the distance of the mobile object from the imaging device, and/or the like. A Bayesian model may be used for tracking the color attributes of the FFC.
By analyzing the FFC, a periodogram of the walking gait of the corresponding mobile object may be established. The periodicity of the walking gait can be determined from the corresponding periodogram of the bounding box variations. For example, if a mobile object is walking, the bounding box of the corresponding FFC will undulate with the object's walking. The bounding box undulation can be analyzed in terms of its frequency and depth for obtaining an indication of the walking gait.
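For illustration only, the gait frequency might be estimated from the bounding box undulation as in the following sketch, assuming a periodogram of the bounding box height over consecutive frames; the frame rate, function name and synthetic example are illustrative assumptions.

```python
import numpy as np
from scipy.signal import periodogram

def gait_frequency_from_bbox(heights, frame_rate=30.0):
    """Estimate walking-gait frequency from bounding-box height undulation.

    heights: bounding-box height (pixels) of one FFC over consecutive frames.
    Returns the dominant undulation frequency in Hz; the caller can compare it
    with the gait frequency reported by a tag device's IMU.
    """
    h = np.asarray(heights, dtype=float)
    h = h - h.mean()                              # remove the constant component
    freqs, power = periodogram(h, fs=frame_rate)
    return freqs[np.argmax(power[1:]) + 1]        # skip the zero-frequency bin

# Synthetic example: a ~2 Hz undulation sampled at 30 fps.
t = np.arange(0, 5, 1 / 30.0)
print(round(gait_frequency_from_bbox(180 + 5 * np.sin(2 * np.pi * 2.0 * t)), 1))
```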

1 The above list of analysis is non-exhaustive, and may be selectively 2 included in the system 100 by a system designer in various embodiments.

Referring back to Fig. 5A, at step 216, the network arbitrator component 148 uses the image analysis results to calculate an FFC-tag association probability between the selected FFC and each of one or more candidate tag devices 114, e.g., the tag devices 114 that have not been associated with any FFCs. At this step, no tag measurements are used in calculating the FFC-tag association probabilities.
Each calculated FFC-tag association probability is an indicative measure of the reliability of associating the FFC with a candidate tag device. If any of the calculated FFC-tag association probabilities is higher than a predefined threshold, the selected FFC can be associated with a tag device without using any tag measurements.
14 In some situations, an FFC may be associated with a tag device 114 and tracked by image analysis only and without using any tag measurements. For example, if a captured image comprises only one FFC, and there is only one tag 17 device 114 registered in the system 100, the FFC may be associated with the tag 18 device 114 without using any tag measurements.
As another example, the network arbitrator component 148 may analyze the image stream captured by an imaging device, including the current image and historical images captured by the same imaging device, to associate an FFC in the current image with an FFC in previous images such that the associated FFCs across these images represent a same object. If such an object has been previously associated with a tag device 114, then the FFC in the current image may be associated with the same tag device 114 without using any tag measurements.
As a further example, the network arbitrator component 148 may analyze a plurality of image streams, including the current images and historical images captured by the same and neighboring imaging devices, to associate an FFC with a tag device. For example, if an identified FFC in a previous image captured by a neighboring imaging device appears to be leaving the FOV thereof towards the imaging device that captures the current image, and an FFC in the current image appears to enter the FOV thereof from the neighboring imaging device, then the FFC in the current image may be considered the same FFC in the previous image captured by the neighboring imaging device, and can be identified, i.e., associated with the tag device that was associated with the FFC in the previous image captured by the neighboring imaging device.
At step 218, the network arbitrator component 148 uses the calculated FFC-tag association probabilities to check if the selected FFC can be associated with a tag device 114 and tracked without using any tag measurements. If any of the calculated FFC-tag association probabilities is higher than a predefined threshold, the selected FFC can be associated with a tag device without using any tag measurements, and the process goes to step 234 in Fig. 5B (illustrated in Figs. 5A and 5B using connector C).
However, if at step 218, none of the calculated FFC-tag association probabilities is higher than the predefined threshold, the selected FFC can only be associated with a tag device if further tag measurements are obtained. The network arbitrator component 148 then determines, based on the analysis of step 214, a set of tag measurements that may be most useful for establishing the FFC-tag association with a minimum tag device power consumption, and then requests the tag arbitrator components 152 of the candidate tag devices 114 to activate only the related sensors to gather the requested measurements, and report the set of tag measurements (step 220).
Depending on the sensors installed on the tag device 114, numerous attributes of a mobile object 112 may be measured.
For example, by using the accelerometer and rate gyro of the IMU, a mobile object in a stationary state may be detected. In particular, a motion measurement is first determined by combining and weighting the magnitude of the rate gyro vector and the difference in the accelerometer vector magnitude output. If the motion measurement does not exceed a predefined motion threshold for a predefined time threshold, then the tag device 114, or the mobile object associated therewith, is in a stationary state. There can be different levels of being static depending on how long the threshold has not been exceeded. For example, one level of static may be sitting still for 5 seconds, and another level of static may be lying inactively on a table for hours.
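As a sketch of the combined motion measurement and stationary test described above, the following is illustrative only; the weighting of the rate-gyro and accelerometer terms and the threshold values are assumptions rather than the disclosed parameters.

```python
import numpy as np

def is_stationary(accel_samples, gyro_samples,
                  motion_threshold=0.15, window_ratio=0.95):
    """Decide whether a tag device is stationary from IMU samples.

    accel_samples, gyro_samples: arrays of shape (N, 3) over the observation
    window. A combined, weighted motion measurement is formed from the rate-gyro
    magnitude and the change in accelerometer magnitude; the device is declared
    stationary if the measurement stays below the threshold for most of the
    window. Weights and thresholds are illustrative.
    """
    accel_mag = np.linalg.norm(np.asarray(accel_samples, dtype=float), axis=1)
    gyro_mag = np.linalg.norm(np.asarray(gyro_samples, dtype=float), axis=1)
    accel_change = np.abs(np.diff(accel_mag, prepend=accel_mag[0]))
    motion = 0.5 * gyro_mag + 0.5 * accel_change        # weighted combination
    return np.mean(motion < motion_threshold) >= window_ratio
```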
Similarly, a mobile object 112 transitioning from stationary to moving may be detected by using the accelerometer and rate gyro of the IMU. As described above, the motion measurement is first determined. If the motion measurement exceeds the predefined motion threshold for a predefined time threshold, the tag device 114 or mobile object 112 is in motion.
Slight motion, walking or running of a mobile object 112 may be detected by using the accelerometer and rate gyro of the IMU. While being non-stationary, whether a tag device 114 or mobile object 112 in motion is in slight motion while standing in one place, walking at a regular pace, running or jumping may be further determined using the outputs of the accelerometer and rate gyro. Moreover, the outputs of the accelerometer and rate gyro may also be used for recognizing gestures of the mobile object 112.

Rotation of a mobile object 112 while walking or standing still may be detected by using the accelerometer and rate gyro of the IMU. Provided that the attitude of the mobile object 112 does not change during the rotation, the angle of rotation is approximately determined from the magnitude of the rotation vector, which may be determined from the outputs of the accelerometer and rate gyro.
A mobile object 112 going up/down stairs may be detected by using the barometer and accelerometer. Using the output of the barometer, pressure changes may be resolvable almost to each step going up or down the stairs, which may be confirmed by the gesture detected from the output of the accelerometer.
A mobile object 112 going up/down an elevator may be detected by using the barometer and accelerometer. The smooth pressure changes between each floor as the elevator ascends and descends may be detected from the output of the barometer, which may be confirmed by a smooth change of the accelerometer output.
A mobile object 112 going in or out of a doorway may be detected by using the thermometer and barometer. Going from outdoor to indoor or from indoor to outdoor causes a change in temperature and pressure, which may be detected from the outputs of the thermometer and barometer. Going from one room through a doorway to another room also causes changes in temperature and pressure detectable by the thermometer and barometer.
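For illustration of the barometric measurements above, pressure readings might be converted to an approximate height change as follows, assuming the hypsometric formula for small height differences; the function name, constants and example readings are illustrative assumptions.

```python
import math

def pressure_to_relative_height(p_hpa, p_ref_hpa, temperature_c=20.0):
    """Convert a barometer reading to height (metres) relative to a reference.

    p_ref_hpa is the pressure recorded when the tag was at a known floor.
    Roughly 0.12 hPa corresponds to one metre near sea level, so one storey
    (~3 m) produces an easily detectable change.
    """
    t_kelvin = temperature_c + 273.15
    return (287.05 * t_kelvin / 9.80665) * math.log(p_ref_hpa / p_hpa)

# Example: 1013.25 hPa at the reference floor, 1012.9 hPa one flight up (~3 m).
print(round(pressure_to_relative_height(1012.9, 1013.25), 1))
```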
A short-term relative trajectory of a mobile object 112 may be detected by using the accelerometer and rate gyro. Conditioned on an initial attitude of the mobile object 112, the short-term trajectory may be detected based on the integration and transformation of the outputs of the accelerometer and rate gyro. Initial attitudes of the mobile object 112 may need to be taken into account in the detection of the short-term trajectory.
A periodogram of the walking gait of a mobile object 112 may be detected by using the accelerometer and rate gyro.
The position and trajectory of a mobile object 112 may be fingerprinted based on the magnetic vector by using the magnetometer and accelerometer. In some embodiments, the system 100 comprises a magnetic field map of the site 102. Magnetometer fingerprinting, aided by the accelerometer outputs, may be used to determine the position of the tag device 114 / mobile object 112. For example, by expressing the magnetometer and accelerometer measurements as two vectors, respectively, the vector cross-product of the magnetometer measurement vector and the accelerometer measurement vector can be calculated. With suitable time averaging, deviations of such a cross-product are approximately related to the magnetic field anomalies. In an indoor environment or an environment surrounded by magnetic material (such as iron rods in construction), the magnetic field anomaly will vary significantly. Such magnetic field variation due to the building structure and furniture can be captured or recorded in the magnetic field site map during a calibration process. Thereby, the likelihood of the magnetic anomalies can be determined by continuously sampling the magnetic and accelerometer vectors over time and comparing the measured anomaly with that recorded in the magnetic field site map.
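The following is a rough sketch, for illustration only, of the time-averaged cross-product and a crude likelihood against a magnetic field site map; the function names, the averaging window and the Gaussian similarity measure are assumptions, not the disclosed calibration procedure.

```python
import numpy as np

def magnetic_anomaly_signature(mag_samples, accel_samples):
    """Time-averaged cross-product of magnetometer and accelerometer vectors.

    Both inputs are arrays of shape (N, 3) sampled over a short window. The
    accelerometer (gravity) vector gives a rough vertical reference, so
    deviations of the averaged cross-product track local magnetic-field
    anomalies; the signature can be compared with values recorded in a
    calibrated magnetic-field map of the site.
    """
    mag = np.asarray(mag_samples, dtype=float)
    acc = np.asarray(accel_samples, dtype=float)
    cross = np.cross(mag, acc)                  # per-sample cross product
    return cross.mean(axis=0)                   # time-averaged signature

def anomaly_likelihood(signature, map_signature, scale=1.0):
    """Crude similarity between a measured signature and one from the site map."""
    d = np.linalg.norm(np.asarray(signature) - np.asarray(map_signature))
    return float(np.exp(-d ** 2 / (2 * scale ** 2)))
```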

The position and trajectory of a mobile object 112 may also be fingerprinted based on RSS by using RSS measurement sensors, e.g., RSS measurement sensors measuring Bluetooth® and/or WiFi® signal strength. By using the wireless signal strength map or the reference transmit signal power indicator in the beacon as described above, the location of a tag device 114 may be approximately determined using RSS fingerprinting based on the output of the RSS measurement sensor.
A single sample of the RSS measurement taken by a tag device 114 can be highly ambiguous as it is subject to multipath distortion of the electromagnetic radio signal. However, a sequence of samples taken by the tag device 114 as it is moving with the associated mobile object 112 will provide an average that can be correlated with an RSS radio map of the site. Consequently, the trend of the RSS measurements as the mobile object is moving is related to the mobile object's position. For example, an RSS measurement may indicate that the mobile object is moving closer to an access point at a known position. Such an RSS measurement may be used with the image-based object tracking for resolving ambiguity. Moreover, some types of mobile objects, such as the human body, will absorb wireless electromagnetic signals, which may be leveraged for obtaining more inferences from RSS measurements.
3 Motion related sound, such as periodic rustling of clothes items 4 brushing against the tag device, a wheeled object wheeling over a floor surface, sound of an object sliding on a floor surface, and the like, may be detected by using 6 an audio microphone. Periodogram of the magnitude of the acoustic signal captured 7 by a microphone of the tag device 114 may be used to detect walking or running 8 gait.
9 Voice of the mobile object or voice of another nearby mobile object may be detected by using an audio microphone. Voice is a biometric that can be 11 used to facilitate tag-object association. By using voice detection and voice 12 recognition, analysis of voice picked up by the microphone can be useful for 13 determining the background environment of the tag device 114 / mobile object 112, 14 e.g., in a quiet room, outside, in a noisy cafeteria, in a room with reverberations and the like. Voice can also be used to indicate approximate distance between two 16 mobile objects 112 having tag devices 114. For example, if the microphones of two 17 tag devices 114 can mutually hear each other, the system 100 may establish that 18 the two corresponding mobile objects are at a close distance.
19 Proximity of two tag devices may be detected by using audio microphone and ultrasonic sounding. In some embodiments, a tag device 114 can 21 broadcast an ultrasonic sound signature using the microphone, which may be 22 received and detected by another tag device 114 using microphone, and used for 23 establishing the FFC-tag association and ranging.

The above list of tag measurements is non-exhaustive, and may be selectively included in the system 100 by a system designer in various embodiments. Typically there is ample information for tag devices to measure for positively forming the FFC-tag association.
The operation of the network arbitrator component 148 and the tag arbitrator component 152 is driven by an overriding optimization objective. In other words, a constrained optimization is conducted with the objective of minimizing the tag device energy expenditure (e.g., minimizing battery consumption such that the battery of the tag device can last for several weeks). The constraints are that the estimated location of the mobile object equipped with the tag device (i.e., the tracking precision) needs to be within an acceptable error range, e.g., within a two-meter range, and that the association probability between an FFC, i.e., an observed object, and the tag device is required to be above a pre-determined threshold.
In other words, the network arbitrator component 148, during the above-mentioned handshaking process with each tag device 114, learns what types of tag measurements can be provided by the tag device 114 and how much energy each tag measurement will consume. The network arbitrator component 148 then uses the image analysis results obtained at step 214 to determine which tag measurement would likely give rise to an FFC-tag association probability higher than the predefined probability threshold with the smallest power consumption.
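One way to picture this selection, offered only as an illustrative sketch and not the disclosed method, is a greedy heuristic that orders candidate measurements by expected association-probability gain per unit of energy and stops once the expected probability clears the threshold; the candidate names, costs, gains and the independence assumption are hypothetical.

```python
def select_tag_measurements(candidates, probability_threshold=0.9):
    """Pick a cheap set of tag measurements expected to resolve an FFC.

    candidates: list of (name, energy_cost, expected_probability_gain) tuples,
    as learned during handshaking and from the image analysis of step 214;
    all values here are illustrative.
    """
    chosen, expected_probability = [], 0.0
    for name, cost, gain in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
        if expected_probability >= probability_threshold:
            break
        chosen.append(name)
        # Assume independent contributions for this rough estimate.
        expected_probability = 1 - (1 - expected_probability) * (1 - gain)
    return chosen, expected_probability

requests, p = select_tag_measurements([
    ("imu_walking_probability", 1.0, 0.7),   # cheap, often decisive
    ("barometer_elevation", 0.5, 0.2),
    ("ble_rss_scan", 3.0, 0.6),
])
print(requests, round(p, 2))
```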
In some embodiments, one of the design goals of the system is to reduce the power consumption of the battery-driven tag devices 114. On the other hand, the power consumption of the computer cloud 108 is not constrained. In these embodiments, the system 100 may be designed in such a way that the computer cloud 108 takes on as much computation as possible to reduce the computation needs of the tag devices 114. Therefore, the computer cloud 108 may employ complex vision-based object detection methods such as face recognition, gesture recognition and other suitable biometrics detection methods, and jointly process the image streams captured by all imaging devices, to identify as many mobile objects as feasible, within their capability. The computer cloud 108 requests tag devices to report tag measurements only when necessary.
Referring back to Fig. 5A, at step 222, the tag arbitrator components 152 of the candidate tag devices 114 receive the tag measurement request from the network arbitrator component 148. In response, each tag arbitrator component 152 makes the requested tag measurements and reports the tag measurements to the network arbitrator component 148. The process then goes to step 224 of Fig. 5B (illustrated in Figs. 5A and 5B using connector A).
In this embodiment, at step 222, the tag arbitrator component 152 collects data from suitable sensors 150 and processes the collected data to obtain tag measurements. The tag arbitrator component 152 sends tag measurements, rather than raw sensor data, to the network arbitrator component 148 to save transmission bandwidth and cost.
For example, if the network arbitrator component 148 requests a tag arbitrator component 152 to report whether its associated mobile object is stationary or walking, the tag arbitrator component 152 collects data from the IMU and processes the collected IMU data to calculate a walking probability indicating the likelihood of the associated mobile object walking. The tag arbitrator component 152 then sends the calculated walking probability to the network arbitrator component 148. Compared to transmitting the raw IMU data, transmitting the calculated walking probability of course consumes much less communication bandwidth and power.
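Purely as an illustration of such an on-device measurement, a walking probability might be derived as the fraction of accelerometer signal power falling within a typical gait frequency band, so that only a single scalar needs to be reported; the sample rate, band limits and function name are illustrative assumptions.

```python
import numpy as np
from scipy.signal import periodogram

def walking_probability(accel_samples, sample_rate=50.0, gait_band=(1.2, 2.8)):
    """Estimate on the tag device how likely its carrier is walking.

    accel_samples: (N, 3) accelerometer window. The fraction of non-DC signal
    power falling in a typical walking-gait frequency band serves as a crude
    probability reported to the network arbitrator component instead of raw
    IMU data. Band limits are illustrative.
    """
    mag = np.linalg.norm(np.asarray(accel_samples, dtype=float), axis=1)
    mag = mag - mag.mean()
    freqs, power = periodogram(mag, fs=sample_rate)
    total = power[1:].sum()
    if total <= 0:
        return 0.0
    in_band = power[(freqs >= gait_band[0]) & (freqs <= gait_band[1])].sum()
    return float(min(1.0, in_band / total))
```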
At step 224 (Fig. 5B), the network arbitrator component 148 then correlates the image analysis results of the FFC and the tag measurements received therefrom and calculates an FFC-tag association probability between the FFC and each candidate tag device 114.
At step 226, the network arbitrator component 148 checks if any of the calculated FFC-tag association probabilities is greater than the predefined probability threshold. If a calculated FFC-tag association probability is greater than the predefined probability threshold, the network arbitrator component 148 associates the FFC with the corresponding tag device 114 (step 234).
At step 236, the network arbitrator component 148 stores the FFC-tag association in the tracking table 182, together with data related thereto such as the location, speed, moving direction, and the like, if the tag device 114 has not yet been associated with any FFC, or updates the FFC-tag association in the tracking table if the tag device 114 has already been associated with an FFC in previous processing. The computer vision processing block 146 tracks the FFCs/mobile objects.

In this way, the system continuously detects and tracks the mobile objects 112 in the site 102 until the tag device 114 is no longer detectable, implying that the mobile object 112 has been stationary for an extended period of time or has moved out of the site 102, or until the tag device 114 cannot be associated with any FFC, implying that the mobile object 112 is at an undetectable location in the site (e.g., a location beyond the FOV of all imaging devices).
After storing/updating the FFC-tag association, the network arbitrator component 148 sends data of the FFC-tag association, such as the height, color, speed and other feasible characteristics of the FFCs, to the computer vision processing block 146 (step 238) for facilitating the computer vision processing block 146 to better detect the FFC in subsequent images, e.g., facilitating the computer vision processing block 146 in background differencing and bounding box estimation.
The process then goes to step 240, and the network arbitrator component 148 checks if all FFCs have been processed. If yes, the process goes to step 206 of Fig. 5A (illustrated in Figs. 5A and 5B using connector E) to process further images captured by the imaging devices 104. If not, the process loops to step 214 of Fig. 5A (illustrated in Figs. 5A and 5B using connector D) to select another FFC for processing.
If, at step 226, the network arbitrator component 148 determines that 21 no calculated FFC-tag association probability is greater than the predefined threshold, the network arbitrator component 148 then checks if the candidate tag 23 devices 114 can provide further tag measurements helpful in leading to a sufficiently 1 high FFC-tag association probability (step 228), and if yes, requests the candidate 2 tag devices 114 to provide further tag measurements (step 230). The process then 3 loops to step 222 of Fig. 5A (illustrated in Figs. 5A and 5B using connector B).
4 If, at step 228, it is determined that no further tag measurements would be available for leading to a sufficiently high FFC-tag association probability, 6 the network arbitrator component 148 marks the FFC as an unknown object 7 (step 232). As described before, unknown objects are omitted, or alternatively, 8 tracked up to a certain extent. The process then goes to step 240.
9 Although not shown in Figs. 5A and 5B, the process 200 may be terminated upon receiving a command from an administrative user.
Figs. 6A to 6D show an example of establishing and tracking an FFC-tag association following the process 200. As shown, the computer vision processing block 146 maintains a background image 250 of an imaging device. When an image 252 captured by the imaging device is received, the computer vision processing block 146 calculates a difference image 254 using suitable image processing technologies. As shown in Fig. 6C, two FFCs 272 and 282 are detected from the difference image 254. The two FFCs 272 and 282 are bounded by their respective bounding boxes 274 and 284. Each bounding box 274, 284 comprises a respective BBTP 276, 286. Fig. 6D shows the captured image 252 with detected FFCs 272 and 282 as well as their bounding boxes 274 and 284 and BBTPs 276 and 286.
When processing the FFC 272, the image analysis of image 252 and historical images shows that the FFC 272 is moving with a walking motion and the FFC 282 is stationary. As the image 252 comprises two FFCs 272 and 282, the FFC-tag association cannot be established by using the image analysis results only.
Two tag devices 114A and 114B have been registered in the system 100, neither of which has been associated with an FFC. Therefore, both tag devices 114A and 114B are candidate tag devices.
The network arbitrator component 148 then requests the candidate tag devices 114A and 114B to measure certain characteristics of the motion of their corresponding mobile objects. After receiving the tag measurements from tag devices 114A and 114B, the network arbitrator component 148 compares the motion tag measurements of each candidate tag device with those obtained from the image analysis to calculate the probability that the object is undergoing a walking activity. One of the candidate tag devices, e.g., tag device 114A, may obtain a motion tag measurement leading to an FFC-tag association probability higher than the predefined probability threshold. The network arbitrator component 148 then associates FFC 272 with tag device 114A and stores this FFC-tag association in the tracking table 182. Similarly, the network arbitrator component 148 determines that the motion tag measurement from tag device 114B indicates that its associated mobile object is in a stationary state, and thus associates tag device 114B with FFC 282. The computer vision processing block 146 tracks the FFCs 272 and 282.
With the process 200, the system 100 tracks the FFCs that are potentially moving objects in the foreground. The system 100 also tracks objects disappearing from the foreground, i.e., tag devices not associated with any FFC, which implies that the corresponding mobile objects may be outside the FOV of any imaging device 104, e.g., in a washroom area or private office where there is no camera coverage. Such disappearing objects, i.e., those corresponding to tag devices with no FFC-tag association, are still tracked based on tag measurements they provide to the computer cloud 108, such as RSS measurements.
Disappearing objects may also be those who have become static for an extended period of time and therefore part of the background and hence not part of a bounding box 162. It is usually necessary for the system 100 to track all tag devices 114 because in many situations only a portion of the tag devices can be associated with FFCs. Moreover, not all FFCs or foreground objects can be associated with tag devices. The system may track these FFCs based on image analysis only, or alternatively, ignore them.
12 With the process 200, an FFC may be associated with one or more 13 tag device 114. For example, when a mobile object 1120 having a tag device 1140 14 is sufficiently distant from other mobile objects in the FOV of an imaging device, the image of the mobile object 112C as an FFC is distinguishable from other mobile 16 objects in the captured images. The FFC of the mobile object 1120 is then 17 associated with the tag device 1140 only.
18 However, when a group of mobile objects 112D are close to each one, 19 e.g., two persons shaking hands, they may be detected. as one FFC in the captured images. In this case, the FFC is associated with all tag devices of the mobile 21 objects 112D.
Similarly, when a mobile object 112E is partially or fully occluded in the FOV of an imaging device by one or more mobile objects 112F, the mobile objects 112E and 112F may be indistinguishable in the captured images, and be detected as one FFC. In this case, the FFC is associated with all tag devices of the mobile objects 112E and 112F.
4 Those skilled in the art understand that an FFC associated with multiple tag devices is usually temporary. Any ambiguity caused therefrom may be 6 automatically resolved in subsequent mobile object detection and tracking when the 7 corresponding mobile objects are separated in the FOV of the imaging devices.
While the above has described a number of embodiments, those skilled in the art appreciate that other alternative embodiments are also readily available. For example, although in the above embodiments data of FFC-tag associations in the tracking table 182 is fed back to the computer vision processing block 146 for facilitating the computer vision processing block 146 to better detect the FFC in subsequent images (Fig. 4), in an alternative embodiment, no data of FFC-tag associations is fed back to the computer vision processing block 146.
Fig. 7 is a schematic diagram showing the main function blocks of the system 16 and the data flows therebetween in this embodiment. The object tracking process in 17 this embodiment is the same as the process 200 of Figs. 5A and 5B, except that, in 18 this embodiment, the process does not have step 238 of Fig. 5B.
19 In above embodiments, the network arbitrator component 148, when needing further tag measurements for establishing FFC-tag association, only 21 checks if the candidate tag devices 114 can provide further tag measurements 22 helpful in leading to a sufficiently high FFC-tag association probability (step 228 of 23 Fig 5B). In an alternative embodiment, when needing further tag measurements of a 1 first mobile object, the network arbitrator component 148 can request tag 2 measurements from the tag devices near the first mobile object, or directly use the 3 tag measurements if they are already sent to the computer cloud 108 (probably 4 previously requested for tracking other mobile objects). The tag measurements obtained from these tag devices can be used as inference to the location of the first 6 mobile object. This may be advantageous, e.g., for saving tag device power 7 consumption if the tag measurements of the nearby tag devices are already 8 available in the computer cloud, or when the battery power of the tag device 9 associated with the first object is low.
In another embodiment, the tag devices constantly send tag 11 measurements to the computer cloud 108 without being requested.
12 In another embodiment, each tag device attached to a non-human 13 mobile object, such as a wheelchair, a cart, a shipping box or the like, stores a 14 Type-ID indicating the type of the mobile object. In this embodiment, the computer cloud 108, when requesting tag measurements, can request tag devices to provide 16 their stored Type-ID, and then uses object classification to determine the type of the 17 mobile object, which may be helpful for establishing FFC-tag association. Of course, 18 alternatively, each tag device associated with a human object may also store a 19 Type-ID indicating the type, i.e., human, of the mobile object.
In another embodiment, each tag device is associated with a mobile object, and the association is stored in a database of the computer cloud 108. In this embodiment, when ambiguity occurs in the visual tracking of mobile objects, the computer cloud 108 may request tag devices to provide their ID, and checks the database to determine the identity of the mobile object for resolving the ambiguity.
3 In another embodiment, contour segmentation can be applied in 4 detecting FFCs. Then, motion of the mobile objects can be detected using suitable classification methods. For example, for individuals, after detecting an FFC, the 6 outline of the detected FFC can be characterized to a small set of features based on 7 posture for determining if the mobile object is standing or walking.
Furthermore, the 8 motion detected over a set of sequential image frames can give rise to an estimate 9 of the gait frequency, which may be correlated with the gait determined from tag measurements.
11 In above embodiments, the computer cloud 108 is deployed at the site 12 102, e.g., at an administration location thereof. However, those skilled in the art 13 appreciate that, alternatively, the computer cloud 108 may be deployed at a location 14 remote to the site 102, and communicates with imaging devices 104 and tag devices 114 via suitable wired or wireless communication means. In some other 16 embodiments, a portion of the computer cloud 108, including one or more server 17 computers 110 and necessary network infrastructure, may be deployed on the site 18 102, and other portions of the computer cloud 108 may be deployed remote to the 19 site 102. Necessary network infrastructure known in the art is required for communication between different portions of the computer cloud 108, and for 21 communication between the computer cloud 108 and the imaging devices 104 and 22 tag devices 114.

Implementation
The above embodiments show that the system and method disclosed herein are highly customizable, providing great flexibility to a system designer to implement the basic principles yet design the system as desired, and to adapt to the design target that the designer has to meet and to the resources that the designer has, e.g., available sensors in tag devices, battery capacities of tag devices, computational power of tag devices and the computer cloud, and the like.
8 In the following, several aspects in implementing the above described 9 system are described.
I. Imaging Device Frame Rates
In some embodiments, the imaging devices 104 may have different frame rates. For imaging devices with higher frame rates than others, the computer cloud 108 may, at step 206 of the process 200, reduce their frame rate by time-sampling images captured by these imaging devices, or by commanding these imaging devices to reduce their frame rates. Alternatively, the computer cloud 108 may adapt to the higher frame rates thereof to obtain better real-time tracking of the mobile objects in the FOVs of these imaging devices.

II. Background Images
The computer cloud 108 stores and periodically updates a background image for each imaging device. In one embodiment, the computer cloud 108 uses a moving average method to generate the background image for each imaging device.

That is, the computer cloud 108 periodically calculates the average of N consecutively captured images to generate the background image. While the N consecutively captured images may be slightly different from each other, e.g., having different lighting, foreground objects and the like, the differences between these images tend to disappear in the calculated background image when N is sufficiently large.
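As an illustration only, the moving-average background described above can be sketched in Python/NumPy as follows; the function name and the optional blending factor are illustrative assumptions, not part of the disclosed system:

```python
import numpy as np

def update_background(frames, previous_background=None, alpha=None):
    """Estimate a background image as the average of N consecutively
    captured frames (greyscale numpy arrays of equal size)."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    background = stack.mean(axis=0)
    # Optionally blend with the previous background estimate so the
    # background adapts slowly to lighting changes (running average).
    if previous_background is not None and alpha is not None:
        background = alpha * background + (1.0 - alpha) * previous_background.astype(np.float32)
    return background.astype(np.uint8)
```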

III. FFC Detection
In implementing step 208 of detecting FFCs, the computer vision processing block 146 may use any suitable image processing methods to detect FFCs from captured images. For example, Fig. 8 is a flowchart showing the detail of step 208 in one embodiment, which will be described together with the examples of Figs. 9A to 9F.
At step 302, a captured image is read into the computer vision processing block 146. In this embodiment, the captured image is an RGB color image. Fig. 9A is a line-drawn illustration of a captured color image having two facing individuals as two mobile objects.
18 At step 304, the captured image is converted to a greyscale image 19 (current image) and a difference image is generated by subtracting the background image, which is also a greyscale image in this embodiment, from the current image 21 on a pixel by pixel basis. The obtained difference image is converted to a binary 22 image by applying a suitable threshold, e.g., pixel value being equal to zero or not.

Fig. 9B shows the difference image 344 obtained from the captured image 342. As can be seen, two images 346 and 348 of the mobile objects in the FOV of the imaging device have been isolated from the background. However, the difference image 344 has imperfections. For example, images 346 and 348 of the mobile objects are incomplete as some regions of the mobile objects appear in the image with colors or grey intensities insufficient for differentiating them from the background. Moreover, the difference image 344 also comprises salt-and-pepper noise pixels 350.
At step 306, the difference image is processed using morphological operations to compensate for imperfections. Morphological operations process images based on shapes: they apply a structuring element to the input image, i.e., the difference image in this case, creating an output image of the same size. In morphological operations, the value of each pixel in the output image is determined based on a comparison of the corresponding pixel in the input image with its neighbors. Imperfections are thereby compensated to certain extents.
17 In this embodiment, the difference image 344 is first processed using morphological opening and closing. As shown in Fig. 9C, salt and pepper noise is 19 removed.
The difference image 344 is then processed using erosion and dilation operations. As shown in Fig. 9D the shapes of the mobile object images 346 and 22 348 are improved. However, the mobile object image 346 still contains a large 23 internal hole 354.

1 After erosion and dilation operations, a flood fill operation is applied to 2 the difference image 344 to close up any internal holes. The difference image 344 3 after flood fill operation is shown in Fig. 9E.
Also shown in Fig. 9E, the processed difference image 344 comprises small spurious FFCs 356 and 358. By applying suitable size criteria, such small spurious FFCs 356 and 358 are rejected as their sizes are smaller than a predefined threshold. Large spurious FFCs, on the other hand, may be retained as FFCs. However, they may be omitted later for not being able to be associated with any tag device. In some cases, a large spurious FFC, e.g., a shopping cart, may be associated with another FFC, e.g., a person, already associated with a tag device, based on similar motion between the two FFCs over time.

Referring back to Fig. 8, at step 308, the computer vision processing block 146 extracts FFCs 346 and 348 from the processed difference image 344, each of 346 and 348 being a connected region in the difference image 344 (see Fig. 9F). The computer vision processing block 146 creates bounding boxes 356 and 358 and their respective BBTPs (not shown) for FFCs 346 and 348, respectively. Other FFC characteristics as described above are also determined.
After extracting FFCs from the processed difference image, the process then goes to step 210 of Fig. 5A.
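The step-208 pipeline described above (difference image, thresholding, morphological clean-up, flood fill, size filtering and bounding boxes) could be prototyped with OpenCV roughly as below; the threshold, kernel size, minimum area, and the choice of the bottom-centre of the bounding box as the BBTP are illustrative assumptions rather than values from the disclosure:

```python
import cv2
import numpy as np

def detect_ffcs(captured_bgr, background_grey, threshold=30, min_area=500):
    """Difference image -> binary mask -> morphological clean-up ->
    connected regions (FFCs) with bounding boxes and BBTPs."""
    current = cv2.cvtColor(captured_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(current, background_grey)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)

    # Opening and closing remove salt-and-pepper noise and improve shapes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Flood-fill from the border (assumed background) to close internal holes.
    filled = mask.copy()
    h, w = mask.shape
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(filled, ff_mask, (0, 0), 255)
    mask = mask | cv2.bitwise_not(filled)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ffcs = []
    for c in contours:
        if cv2.contourArea(c) < min_area:      # reject small spurious FFCs
            continue
        x, y, bw, bh = cv2.boundingRect(c)
        bbtp = (x + bw // 2, y + bh)           # bottom-centre taken as the BBTP
        ffcs.append({"bbox": (x, y, bw, bh), "bbtp": bbtp, "contour": c})
    return ffcs
```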
The above process converts the captured color images to greyscale images for generating greyscale difference images and detecting FFCs. Those skilled in the art appreciate that, in an alternative embodiment, color difference images may be generated for FFC detection by calculating the difference on each color channel between the captured color image and the background color image. The calculated color channel differences are then weighted and added together to generate a greyscale image for FFC detection.
Alternatively, the calculated color channel differences may be enhanced by, e.g., first squaring the pixel values in each color channel, and then adding together the squared values of corresponding pixels in all color channels to generate a greyscale image for FFC detection.
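A minimal sketch of this squared-channel enhancement, assuming 8-bit BGR images as NumPy arrays (the rescaling back to 8 bits is an added assumption):

```python
import numpy as np

def color_difference_image(captured_bgr, background_bgr):
    """Sum of squared per-channel differences as an enhanced difference image."""
    diff = captured_bgr.astype(np.float32) - background_bgr.astype(np.float32)
    enhanced = (diff ** 2).sum(axis=2)     # add squared B, G and R differences
    if enhanced.max() > 0:
        enhanced = enhanced / enhanced.max()
    return (255 * enhanced).astype(np.uint8)
```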

IV. Shadows
It is well known that a shadow may be cast adjacent an object in some lighting conditions. Shadows of a mobile object captured in an image may interfere with FFC detection, the FFC centroid determination and BBTP determination. For example, Fig. 10 shows a difference image 402 having the image 404 of a mobile object, and the shadow 406 thereof, which is shown in the image 402 under the mobile object image 404. Clearly, if both the mobile object image 404 and the shadow 406 were detected as an FFC, an incorrect bounding box 408 would be determined, and the BBTP would be mistakenly determined at a much lower position 410, compared to the correct BBTP location 412. As a consequence, the mobile object would be mapped to a wrong location in the 3D coordinate system of the site, being much closer to the imaging device.
21 Various methods may be used to mitigate the impact of shadow in 22 detecting FFC and in determining the bounding box, centroid and BBTP of the FFC.
For example, in one embodiment, one may leverage the fact that the color of a shadow is usually different from that of the mobile object, and filter different color channels of a generated color difference image to eliminate the shadow or reduce the intensity thereof. This method would be less effective if the color of the mobile object is poorly distinguishable from the shadow.
In another embodiment, the computer vision processing block 146 6 considers the shadow as a random distribution, and analyses shadows in captured 7 images to differentiate shadows from mobile object images. For example, for an 8 imaging device facing a well-lit environment, where the lighting is essentially diffuse 9 and that all the background surfaces are Lambertian surfaces, the shadow cast by a mobile object consists of a slightly reduced intensity in a captured image comparing 11 to that of the background areas in the image, as the mobile object only blocks a 12 portion of the light that is emanating from all directions. The intensity reduction is 13 smaller with the shadow point being further from the mobile object.
Hence the 14 shadow will have an intensity distribution scaled with the distance between shadow points and the mobile object while the background has a deterministic intensity 16 value. As the distance from the mobile object to the imaging device is initially 17 unknown, the intensity of the shadow can be represented as a random distribution.
18 The computer vision processing block 146 thus analyses shadows in images 19 captured by this imaging device using a suitable random process method to differentiate shadows from mobile object images.
21 Some imaging devices may face an environment having specular light 22 sources and/or that the background surfaces are not Lambertian surfaces.
Shadows 23 in such environment may not follow the above-mentioned characteristics of the 1 diffuse lighting. Moreover, lighting may change with time, e.g., due to sunlight 2 penetration of room, electrical lights turned off or on, doors opened or closed, and 3 the like. Light changes will also affect the characteristics of shadows.
In some embodiments, the computer vision processing block 146 considers the randomness of the intensities of both the background and the shadow in each color channel, and considers that generally the background varies slowly and the foreground, e.g., a mobile object, varies rapidly. Based on such considerations, the computer vision processing block 146 uses pixel-wise high-pass temporal filtering to filter out shadows of mobile objects.
In some other embodiments, the computer vision processing block 11 146 determines a probability density function (PDF) of the background to adapt to 12 the randomness of the lighting effects. The intensity of background and shadow 13 components follows a mixture of gaussians (MoG) model, and a foreground, e.g., a 14 mobile object, is then discriminated probabilistically. As there are a large number of neighboring pixels making up the foreground region, then a spatial MoG
16 representation of the PDF of the foreground intensity can be calculated for 17 determining how different it is from the background or shadow.
18 In some further embodiments, the computer vision processing block 19 146 weights and combines the pixel-wise high pass temporal filtering and the spatial MoG models to determine if a given pixel is foreground, e.g., belonging to a mobile 21 object, with higher probability.
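The mixture-of-Gaussians background and shadow modelling discussed above is not tied to any particular library, but OpenCV's MOG2 background subtractor is one practical approximation: it keeps a per-pixel MoG background model and can label probable shadow pixels separately from foreground pixels. A sketch, with illustrative parameter values:

```python
import cv2

# MOG2 maintains a per-pixel mixture-of-Gaussians background model; with
# detectShadows=True, probable shadow pixels are labelled 127 in the mask
# while foreground pixels are labelled 255.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

def foreground_mask(frame_bgr):
    mask = subtractor.apply(frame_bgr)
    mask[mask == 127] = 0          # drop pixels classified as shadow
    return mask
```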
In still some further embodiments, the computer vision processing block 146 leverages the fact that, if a shadow is not properly eliminated, the BBTP of an FFC shifts from the correct location in the difference images and may shift with the change of lighting. With perspective mapping, such a shift of the BBTP in the difference images can be mapped to a physical location shift of the corresponding mobile object in the 3D coordinate system of the site. The computer vision processing block 146 calculates the physical location shift of the corresponding mobile object in the physical world, and requests the tag device to make the necessary measurements using, e.g., the IMU therein. The computer vision processing block 146 checks if the calculated physical location shift of the mobile object is consistent with the tag measurement, and compensates for the location shift using the tag measurement.

V. Perspective Mapping
As described above, at step 210 of Fig. 5A, the extracted FFCs are mapped to the 3D physical-world coordinate system of the site 102.
In one embodiment, the map of the site is partitioned into one or more horizontal planes L1, ..., Ln, each at a different elevation. In other words, in the 3D physical world coordinate system, points in each plane have the same z-coordinate. However, points in different planes have different z-coordinates. The FOV of each imaging device covers one or more horizontal planes.
A point $(x_{w,i}, y_{w,i}, 0)$ on a plane $L_i$ at an elevation $z_i = 0$ and falling within the FOV of an imaging device can be mapped to a point $(x_c, y_c)$ in the images captured by the imaging device:

$$\begin{bmatrix} f_x \\ f_y \\ f_v \end{bmatrix} = H_i \begin{bmatrix} x_{w,i} \\ y_{w,i} \\ 1 \end{bmatrix}, \qquad (1)$$

$$x_c = \frac{f_x}{f_v}, \qquad (2)$$

$$y_c = \frac{f_y}{f_v}, \qquad (3)$$

wherein

$$H_i = \begin{bmatrix} H_{11,i} & H_{12,i} & H_{13,i} \\ H_{21,i} & H_{22,i} & H_{23,i} \\ H_{31,i} & H_{32,i} & H_{33,i} \end{bmatrix} \qquad (4)$$

is a 3-by-3 perspective-transformation matrix.

The above relationship between the point $(x_{w,i}, y_{w,i}, 0)$ in the physical world and the point $(x_c, y_c)$ in a captured image may also be written as:

$$H_{31,i} x_c x_{w,i} + H_{32,i} x_c y_{w,i} + H_{33,i} x_c = H_{11,i} x_{w,i} + H_{12,i} y_{w,i} + H_{13,i},$$
$$H_{31,i} y_c x_{w,i} + H_{32,i} y_c y_{w,i} + H_{33,i} y_c = H_{21,i} x_{w,i} + H_{22,i} y_{w,i} + H_{23,i}. \qquad (5)$$

For each imaging device, a perspective-transformation matrix $H_i$ needs to be determined for each plane $L_i$ falling within the FOV thereof. The computer vision processing block 146 uses a calibration process to determine a perspective-transformation matrix for each plane in the FOV of each imaging device.
In particular, for a plane $L_i$, $1 \le i \le n$, falling within the FOV of an imaging device, the computer vision processing block 146 first selects a set of four (4) or more points on plane $L_i$ with known 3D physical-world coordinates, such as corners of a floor tile, corners of doors and/or window openings, of which no three points are in the same line, and sets their z-values to zero. The computer vision processing block 146 also identifies the set of known points from the background image and determines their 2D coordinates therein. The computer vision processing block 146 then uses a suitable optimization method such as a singular value decomposition (SVD) method to determine a perspective-transformation matrix $H_i$ for plane $L_i$ in the FOV of the imaging device. After determining the perspective-transformation matrix $H_i$, a point on plane $L_i$ can be mapped to a point in an image, or a point in an image can be mapped to a point on plane $L_i$, by using equation (5).
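A possible sketch of this per-plane calibration, and of mapping an image point (such as a BBTP) back onto the plane, using OpenCV's homography estimation (which solves the same over-determined system that the SVD method addresses); the function names are illustrative assumptions:

```python
import cv2
import numpy as np

def calibrate_plane_homography(world_xy, image_xy):
    """Estimate the 3x3 perspective-transformation matrix H_i for one plane
    from four or more point correspondences (world points with z set to 0)."""
    H, _ = cv2.findHomography(np.asarray(world_xy, np.float32),
                              np.asarray(image_xy, np.float32))
    return H   # maps (x_w, y_w, 1) to image coordinates up to scale

def image_point_to_plane(H, xc, yc):
    """Map an image point (e.g., a BBTP) back onto the plane using H^-1."""
    p = np.linalg.inv(H) @ np.array([xc, yc, 1.0])
    return p[0] / p[2], p[1] / p[2]
```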
The calibration process may be executed for an imaging device only once at the setup of the system 100, periodically such as during maintenance, or as needed such as when repairing or replacing the imaging device. The calibration process is also executed after the imaging device is reoriented, or zoomed and focused.
During mobile object tracking, the computer vision processing block 146 detects FFCs from each captured image as described above. For each detected FFC, the computer vision processing block 146 determines the coordinates $(x_c, y_c)$ of the BBTP of the FFC in the captured image, and determines the plane, e.g., $L_k$, that the BBTP of the FFC falls within, with the assumption that the BBTP of the FFC, when mapped to the 3D physical world coordinate system, is on plane $L_k$, i.e., the z-coordinate of the BBTP equals that of plane $L_k$. The computer vision processing block 146 then calculates the coordinates $(x_{w,k}, y_{w,k}, 0)$ of the BBTP in a 3D physical world coordinate system with respect to the imaging device and plane $L_k$ (denoted as a "local 3D coordinate system") using the above equation (5), and translates the coordinates of the BBTP into a location $(x_{w,k}+\Delta x,\ y_{w,k}+\Delta y,\ z_k)$ in the 3D physical world coordinate system of the site (denoted as the "global 3D coordinate system"), wherein $\Delta x$ and $\Delta y$ are the differences between the origins of the local 3D coordinate system and the global 3D coordinate system, and $z_k$ is the elevation of plane $L_k$.
For example, Fig. 11A is a 3D perspective view of a portion 502 of a site 102 falling within the FOV of an imaging device, and Fig. 11B is a plan view of the portion 502. For ease of illustration, the axes of a local 3D physical world coordinate system with respect to the imaging device are also shown, with Xw and Yw representing the two horizontal axes and Zw representing the vertical axis. As shown, the site portion 502 comprises a horizontal, planar floor 504 having a plurality of tiles 506, and a horizontal, planar landing 508 at a higher elevation than the floor 504. As shown in Figs. 11C and 11D, the site portion 502 is partitioned into two planes L1 and L2, with plane L2 corresponding to the floor 504 and plane L1 corresponding to the landing 508. Plane L1 has a higher elevation than plane L2.
As shown in Fig. 11E, during calibration of the imaging device, the computer vision processing block 146 uses the corners A1, A2, A3 and A4 of the landing 508, whose physical world coordinates $(x_{w1}, y_{w1}, z_{w1})$, $(x_{w2}, y_{w2}, z_{w1})$, $(x_{w3}, y_{w3}, z_{w1})$ and $(x_{w4}, y_{w4}, z_{w1})$, respectively, are known, with $z_{w1}$ also being the elevation of plane L1, to determine a perspective-transformation matrix H1 for plane L1 in the imaging device. Fig. 11F shows a background image 510 captured by the imaging device. As described above, the computer vision processing block 146 sets $z_{w1}$ to zero, i.e., sets the physical world coordinates of the corners A1, A2, A3 and A4 to $(x_{w1}, y_{w1}, 0)$, $(x_{w2}, y_{w2}, 0)$, $(x_{w3}, y_{w3}, 0)$ and $(x_{w4}, y_{w4}, 0)$, respectively, determines their image coordinates $(x_{c1}, y_{c1})$, $(x_{c2}, y_{c2})$, $(x_{c3}, y_{c3})$ and $(x_{c4}, y_{c4})$, respectively, in the background image 510, and then determines a perspective-transformation matrix H1 for plane L1 in the imaging device by using these physical world coordinates and the corresponding image coordinates.
7 Also shown in Figs. 11E and 11F, the computer vision processing 8 block 146 uses the four corners Q1, Q2, Q3 and Q4 of a tile 506A to determine a 9 perspective-transformation matrix H2 for plane L2 in the imaging device in a similar manner.
After determining the perspective-transformation matrices H1 and H2, the computer vision processing block 146 starts to track mobile objects in the site 102. As shown in Fig. 12A, the imaging device captures an image 512, and the computer vision processing block 146 identifies therein an FFC 514 with a bounding box 516, a centroid 518 and a BBTP 520. The computer vision processing block 146 determines that the BBTP 520 is within the plane L2, and then uses equation (5) with the perspective-transformation matrix H2 and the coordinates of the BBTP 520 in the captured image 512 to calculate the x- and y-coordinates of the BBTP 520 in the 3D physical coordinate system of the site portion 502 (Fig. 12B). As shown in Fig. 12C, the computer vision processing block 146 may further translate the calculated x- and y-coordinates of the BBTP 520 to a pair of x- and y-coordinates of the BBTP 520 in the site 102.

VI. FFC tracking
The network arbitrator component 148 updates FFC-tag association and the computer vision processing block 146 tracks an identified mobile object at step 236 of Fig. 5B. Various mobile object tracking methods are readily available in different embodiments.
For example, in one embodiment, each FFC in a captured image stream is analyzed to determine FFC characteristics, e.g., the motion of the FFC. If the FFC cannot be associated with a tag device without the assistance of tag measurements, the network arbitrator component 148 requests candidate tag devices to obtain the required tag measurements over a predefined period of time. While the candidate tag devices are obtaining tag measurements, the imaging devices continue to capture images and the FFCs therein are further analyzed. The network arbitrator component 148 then calculates the correlation between the determined FFC characteristics and the tag measurements received from each candidate tag device. The FFC is then associated with the tag device whose tag measurements exhibit the highest correlation with the determined FFC characteristics.
For example, a human object in the FOV of the imaging device walks for a distance along the x-axis of the 2D image coordinate system, pauses, and then turns around and walks back retracing his path. The person repeats this walking pattern four times. The imaging device captures the person's walking.

Fig. 13 shows a plot of the BBTP x-axis position in captured images.
The vertical axis represents the BBTP's x-axis position (in pixels) in captured images, and the horizontal axis represents the image frame index. It can be expected that, if the accelerometer in the person's tag device records the acceleration measurement during the person's walking, the magnitude of the acceleration will be high when the person is walking, and when the person is stationary, the magnitude of the acceleration is small. Correlating the acceleration measurement with the FFC observation made from captured images thus allows the system 100 to establish the FFC-tag association with high reliability.
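One way such a correlation could be computed is sketched below: camera-side activity is taken as the per-window spread of the BBTP x-position and tag-side activity as the per-window spread of the acceleration magnitude, and the two activity sequences are correlated. The windowing scheme and the use of the standard deviation as the activity measure are illustrative assumptions:

```python
import numpy as np

def association_score(bbtp_x_positions, accel_magnitudes, fps, imu_rate, window=2.0):
    """Compare per-window activity seen by the camera (BBTP displacement)
    with activity reported by a candidate tag (acceleration variation)."""
    cam_win = max(1, int(window * fps))
    imu_win = max(1, int(window * imu_rate))

    cam_activity = [np.std(bbtp_x_positions[i:i + cam_win])
                    for i in range(0, len(bbtp_x_positions) - cam_win, cam_win)]
    imu_activity = [np.std(accel_magnitudes[i:i + imu_win])
                    for i in range(0, len(accel_magnitudes) - imu_win, imu_win)]

    n = min(len(cam_activity), len(imu_activity))
    if n < 2 or np.std(cam_activity[:n]) == 0 or np.std(imu_activity[:n]) == 0:
        return 0.0
    # Normalized correlation; the candidate tag with the highest score is
    # the best FFC-tag association candidate.
    return float(np.corrcoef(cam_activity[:n], imu_activity[:n])[0, 1])
```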
Mapping an FFC from the 2D image coordinate system into the 3D physical world coordinate system may be sensitive to noise and errors in the analysis of captured images and FFC detection. For example, mapping the BBTP and/or the centroid of an FFC to the 3D physical world coordinate system of the site may be sensitive to errors such as the errors in determining the BBTP and centroid due to poor processing of shadows; mobile objects may occlude each other; and specular lighting results in shadow distortions that may cause more errors in BBTP and centroid determination. Such errors may make the perspective mapping from a captured image to the 3D physical world coordinate system of the site noisy, and even unreliable in some situations.
17 Other mobile object tracking methods using imaging devices exploit 18 the fact that the motions of mobile objects are generally smooth across a set of 19 consecutively captured images, to improve the tracking accuracy.
With the recognition that perspective mapping may introduce errors, in one embodiment, no perspective mapping is conducted and the computer vision processing block 146 tracks FFCs in the 2D image coordinate system. The advantage of this embodiment is that the complexity and ambiguities of the 2D-to-3D perspective mapping are avoided. However, the disadvantage is that the object morphing as the object moves in the camera FOV may give rise to errors in object tracking. Modelling object morphing may alleviate the errors caused therefrom, but it requires additional random variables for unknown parameters in the modelling of object morphing or additional variables as ancillary state variables, increasing the system complexity.
In another embodiment, the computer vision processing block 146 uses an extended Kalman filter (EKF) to track mobile objects using the FFCs detected in the captured image streams. When ambiguity occurs, the computer vision processing block 146 requests candidate tag devices to provide tag measurements to resolve the ambiguity. In this embodiment, the random state variables of the EKF are the x- and y-coordinates of the mobile object in the 3D physical world coordinate system, following a suitable random motion model such as a random walk model if the mobile object is in a relatively open area, or a more deterministic motion model with random deviation around a nominal velocity if the mobile object is in a relatively directional area, e.g., a hallway.
Following the EKF theory, observations are made at discrete time steps, each of which corresponds to a captured image. Each observation is the BBTP of the corresponding FFC in a captured image. In other words, the x- and y-coordinates of the mobile object in the 3D physical world coordinate system are mapped to the 2D image coordinate system, and then compared with the BBTP using the EKF for predicting the motion of the mobile object.


Mathematically, the random state variables, collectively denoted as a state vector, for the nth captured image of a set of consecutively captured images are:

$$s_n = [x_{w,n},\ y_{w,n}]^T, \qquad (6)$$

where $[\cdot]$ represents a matrix and $[\cdot]^T$ represents matrix transpose. The BBTP of the corresponding FFC is thus the observation of $s_n$ in captured images.
In the embodiment that the motion of the mobile object is modelled as a random walk, the movement of each mobile object is modelled as an independent first order Markov process with a state vector of $s_n$. Each captured image corresponds to an iteration of the EKF, wherein a white Gaussian noise is added to each component $x_{w,n}$, $y_{w,n}$ of $s_n$. The state vector $s_n$ is then modelled based on a linear Markov Gaussian model as:

$$s_n = A s_{n-1} + B u_n, \qquad (7)$$

with $u_n$ being a Gaussian vector with the update covariance of

$$Q_n = E[u_n u_n^T] = \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}. \qquad (8)$$

In other words, the linear Markov Gaussian model may be written as:

$$x_{w,n} = x_{w,n-1} + u_{x,n}, \qquad y_{w,n} = y_{w,n-1} + u_{y,n}, \qquad (9)$$

where

$$\begin{bmatrix} u_{x,n} \\ u_{y,n} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix} \right), \qquad (10)$$

i.e., each of $u_{x,n}$ and $u_{y,n}$ is a zero-mean normal distribution with a standard deviation of $\sigma_u$.
Equation (7) or (9) gives the state transition function. The values of matrices A and B in Equation (7) depend on the system design parameters and the characteristics of the site 102. In this embodiment,

$$A = B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \qquad (11)$$

The state vector $s_n$ is mapped to a position vector $[x_{c,n},\ y_{c,n}]^T$ in the image coordinate system of the captured image using perspective mapping (equations (1) to (3)), i.e.,

$$\begin{bmatrix} f_{x,n} \\ f_{y,n} \\ f_{v,n} \end{bmatrix} = H \begin{bmatrix} x_{w,n} \\ y_{w,n} \\ 1 \end{bmatrix}, \qquad (12)$$

$$x_{c,n} = \frac{f_{x,n}}{f_{v,n}}, \qquad (13)$$

$$y_{c,n} = \frac{f_{y,n}}{f_{v,n}}. \qquad (14)$$

Then, the observation, i.e., the position of the BBTP in the 2D image coordinate system, can be modelled as:

$$z_n = h(s_n) + w_n, \qquad (15)$$

where $z_n = [z_1, z_2]^T$ is the coordinates of the BBTP with $z_1$ and $z_2$ representing the x- and y-coordinates thereof,

$$h(s_n) = \begin{bmatrix} h_x(s_n) \\ h_y(s_n) \end{bmatrix} = \begin{bmatrix} x_{c,n} \\ y_{c,n} \end{bmatrix} = \begin{bmatrix} f_{x,n}/f_{v,n} \\ f_{y,n}/f_{v,n} \end{bmatrix} \qquad (16)$$

is a nonlinear perspective mapping function, which may be approximated using a first order Taylor series thereof, and

$$w_n = \begin{bmatrix} w_{x,n} \\ w_{y,n} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_z^2 & 0 \\ 0 & \sigma_z^2 \end{bmatrix} \right), \qquad (17)$$

i.e., each of the x-component $w_{x,n}$ and the y-component $w_{y,n}$ of the noise vector $w_n$ is a zero-mean normal distribution with a standard deviation of $\sigma_z$.
8 At step 702, to start the EKF, the initial state vector s(010) and the 9 corresponding posteriori state covariance matrix, M(010), are determined.
The initial state vector corresponds to the location of a mobile object before the imaging 11 device captures any image. In this embodiment, if the location of a mobile object is 12 unknown, its initial state vector is set to be at the center of the FOV
of the imaging 13 device with a zero velocity, and the corresponding posteriori state covariance matrix 14 M(010) is set to a diagonal matrix with large values, which will force the EKF to disregard the initial information and base the first iteration entirely on the FFCs 16 detected in the first captured image. On the other hand, if the location of a mobile 17 object is unknown, e.g., via a RFID device at an entrance as described above, the 18 initial state vector s(010) is set to the known location, and the corresponding 19 posteriori state covariance matrix M(010) is set to a zero matrix (a matrix with all elements being zero).
21 At step 704, a prediction of the state vector is made:

= .
s (ran - 1) = s(n - tin - 1). (18) 1 At step 706, the prediction state covariance is determined:
M (nin ¨ 1) = M (n ¨ 1In ¨ 1) + Qu (19) 2 where =
Qu = [(ILI. 0 ]
j. (20) I_ 0 a 3 At step 708, the Kalman gain is determined:
K (n) = Wnin ¨ 1)H (n)T (H (n) M (nin ¨ 1)H (n)T + Qw)-1- (21) 4 where H(n) is the Jacobian matrix of h(s(nin-1)), Fah.,(s(nin ¨ 1)) ali.,(s(nin ¨ 1)) . .
axõ aymn a hy(s(nin - 1)) ahy(s(nin - 1)) (22) a Yw,n At step 710, prediction correction is conducted. The prediction error is 6 determined based on difference between the predicted location and the BBTP
7 location in the captured image: =
2n = - h(s (nin ¨ 1)). (23) z,n 8 Then, the updated state estimate is given as:
s(nin) = s(nin - 1) + K (n)2n. (24) 9 At step 712, the posterior state covariance is calculated as:
M (nin) = (I ¨ K (n)H (n))M (nin ¨ 1), (25) with I representing an identity matrix.
11 An issue of using the random walk model is that mobile object tracking 12 may fail when the object is occluded. For example, if a mobile object being tracked =

1 is occluded in the FOV of the imaging device, the EKF would receive no new 2 observations from consequent images. The EKF tracking would then stop at the last 3 predicted state, which is the state determined in the. previous iteration, and the 4 Kalman gain will go instantly to zero (0). The tracking thus stops.
This issue can be alleviated by choosing a different 2D model of pose 6 being a random walk model and using the velocity magnitude (i.e., the speed) as an 7 independent state variable. The speed will also be a random walk but with a 8 tendency towards zero (0), i.e., if no observations are made related to speed then it 9 will exponentially decay towards zero (0).
Now consider the EKF update when the object is suddenly occluded 11 such that there are no new measurements. In this case speed state will slowly 12 decay towards zero with settable decay parameter, but generally with high 13 probability. When the object emerges from the occlusion, it would not be too far 14 from the EKF tracking point such that, with the restored measurement quality, accurate tracking can resume. The velocity decay factor used in this model is 16 heuristically set based on the nature of the moving objects in the FOV.
For example, 17 if the mobile objects being tracked are travelers moving in an airport gate area, the 18 change in velocity of bored travelers milling around killing time will be higher and 19 less predictable than people walking purposively down a long corridor.
As each imaging device is facing an area with known characteristics, model parameters can 21 be customized and refined according to the known characteristics of the area and 22 past experience.

=

1 Those skilled in the art appreciate that the above EKF tracking is 2 merely one example of implementing FFC tracking, and other tracking methods are 3 readily available. Moreover, as FFC tracking is conducted in the computer cloud 4 108, the computational cost is generally of less concern, and other advanced tracking methods, such as Bayesian filters, can be used. If the initial location of a 6 mobile object is accurately known, then a Gaussian kernel. may be used. However, 7 if a mobile object is likely in the FOV but its initial location of is unknown, a particle 8 filter (PF) may be used, and once the object becomes more accurately tracked, the 9 PF can be switched to an EKF for reducing computational complexity. When multiple mobile objects are continuously tracked, computational resources can be 11 better allocated by dynamically switching object tracking between PF and EKF, i.e., 12 using EKF to track the mobile objects that have been tracked with higher accuracy, 13 and using PF to track the mobile objects not yet being tracked, or being tracked but 14 with low accuracy.
A limitation of the EKF as established earlier is that the site map is not 16 easily accounted for. Neither are the inferences which are only very roughly 17 approximated as Gaussian as required for the EKF.
18 In an alternative embodiment, non-parametric Bayesian processing is 19 used for FFC tracking by leveraging the knowledge of the site.
In this embodiment, the location of a mobile object in room 742 is represented by a two dimensional probability density function (PDF) p,,,y. If the area 22 in the FOV of an imaging device is finite with plausible boundaries, the area is discretized into a grid, and each grid point is considered to be a possible location for 1 mobile objects. The frame rates of the imaging devices are sufficiently high such 2 that, from one captured image to the next, a mobile object would appear therein 3 either stay at the same grid point or move from a grid point to an adjacent grid point.
4 Fig.
15A shows an example of two imaging devices CA and CB with overlapping FOVs covering an L-shaped room 742. As shown, the room 742 is connected to rooms 744 and 746 via doors 748 and 750, respectively. Rooms 744 7 and 746 are uncovered by imaging devices CA and CB. Moreover, there exist areas uncovered by both CA and CB. An access point (AP) is installed in this room 9 742 for sensing tag devices using RSS measurement.
When a mobile object having a tag device enters room 742, the RSS

measurement indicates that a tag device/mobile object is in the room. However, 12 before processing any captured images, the location of the mobile device is 13 unknown.
14 As shown in Fig. 15B, the area of the room 742 is discretized into a grid having a plurality of grid points 762, each representing a possible location for 16 mobile objects. In this embodiment, the distance between two adjacent grid points along the x- or y- axis is a constant. In other words, each grid point may be expressed as: (iAx, jAy) with Ax and Ay being constants and i and j being integers.
19 Ax and Ay are design parameters that depend on the application and implementation.
21 The computer vision processing block 146 also builds a state diagram 22 of the grid points described the transition of a mobile object from one grid point to another. The state diagram of the grid points is generally a connected graph whose 1 properties change with observations made from the imaging device and the tag 2 device. A state diagram for room 742 would be too complicated to show herein. For 3 ease of illustration, Fig. 16A shows an imaginary, one-dimensional room partitioned 4 to 6 grid points, and Fig. 16B shows the state diagram for the imaginary room of Fig.
16A. In this example, the walls are considered reflective, i.e., a mobile object in grid 6 point 1 can only choose to stay therein or move to grid point 2, and a mobile object 7 in grid point 6 can only choose to stay therein or move to grid point 5.
8 Referring back to Figs. 15A and 15B, as the room 742 is discretized 9 into a plurality of grid points 762, the computer vision processing block associates a belief probability with each grid point as the possibility that the mobile 11 object to be tracked is at that point. The computer vision processing block 146 then 12 considers that the motion of mobile objects follows a first order Markov model, and 13 uses a Minimum Mean Square Error (MMSE) location estimate method to track the 14 mobile object.
Let pf j denote the location probability density function (pdf) or 16 probability mass function (pmf) that the mobile object is at the location (OA, jAy) at 17 the time step t. Initially, if the location of the mobile object is unknown, the location 18 pdf pf:j is set to be uniform over all grid points, i.e., pid = x¨y , for i = 1, ..., X, and] = 1, Y
(26) 19 where X is the number of grid points along the x-axis and Y is the number of grid points along the y-axis.

1 Based on the Markov model, pr.1 is only dependent on the previous probability pfil, the current update and the current BBTP position zt, pit .1 may be computed using a numerical procedure. The minimum variance estimate of the 4 mobile object location is then based on the mean of this pdf.
From one time step to the next, the mobile object may stay at the 6 same grid point or move to one of the adjacent grid points, each of which is associated with a transition probability. Therefore, the expected (i.e., not yet compared with any observations) transition of the mobile object from time step t to 9 time step t+1, or equivalently, from time step t-1 to time.step t, may be described by a transition matrix consisting of these transition probabilities:
pL = Tpt-1, (27) 11 where pf, is a vector consisting of expected location pdfs at time step t, pt-1 is a 12 vector consisting of the location pdfs KJ at time step t-1, and T is the state transition 13 matrix.
14 Matrix T describes the probabilities that mobile object transiting from one grid point to another. Matrix T describes boundary conditions, including reflecting boundaries and absorbing boundaries. A reflecting boundary such as a 17 wall means that a mobile object has to turn back when approaching the boundary.
18 An absorbing boundary such as a door means that a 'mobile object can pass 19 therethrough, and the probability of being in the area diminishes accordingly.
When an image of the area 742 is captured and a BBTP is determined therein, the location of the BBTP is mapped via perspective mapping to the 3D

physical world coordinate system of the area 742 as an observation. Such an 1 observation may be inaccurate, and its pdf, denoted as pnp,i, j, may be modelled 2 as a 2D Gaussian distribution.
3 Therefore, the location pdfs p, or the matrix rot thereof, at time step t 4 may be updated from that at time step t-1 and the BBTP observation as:
Pt = 7/Pit3sTpTPL-1, (28) where Pit3BTp is a vector of P[3BTp,i j at time step t, and rl is a scaler to ensure the 6 updated location pdf pTj can be added to one (1).
7 Equation (28) calculates the posterior location probability pdf pt based 8 on the BBTP data obtained from the imaging device. The peak or maximum of the 9 updated pdf p1, or pt in matrix form, indicates the most likely location of the mobile object. In other words, if the maximum of the updated pdf p1 is at i=ik and j=jk, the 11 mobile object is most likely at the grid point (ikAx, jkAy). With more images being 12 captured, the mobile location pdf pj is further updated using equation (28) to obtain 13 updated estimate of the mobile object location.
14 With this method, if the BBTP is of high certainty then the posterior location probability pdf pt quickly becomes a delta function, giving rise to high 16 certainty of the location of the mobile object.
17 For example, if a mobile object at (ix, jay) is static from time step t =
18 1 to time step t = k, then equation (28) becomes = Hmt3BTp,iipp,i, (29) 1 which becomes a "narrow spike" with the peak at (i,j) after several iterations, and 2 the variance of the MMSE estimate of the object location diminishes.
3 Figs. 17A and 17B show a deterministic example, where a mobile 4 object is moving to the right hand side along the x-axis in the FOV of an imaging device. Fig. 17A is the state transition diagram, showing that the mobile object is 6 moving to the right with probability of one (1). The computer vision processing block 7 146 tests the first assumption that the mobile object is stationary and the second 8 assumption that the mobile object is moving, by using a set of consecutively 9 captured image frames and equation (28). The test results are show in Fig. 17B. As can be seen, while at first several image frames or iterations, both assumptions 11 show similar likelihood, the assumption of a stationary object quickly diminishes to 12 zero probability but the assumption of a moving object grows to a much higher 13 probability. Thus, the computer vision processing block 146 can decide that the 14 object is moving, and may request candidate tag devices to provide IMU
measurements for establishing FFC-tag association. =
16 Figs. 18A to 18E show another example, where a mobile object is 17 slewing, i.e., moving with uncertainty, to the right hand side along the x-axis in the 18 FOV of an imaging device. Fig. 18A is the state transition diagram, showing that, in 19 each transition from one image to another, the mobile object may stay at the same grid point with a probability of q, and may move to the adjacent grid point on the 21 right hand side with a probability of (1-q). Hence the average slew velocity is:
Ax Vavg = (1 ¨ q) ¨At' (30) =

2 Figs.
18B and 18C show the tracking results using equation (28) with 3 q =
0.2. Fig. 18B shows the mean of x- and y-coordinates of the mobile object, 4 which accurately tracked the movement of the mobile object. Fig. 18C shows the standard deviation (STD) of x- and y-coordinates of the mobile object, denoted as 6 STDx and STDy. As can be seen, both STDx and STDy start with a high value (because the initial location PDF is uniformly distributed). STDy quickly reduced to 8 about zero (0) because, in this example, no uncertainty exists along the y-axis 9 during mobile object tracking. STDx quickly reduced from a large initial value to a steady state with a low but non-zero probability due to the non-zero probability q.
11 Other grid based tracking methods are also readily available. For example, instead of using a Gaussian model for the BBTP, a different model designed with consideration of the characteristics of the site, such as its geometry, lighting and the like, and the FOV of the imaging device may be used to provide accurate mobile object tracking.
16 In above embodiment, the position (x, y) of the mobile object is used 17 as the state variables. In an alternative embodiment, the position (x, y) and the velocity (vx, vy) of the mobile object are used as the sta. te=variables. In yet another 19 embodiment, speed and pose may be used as state variables.
In above embodiments, the state transition matrix T is determined without assistance of any tag devices. In an alternative embodiment, the network 22 arbitrator component 148 requests tag devices to provide necessary tag measurement for assistance in determining the state transition matrix T. Fig.
19 is a 1 schematic diagram showing the data flow for determining the state transition matrix 2 T. The computer vision processing block uses computer vision technology to 3 process (block 802) images captured from imaging devices 104, and tracks (block 4 804) FFC using above described BBTP based tracking. The BBTPs are sent to the network arbitrator component 148, and the network arbitrator component 148 6 accordingly requests tag arbitrator components 146. to provide necessary tag 7 measurements. A state transition matrix T is then generated based on obtained tag 8 measurements, and is sent to the computer vision processing block 146 for mobile 9 object tracking.
The above described mobile object tracking using a first order Markov 11 model and grid discretization is robust and computationally efficient.
Ambiguity 12 caused by object merging/occlusion may be resolved using a prediction-observation 13 method (described later). Latency in mobile object tracking (e.g., due to the 14 computational load) is relatively small (e.g., several seconds), and is generally acceptable.
16 The computer vision processing structure 146 provides information 17 regarding the FFCs observed and extracts attributes thereof, including observables 18 such as the bounding box around the FFC, color histogram, intensity, variations 19 from one image frame to another, feature points within the FFC, associations of adjacent FFCs that are in a cluster and hence are part of the same mobile object, 21 optical flow of the FFC and velocities of the feature points, undulations of the overall 22 bounding box and the like. The observables of the FFCs are stored for facilitating, if 23 needed, the comparison with tag measurements.

1 For example, the computer vision processing structure 146 can 2 provide a measurement of activity of the bounding box of an FFC, which is used to 3 compare with similar activity measurement obtained the tag device 114. After normalization a comparison is made resulting in a numerical value for the likelihood indicating whether the activity observed by the computer vision processing structure 6 146 and tag device 114 are the same. Generally a Gaussian weighting is applied 7 based on parameters that are determined experimentally. As another example, the position of the mobile object corresponding to an FFC in the site, as determined via 9 the perspective mapping or transformation from the captured image, and the MMSE
estimate of the mobile object position can be correlated with observables obtained 11 from the tag device 114. For instance, the velocity observed from the change in the position of a person indicates walking, and the tag device reveals a gesture of 13 walking based on IMU outputs. However, such as gesture may be weak if the tag 14 device is attached to the mobile object in such a manner that the gait is weakly detected, or may be strong if the tag device is located in the foot of the person.
16 Fuzzy membership functions can be devised to represent the gesture. This fuzzy 17 output can be compared to the computer vision analysis result to determine the 18 degree of agreement or correlation of the walking activity. In some embodiments, 19 methods based on fuzzy logic may be used for assisting mobile object tracking.
In another example, the computer vision processing structure 146 determines that the bounding box of an FFC has become stationary and then 22 shrunk to half the size. The barometer of a tag device reveals a step change in 23 short 'term averaged air pressure commensurate with an altitude change of about =

=
1 two feet. Hence the tag measurement from the tag device's barometer would 2 register a sit down gesture of the mobile object. However, due to noise and 3 barometer drift as well as spurious changes in room air pressure the gesture is 4 probabilistic. The system thus correlates the tag measurement and computer vision analysis result, and calculates a probability representing the degree of certainty that 6 the tag measurement and computer vision analysis result match regarding the 7 sitting activity.
8 With above examples, those skilled in the art appreciate that, the 9 system determines a degree of certainty of a gesture or activity based on the correlation between the computer vision (i.e., analysis of captured images) and the 11 tag device (i.e., tag measurements). The set of such correlative activities or 12 gestures are then combined and weighted for calculating the certainty, represented 13 by a probability number, that the FFC may be associated with the tag device.

Object merging and occlusion 16 Occlusion may occur between mobile objects, and between a mobile 17 object and a background object. Closely positioned mobile objects may be detected 18 as a single FFC.
19 Figs. 20A to 20E show an example of merging/occlusion of two mobile objects 844 and 854. As shown in Fig. 20A, the two mobile objects 844 and 854 are 21 sufficiently apart and they show in a captured image 842A as separate FFCs 844 22 and 854, having their own bounding box 846 and 856 and BBTPs 848 and 858, 23 respectively.

1 As shown in Figs. 20B to 200, when mobile objects 844 and 854 are 2 moving close to each other, they are detected as a single FFC 864 with a bounding 3 box 866 and a BBTP 868. The size of the single FFC 854. may vary depending the 4 occlusion between the two mobile objects and/or the distance therebetween.
Ambiguity may occur as it may appear that the two previously detected mobile 6 objects 844 and 854 disappear with a new mobile object 864 appearing at the same 7 location.
8 As shown in Fig. 20E, when the two mobile objects have moved apart 9 with sufficient distance, two FFCs are again detected. Ambiguity may occur as it may appear that the previously detected mobile object 864 disappears with two new 11 mobile objects 844 and 854 appearing at the same location.
12 Figs.
21A to 21E show an example that a mobile object is occluded by 13 a background object.
14 Fig.
21A shows a background image 902A having a tree 904 therein as a background object.

mobile object 906A is moving towards the background object 904, 17 and passes the background object 904 from behind. As shown in Fig. 21B, in the 18 captured image 902B, the mobile object 906A is not yet occluded by the background object 904, and the entire image of mobile object 906 is detected as an FFC 906A with a bounding box 908A and a BBTP 910A. In Fig. 21C, the mobile 21 object 906A is slightly occluded by the background object 904 and the FFC 906A, bounding box 908A and BBTP 910A are essentially the same as those detected in 23 the image 902B (except position difference).

1 In Fig.
21D, the mobile object is significantly occluded by the background object 904. The detected FFC 906B is now significantly smaller that the 906A in images 902B and 902C. Moreover, the BBTP 910B is at a much 4 higher position than 910A in images 902B and 902C. Ambiguity may occur as it may appear that the previously detected mobile object 906A disappears and a new 6 mobile object 906B appears at the same location.
7 As shown in Fig. 21E, when the mobile object 906A walks out of the occlusion of the background object 904, a "full" FFC 906A much larger than 906B is detected. Ambiguity may occur as it may appear that the previously detected mobile object 906B disappears and a new mobile object 906A appears at 11 the same location.
12 As described before, the frame rate of the imaging device is sufficiently high, and the mobile object movement is therefore reasonably smooth.
14 Then, ambiguity caused by object merging/occlusion can be resolved by a prediction-observation method, i.e., predicting the action of the mobile object and comparing the prediction with observation obtained from captured images and/or 17 tag devices.
18 For example, the mobile object velocity and/or trajectory may be used 19 as random state variables, and above described tracking methods may be used for prediction. For example, the system may predict the locations and time instants that 21 a mobile object may appear during a selected period of future time, and monitor the 22 FFCs during the selected period of time. If the FFCs appear to largely match the prediction, e.g., the observed velocity and/or trajectory highly correlated with the =

1 prediction (e.g., their correlation higher than a predefined or dynamically set 2 threshold), then the FFCs are associated with the same tag device even if in some 3 moments/images abnormity of FFC occurred, such as size of the FFC
significantly 4 changed, BBTP significantly moved off the trajectory, FFC disappeared or appeared, and the like.
6 If the ambiguity cannot be resolved solely from captured images, tag 7 measurements may be requested to obtain further observation to resolve the 8 ambiguity.

VII. Some alternative embodiments 11 In an alternative embodiment, the system 100 also comprises a map 12 of magnetometer abnormalities (magnetometer abnormality map). The system may 13 request tag devices having magnetometers to provide magnetic measurements and 14 compare with the magnetometer abnormality map for tracking resolving ambiguity occurred during mobile object tracking.
In the above embodiments, tag devices 114 comprise sensors for collecting tag measurements, and tag devices 114 transmit tag measurements to the computer cloud 108. In some alternative embodiments, at least some tag devices 114 may comprise a component broadcasting, continuously or intermittently, a detectable signal, and one or more sensors for detecting such a detectable signal are deployed in the site. The one or more sensors detect the detectable signal, obtain measurements of one or more characteristics of the tag device 114, and transmit the obtained measurements to the computer cloud 108 for establishing FFC-tag association and resolving ambiguity. For example, in one embodiment, each tag device 114 may comprise an RFID transmitter transmitting an RFID identity, and one or more RFID readers are deployed in the site 102, e.g., at one or more entrances, for detecting the RFID identity of the tag devices in proximity therewith. As another example, each tag device 114 may broadcast a BLE beacon. One or more BLE access points may be deployed in the site 102 for detecting the BLE beacon of a tag device and determining an estimated location using RSS. The estimated location, although inaccurate, may be transmitted to the computer cloud for establishing FFC-tag association and resolving ambiguity.
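As a hedged illustration of this RSS-based coarse localization, the sketch below inverts a log-distance path-loss model and takes a weighted centroid of the reporting BLE access points; the transmit power, path-loss exponent and coordinates are made-up example values, not parameters specified by the system.

```python
import numpy as np

def rss_to_distance(rss_dbm, tx_power_dbm=-59.0, path_loss_exp=2.5):
    """Invert a log-distance path-loss model: rss = tx_power - 10 * n * log10(d)."""
    return 10.0 ** ((tx_power_dbm - rss_dbm) / (10.0 * path_loss_exp))

def estimate_location(ap_positions, rss_values):
    """Weighted centroid of the BLE access points, weighted by inverse estimated distance."""
    d = np.array([rss_to_distance(r) for r in rss_values])
    w = 1.0 / np.maximum(d, 0.1)
    return (np.asarray(ap_positions) * w[:, None]).sum(axis=0) / w.sum()

# Three access points at known floor coordinates (metres) and one beacon report per AP.
aps = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]   # illustrative deployment
rss = [-68.0, -75.0, -71.0]                    # illustrative RSS readings (dBm)
print(estimate_location(aps, rss))             # coarse (x, y) estimate, accurate to a few metres
```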
VIII. Visual Assisted Indoor Location System (VAILS)

In an alternative embodiment, a Visual Assisted Indoor Location System (VAILS) is modified from the above-described systems and used for tracking mobile objects in a site that is a complex environment, such as an indoor environment.

VIII-1. VAILS system structure

Similar to the systems described above, the VAILS in this embodiment uses imaging devices, e.g., security cameras, and, if necessary, tag devices for tracking mobile objects in an indoor environment such as a building. Again, the mobile objects are entities moving or stationary in the indoor environment. At least some mobile objects are each associated with a mobile tag device such that the tag device generally undergoes the same activity as the mobile object with which it is associated. Hereinafter, such mobile objects associated with tag devices are sometimes denoted as tagged objects, and objects with no tag devices are sometimes denoted as untagged objects. While untagged objects may exist in the system, both tagged and untagged objects may be jointly tracked for higher reliability.
While sharing many common features with the systems described above, VAILS faces more tracking challenges, such as identifying mobile objects that more often enter and exit the FOV of an imaging device and are more often occluded by background objects (e.g., poles, walls and the like) and/or other mobile objects, causing ambiguity.
In this embodiment, VAILS maintains a map of the site and builds a birds-eye view, generally a view of the building floor space, by recording the locations of mobile objects onto the map. Conveniently, the system comprises a birds-eye view processing sub-module (as a portion of a camera view processing and birds-eye view processing module, described below) for maintaining the birds-eye view of the site (denoted the "birds-eye view (BV)" hereinafter for ease of description) and for updating the locations of mobile objects therein based on the tracking results. Of course, such a birds-eye view module may be combined with any other suitable module(s) to form a single module having the combined functionalities.
The software and hardware structures of VAILS are similar to those of the above-described systems. Fig. 22 shows a portion of the functional structure of VAILS, corresponding to the computer cloud 108 of Fig. 2. As shown, the computer vision processing module 108 of Fig. 2 is replaced with a camera view processing and birds-eye view processing (CV/BV) module 1002, having a camera view processing submodule 1002A and a birds-eye view processing submodule 1002B.
The submodules are implemented using suitable programming languages and/or libraries such as the OpenCV open-source computer vision library offered by opencv.org, MATLAB® offered by MathWorks, C++, and the like. Those skilled in the art appreciate that MATLAB® may be used for prototyping and simulation of the system, and C++ and/or OpenCV may be used for implementation in practice.

Hereinafter, the term "computer vision processing" is equivalent to the phrase "camera view processing", as the computer vision processing is for processing camera-view images.
In some alternative embodiments, the camera view processing and birds-eye view processing submodules 1002A and 1002B may be two separate modules.
The camera view processing submodule 1002A receives captured image streams (also denoted as camera views hereinafter) from imaging devices 104, processes the captured image streams as described above, and detects FFCs therefrom. The FFCs may also be denoted as camera view (CV) objects or blobs hereinafter.
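As an illustration of this detection step only, the following Python/OpenCV sketch shows a typical background-subtraction pipeline that produces bounding boxes and BBTPs for foreground blobs; the particular OpenCV operators, kernel size and minimum-area value are assumptions for the sketch, not the specific implementation described here.

```python
import cv2

# Background subtractor shared across the frames of one camera view.
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def detect_blobs(frame, min_area=500):
    """Return bounding boxes and BBTPs of foreground blobs in one camera-view frame."""
    mask = bg.apply(frame)
    mask[mask == 127] = 0                                   # drop pixels flagged as shadow
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blobs = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue
        x, y, w, h = cv2.boundingRect(c)
        bbtp = (x + w / 2.0, y + h)                         # bottom-centre of the bounding box
        blobs.append({"bbox": (x, y, w, h), "bbtp": bbtp})
    return blobs
```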
The birds-eye view processing sub-module 1002B uses the site map 1004 to establish a birds-eye view of the site and to map each detected blob into the birds-eye view as a BV object. Each BV object thus represents a mobile object in the birds-eye view, and may be associated with a tag device. In other words, blobs are in captured images (i.e., in camera view) and BV objects are in the birds-eye view of the site.
As shown in Fig. 23, a blob is associated with a tag device via a BV object. Of course, some BV objects may not be associated with any tag devices if their corresponding mobile objects do not have any tag devices associated therewith.

Referring back to Fig. 22, the blob and/or BV object attributes are sent from the CV/BV module 1002 to the network arbitrator 148 for processing and solving any possible ambiguity.
Similar to the description above, the network arbitrator 148 may request tag devices 114 to report observations, and use observations received from tag devices 114 and the site map 1004 to solve ambiguity and associate CV objects with tag devices. The CV object/tag device associations are stored in a CV object/tag device association table 1006. Of course, the network arbitrator 148 may also use the established CV object/tag device associations in the CV object/tag device association table 1006 for solving ambiguity. As will be described in more detail later, the network arbitrator 148 also leverages known initial conditions in establishing or updating CV object/tag device associations.
After processing, the network arbitrator 148 sends necessary data, including state variables, tag device information, and known initial conditions (described later), to the CV/BV module 1002 for updating the birds-eye view.

In this embodiment, the data representing the birds-eye view and camera view are stored and processed in the same computing device. Such an arrangement avoids frequent data transfer (or, in some implementations, file transfer) between the birds-eye view and camera views that may otherwise be required. The CV/BV module 1002 and the network arbitrator 148, on the other hand, may be deployed and executed on separate computing devices for improving the system performance and for avoiding the heavy computational load that would otherwise be applied to a single computing device. As the data transfer between the CV/BV module 1002 and the network arbitrator 148 is generally small, deploying the two modules 1002 and 148 on separate computing devices would not lead to a high data-transfer requirement. Of course, in embodiments in which multi-core or multi-processor computing devices are used, the CV/BV module 1002 and the network arbitrator 148 may be deployed on the same multi-processor computing device but executed as separate threads for improving the system performance.
One important characteristic of an indoor site is that the site is usually divided into a number of subareas, e.g., rooms and hallways, separated by predetermined structural components such as walls. Each subarea has one or more entrances and/or exits.
Fig. 24 is a schematic illustration of an example site 1020, which is divided into a number of rooms 1022, with entrances/exits 1024 connecting the rooms 1022. The site configuration, including the configuration of rooms, entrances/exits, predetermined obstacles and occlusion structures, is known to the system and is recorded in the site map. Each subarea 1022 is equipped with an imaging device 104. The FOV of each imaging device 104 is generally limited to the respective subarea 1022.
A mobile object 112 may walk from one subarea 1022 to another through the entrances/exits 1024, as indicated by the arrow 1026 and trajectory 1028. The cameras 104 in the subareas 1022 capture image streams, which are processed by the CV/BV processing module 1002 and the network arbitrator 148 for detecting the mobile object 112, mapping the detected mobile object 112 into a birds-eye view as a BV object, and determining the trajectory 1028 for tracking the mobile object 112.
When a "new" blob appears in the images captured by an imaging 11 device 104, the system uses initial conditions that are likely related to the new blob 12 to try to associate the new blob in the camera view with a BV object in the birds-eye 13 view and with a mobile object (in the real world). Herein, the initial conditions 14 include data already known by the system prior to the appearance of the new blob.
Initial conditions may include data regarding tagged mobile devices, and may also 16 include data regarding untagged devices.
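Purely as an illustrative sketch of what such an initial-condition (IC) packet might carry, the following Python data structure collects the fields discussed in this section; the field names and example values are hypothetical, not a format defined by the system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class InitialConditions:
    """One IC packet sent by the network arbitrator to the CV/BV module (illustrative only)."""
    camera_id: str                       # imaging device expected to see the new blob
    entrance_id: str                     # port where the blob is expected to appear
    expected_time: float                 # timestamp of the expected appearance (s)
    tag_id: Optional[str]                # None for an untagged object
    prior_probability: float = 1.0       # confidence in these initial conditions
    last_known_velocity: Tuple[float, float] = (0.0, 0.0)   # m/s in floor coordinates
    notes: List[str] = field(default_factory=list)           # e.g. "walking", "pushing cart"

# Example packet for a tagged object expected to enter through entrance 1024A.
ic = InitialConditions(camera_id="104A", entrance_id="1024A",
                       expected_time=1234.5, tag_id="tag-17",
                       prior_probability=0.95, last_known_velocity=(1.1, 0.2))
```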
For example, as shown in Fig. 25, a mobile object 112A enters room 1022A from the entrance 1024A and moves along the trajectory 1028 towards the entrance 1024B.
The mobile object 112A, when entering room 1022A, appears as a new blob (also referred to using numeral 112A for ease of description) in the images captured by the imaging device 104A of room 1022A. As the new blob 112A appears at the entrance 1024A, it is likely that the corresponding mobile object originated from the adjacent room 1022B, which shares the entrance 1024A with room 1022A.
As the network arbitrator 148 is tasked with overall process control and tracking the object using the camera view and tag device observations as input, the network arbitrator 148 in this embodiment has tracked the object outside of the FOV of the imaging device 104A (i.e., in room 1022B). Thus, in this example, when the mobile object 112A enters the FOV of the imaging device 104A, the network arbitrator 148 checks if there exists known data, prior to the appearance of the new blob 112A, regarding a BV object in room 1022B disappearing from the entrance 1024A. If the network arbitrator 148 finds such data, the network arbitrator collects the found data as a set of initial conditions and sends them as an IC packet to the CV/BV processing module 1002, or in particular the camera view processing submodule 1002A, and requests the camera view processing submodule 1002A to track the mobile object 112A, which is now shown in the FOV of the imaging device 104A as a new blob 112A in room 1022A.
The CV/BV module 1002, or more particularly the camera view processing submodule 1002A, continuously processes the image streams captured by the imaging device 104A for detecting blobs (in this example, the new blob 112A) and pruning detected blobs to establish a blob/BV object association, or a blob/BV object/tag device association, for the new blob 112A. For example, the blob may appear in the camera view of imaging device 104A as a plurality of sub-blobs repeatedly separating and merging (fission and fusion) due to the imperfection of image processing. Such fission and fusion can be simplified by pruning. The knowledge of the initial conditions allows the camera view processing submodule 1002A to further prune and filter the blobs.
The pruned graph of blobs is then recorded in an internal blob track file (IBTF). The data in the IBTF records the history of each blob (denoted as a blob track), which may be used to construct a timeline history diagram such as the one described later, and is searchable by the birds-eye view processing submodule 1002B or network arbitrator 148. However, the IBTF contains no information that cannot be abstracted directly from the camera-view image frames.
In other words, the IBTF does not contain any information from the network arbitrator 148 as initial conditions, nor any information from the birds-eye view fed back to the camera view. As described above, the camera view processing submodule processes captured images using background/foreground differentiation, morphological operations and graph-based pruning, and detects foreground blobs representing mobile objects such as human objects, robots and the like. The camera view stores all detected and pruned blob tracks in the IBTF. Thus, the camera view processing submodule 1002A operates autonomously, without feedback from the network arbitrator 148, and acts as an autonomous sensor, which is an advantage in at least some embodiments. On the other hand, a disadvantage is that the camera view processing submodule 1002A does not benefit from the information of the birds-eye view processing submodule 1002B or network arbitrator 148.
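As a rough, illustrative sketch of how IBTF blob-track records could be organized, the Python structures below store per-frame bounding boxes and BBTPs together with creation/annihilation/fusion/fission events, plus the kind of spatial search the birds-eye view processing submodule would use when assembling an EBTF; all class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class BlobEvent:
    """A node in the timeline history graph: creation, annihilation, fusion or fission."""
    kind: str                                  # "creation" | "annihilation" | "fusion" | "fission"
    frame: int
    bbtp: Tuple[float, float]

@dataclass
class BlobTrack:
    """One edge of the pruned blob graph, as it might be stored in the IBTF."""
    track_id: int
    start: BlobEvent
    end: Optional[BlobEvent] = None
    bboxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # per-frame (x, y, w, h)
    bbtps: List[Tuple[float, float]] = field(default_factory=list)         # per-frame BBTP

class IBTF:
    """Internal blob track file: camera-view history only, no network-arbitrator input."""
    def __init__(self):
        self.tracks = {}

    def add(self, track: BlobTrack):
        self.tracks[track.track_id] = track

    def tracks_ending_near(self, point, radius):
        """Spatial search used when assembling an EBTF for a network-arbitrator request."""
        px, py = point
        return [t for t in self.tracks.values()
                if t.end is not None
                and (t.end.bbtp[0] - px) ** 2 + (t.end.bbtp[1] - py) ** 2 <= radius ** 2]
```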
The network arbitrator 148 tracks the tagged objects in a maximum likelihood sense, based on data from the camera view and tag sensors. Moreover, the network arbitrator 148 has detailed information of the site stored in the site map of the birds-eye view processing submodule 1002B. In the example of Fig. 25, the network arbitrator 148 puts together the initial conditions of the tagged object 112A entering the FOV of imaging device 104A, and requests the CV/BV processing module 1002 to track the object 112A. That is, the tracking request is sent from the network arbitrator 148 via the initial conditions.
The birds-eye view processing submodule 1002B parses the initial conditions and searches for data of object 112A in the IBTF to start tracking thereof in room 1022A. When the birds-eye view processing submodule 1002B finds a blob or a set of sub-blobs that matches the initial conditions, the birds-eye view processing submodule 1002B extracts the blob track data from the IBTF and places the extracted blob track data into an external blob track file (EBTF). An EBTF record is generated for each request from the network arbitrator 148. In the example of Fig. 25, there is only one EBTF record, as there is only one unambiguous object entering the FOV of imaging device 104A. However, if the birds-eye view processing submodule 1002B determines that ambiguities result from other blob tracks, those blob tracks can also be extracted into the EBTF.
In this embodiment, the system does not comprise any specific identifier to identify whether a mobile object is a human, a robot or another type of object, although in some alternative embodiments, the system may comprise such an identifier for facilitating object tracking.
The birds-eye view processing module 1002B processes the request from the network arbitrator 148 to track the blob identified in the initial conditions passed from the network arbitrator 148. The birds-eye view processing module 1002B also processes the IBTF with the initial conditions and the EBTF. The birds-eye view processing module 1002B computes the perspective transformation of the blob in the EBTF and determines the probability kernel of where the mobile object is. The birds-eye view processing module 1002B also applies constraints of the subarea such as room dimensions, locations of obstructions, walls and the like, and determines the probability of the object 112A exiting the room 1022A coincident with a blob annihilation event in the EBTF. The birds-eye view processing module 1002B divides the subarea into a 2D floor grid as described before, and calculates a 2D floor grid probability as a function of time, stored in an object track file (OTF).
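The sketch below is one hedged way such a 2D floor-grid probability could be computed per time instant: a Gaussian kernel is placed at the estimated floor position, cells occupied by walls or obstructions are zeroed out, and the grid is normalized. The cell size, kernel width and room dimensions are example values only.

```python
import numpy as np

class FloorGrid:
    """2D floor-grid probability for one subarea, updated per image frame and stored in the OTF."""
    def __init__(self, width_m, depth_m, cell_m=0.25):
        self.cell = cell_m
        self.nx = int(round(width_m / cell_m))
        self.ny = int(round(depth_m / cell_m))
        self.walkable = np.ones((self.ny, self.nx), dtype=bool)  # False where walls/obstacles are

    def probability(self, mean_xy, sigma_m):
        """Gaussian kernel around the estimated floor position, masked by room constraints."""
        xs = (np.arange(self.nx) + 0.5) * self.cell
        ys = (np.arange(self.ny) + 0.5) * self.cell
        gx, gy = np.meshgrid(xs, ys)
        d2 = (gx - mean_xy[0]) ** 2 + (gy - mean_xy[1]) ** 2
        p = np.exp(-0.5 * d2 / sigma_m ** 2)
        p[~self.walkable] = 0.0               # an object cannot be inside a wall or obstruction
        return p / p.sum()

grid = FloorGrid(width_m=8.0, depth_m=6.0)    # illustrative room size
grid.walkable[:, 0] = False                   # e.g. a wall along one side of the room
otf_entry = {"t": 12.3, "pdf": grid.probability(mean_xy=(3.0, 2.5), sigma_m=0.4)}
```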
The OTF is then made available to the network arbitrator 148. The data flow between the imaging device 104A, camera view processing submodule 1002A, IBTF 1030, birds-eye view processing submodule 1002B, the network arbitrator 148, EBTF 1032 and OTF 1034 is shown in Fig. 26.
The above-described process is an event-driven process and is updated in real time. For example, when the network arbitrator 148 requires an update, the birds-eye view processing submodule 1002B assembles a partial EBTF based on the accrued data in the IBTF, and provides an estimate of the location of the mobile object to the network arbitrator 148. The above-described processes can track mobile objects with a latency of a fraction of a second.
Referring back to Fig. 25, the camera view processing submodule 1002A detects and processes the blob 112A as the mobile object 112A moves in room 1022A from entrance 1024A to entrance 1024B. The birds-eye view processing module 1002B records the mobile object's trajectory 1028 in the birds-eye view.
In the example of Fig. 25, there are no competing blobs in the image frames captured by the imaging device 104A, and the image processing technology used by the system is sufficiently accurate to avoid blob fragmentation; the IBTF thus consists of only a creation event and an annihilation event joined by a single edge that spans one or more image frames (see Fig. 31, described later). Also, as the initial conditions from the network arbitrator 148 are unambiguous regarding the tagged object 112A, the IBTF has a single blob coincident with the initial conditions, meaning no ambiguity. The EBTF is therefore the same as the IBTF.
The birds-eye view processing submodule 1002B converts the blob in the camera view into a BV object in the birds-eye view, and calculates a floor grid probability based on the subarea constraint and the location of the imaging device (hence the computed H matrix, described later). The probability of the BV object location, or in other words, the mobile object location in the site, is updated as described before.
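As an illustration of the kind of perspective (H matrix) mapping involved, the OpenCV sketch below computes a homography from four floor reference points and maps a blob's BBTP into floor coordinates; the pixel and metre coordinates are made-up calibration values for the example, not values taken from the system.

```python
import cv2
import numpy as np

# Four reference points on the floor, in image pixels and in birds-eye (floor) metres.
image_pts = np.float32([[102, 540], [618, 548], [560, 260], [160, 252]])   # illustrative
floor_pts = np.float32([[0.0, 0.0], [6.0, 0.0], [6.0, 8.0], [0.0, 8.0]])   # illustrative

H = cv2.getPerspectiveTransform(image_pts, floor_pts)    # 3x3 homography ("H matrix")

def bbtp_to_floor(bbtp_px):
    """Map the bottom-centre of a bounding box (assumed on the floor plane) to floor coordinates."""
    p = np.float32([[bbtp_px]])                           # shape (1, 1, 2) as cv2 expects
    return cv2.perspectiveTransform(p, H)[0, 0]           # (x, y) in metres

print(bbtp_to_floor((360, 500)))   # approximate floor position of a detected blob
```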
The OTF comprises a summary description of the trajectory of each object location PDF as a function of time. The OTF is interpreted by the network arbitrator 148 and, in this example, registers no abnormalities or potential ambiguities. The OTF is used for generating the initial conditions for the subarea covered by the next adjoining imaging device FOV.
The example of Fig. 25 shows an ideal scenario in which there exist no ambiguities in the initial conditions from the network arbitrator 148, and there exist no ambiguities in the camera view blob tracking. Hence the blob/BV object/tag device association probability remains at 1 throughout the entire period that the mobile object 112A moves from entrance 1024A to entrance 1024B, until the mobile object exits from entrance 1024B.
When the mobile object disappears at entrance 1024B, the system may use the data of the mobile object 112 at the entrance 1024B, or the data of the mobile object 112 in room 1022A, for establishing a blob/BV object/tag device association for a new blob appearing in room 1022C at the entrance 1024B.
As another example, if a new blob appears in a subarea, e.g., a room, but not adjacent to any entrance or exit, the new blob may be a mobile object that was previously stationary for a long time but has now started to move. Thus, previous data of mobile objects in that room may be used as initial conditions for the new blob.

VIII-2. Initial conditions

By determining and using initial conditions for a new blob appearing in the FOV of an imaging device, the network arbitrator 148 is then able to solve ambiguities that may occur in mobile object tracking. Such ambiguities may arise in many situations, and may not be easily solvable without initial conditions.
Using Fig. 25 as an example, when the imaging device 104A captures a moving blob 112A in room 1022A, and the system detects a tag device in room 1022A, it may not be possible to readily associate the blob 112A with the tag device due to possible ambiguities. In fact, there exist several possibilities.


As shown in Fig. 27A, one possibility is that there is indeed only one tagged mobile object 112B in room 1022A moving from entrance 1024A to the exit 1024B. However, as shown in Fig. 27B, a second possibility is that an untagged mobile object 112C is moving in room 1022A from entrance 1024A to the exit 1024B, but there is also a stationary, tagged mobile object 112B in room 1022A outside the FOV of the imaging device 104A.
The possibility of Fig. 27B may be confirmed by requesting the tag device to provide motion-related observations. If the tag device reports no movement, then the detected blob 112A must be an untagged mobile object 112C in the FOV of the imaging device 104A, and there is also a tagged mobile object 112B in the room 1022A, likely outside the FOV of the imaging device 104A.
On the other hand, if the tag device reports movement, then Fig. 27B is untrue. However, the system may still be unable to confirm whether Fig. 27A is true, as there exists another possibility as shown in Fig. 27C.
As shown in Fig. 27C, there may be an untagged mobile object 112C in room 1022A moving from entrance 1024A to the exit 1024B, and a tagged mobile object 112B outside the FOV of the imaging device 104A and moving.

Referring back to Fig. 25, the ambiguity between Figs. 27A and 27C may be solved by using the initial conditions likely related to blob 112A that the system has previously determined in the adjacent room 1022B. For example, if the system determines that the initial conditions obtained in room 1022B indicate that, immediately before the appearance of blob 112A, an untagged mobile object disappeared from room 1022B at the entrance 1024A, the system can easily associate the new blob 112A with the untagged mobile object that has disappeared from room 1022B, and the tag device must be associated with a mobile object not detectable in the images captured by the imaging device 104A.
It is worth noting that there still exists another possibility: that a tagged mobile object 112B is moving in room 1022A from entrance 1024A to exit 1024B, and there is also a stationary, untagged mobile object 112C in room 1022A outside the FOV of the imaging device 104A. This possibility may be confirmed if previous data regarding an untagged mobile object is available; otherwise, the system would not be able to determine if there is any untagged mobile object undetectable from the image stream of the imaging device 104A, and simply ignores such possibilities.
Fig. 28 shows another example, in which a tagged mobile object 112B moves in room 1022 from the entrance 1024A on the left-hand side to the right-hand side towards the entrance 1024B, and an untagged object 112C moves in room 1022 from the entrance 1024B on the right-hand side to the left-hand side towards the entrance 1024A. The system knows that there is only one tag device in room 1022.
The imaging device 104 in room 1022 detects two blobs 112B and 112C, one of which has to be associated with the tag device. Both blobs 112B and 112C show walking motion with some turns.
Much information and many observations may be used to associate the tag device with one of the two blobs 112B and 112C. For example, the initial conditions may show that a tagged mobile object enters from the entrance 1024A on the left-hand side and an untagged mobile object enters from the entrance 1024B on the right-hand side, indicating that blob 112B shall be associated with the tag device, and blob 112C corresponds to an untagged object. The accelerometer/rate gyro of the IMU may provide observations showing periodic activity matching the pattern of the walking activity of the blob 112B, indicating the same as above. Further, short-term trajectory estimation based on IMU observations over time may be used to detect turns, which may then be compared with camera view detections to establish the above-described association. Moreover, if the room 1022 is also equipped with a wireless signal transmitter near the entrance 1024B on the right-hand side, and the tag device comprises a sensor for RSS measurement, the RSS measurement may also indicate an increasing RSS over time, indicating that the blob approaching the entrance 1024B is a tagged mobile object. With these examples, those skilled in the art appreciate that, during the movement of blobs 112B and 112C in the FOV of the imaging device 104, the system can obtain sufficient motion-related information and observations to determine, with high likelihood, that blobs 112B and 112C, respectively, are tagged and untagged mobile objects.
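One of these cues, the comparison of walking cadence, could look roughly like the Python sketch below, which compares the dominant frequency of the tag's accelerometer signal against the undulation frequency of the blob's bounding-box height; the sampling rates, tolerance and simulated signals are illustrative assumptions.

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Dominant frequency (Hz) of a zero-mean signal via its FFT magnitude."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]            # skip the DC bin

def same_cadence(accel_z, accel_fs, bbox_heights, video_fs, tol_hz=0.3):
    """True if the tag's step cadence matches the undulation of the blob's bounding box."""
    return abs(dominant_frequency(accel_z, accel_fs)
               - dominant_frequency(bbox_heights, video_fs)) < tol_hz

# Simulated 2 Hz walking cadence seen by both the tag IMU (50 Hz) and the camera (10 fps).
t_imu = np.arange(0, 5, 1 / 50.0)
t_cam = np.arange(0, 5, 1 / 10.0)
accel = 9.8 + 0.8 * np.sin(2 * np.pi * 2.0 * t_imu)      # vertical acceleration (m/s^2)
bbox_h = 180 + 4.0 * np.sin(2 * np.pi * 2.0 * t_cam)     # bounding-box height (pixels)
print(same_cadence(accel, 50.0, bbox_h, 10.0))           # True: blob and tag likely the same object
```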
In some embodiments, if the tag device is able to provide observations, e.g., IMU observations, with sufficient accuracy, a trajectory may be obtained and compared with the camera view detections to establish the above-described association. On the other hand, it may be difficult to obtain the trajectory with sufficient accuracy using captured images, due to the limited optical resolution of the imaging device 104 and the error introduced in mapping the blob in a captured image to the birds-eye view. In many practical scenarios, the images captured by an imaging device may only be used to reliably indicate which mobile object is in front of others.
By using relevant initial conditions, image streams captured by one or more imaging devices, and observations from tag devices, the system establishes blob/BV object/tag device associations, tracks tagged mobile objects in the site, and, if possible, tracks untagged mobile objects. An important target of the system is to track and record summary information regarding the locations and main activities of mobile objects, e.g., which subareas the mobile objects have been to and when. One may then conclude a descriptive scenario story such as "the tagged object #123 entered room #456 from port #3 at time t1 and exited port #5 at time t2. Its main activity was walking". The detailed trajectory of a mobile object and/or quantitative details of the trajectory may not be required in some alternative embodiments.
When ambiguity exists, the initial conditions from the network arbitrator 148 may not be sufficient to affirmatively establish a blob/BV object/tag device association. In other words, the probability of such a blob/BV object/tag device association is less than 1. In this situation, the birds-eye view processing submodule 1002B then starts extracting the EBTF from the IBTF immediately, and considers observations for object/tag device activity correlation.
For example, if the camera view processing module 1002A detects that a blob exhibits a constant velocity indicative of a human walking, the birds-eye view processing submodule 1002B then begins to fill the OTF with the information obtained by the camera view processing submodule 1002A, which is the information observed by the imaging device. The network arbitrator 148 analyzes the (partial) OTF and determines an opportunity for object/tag device activity correlation. Then, the network arbitrator 148 requests the tag device to provide observations such as accelerometer data, RSS measurements, magnetometer data and the like. The network arbitrator 148 also generates additional, processed data, such as a walking/stationary activity classification based on tag observations, e.g., the IMU output.
The tag observations and the processed data generated by the network arbitrator based on tag observations have been described above. Some of the observations are listed again below for illustration purposes:
• walking activity - a network arbitrator processed gesture;
• walking pace (compared to undulations of the camera-view bounding box);
• RSS multipath activity commensurate with the BV object velocity calculated based on the perspective mapping of a blob in the camera view to the birds-eye view;
• RSS longer-term change commensurate with the RSS map (i.e., a map of the site showing the RSS distribution therein);
• rate gyro activity indicative of walking; and
• magnetic field variations indicative of motion (no velocity may be estimated therefrom).
The network arbitrator 148 sends object activity data, which is the data describing object activities and may be tag observations or the above-described data generated by the network arbitrator 148 based on received tag observations, to the birds-eye view processing submodule 1002B.
The birds-eye view processing submodule 1002B then calculates numeric activity correlations between the object activity data and the camera view observations, e.g., data of blobs. The calculated numeric correlations are stored in the OTF, forming correlation metrics.
The network arbitrator 148 uses these correlation metrics and weights them to update the blob/BV object/tag device association probability. With sufficient camera view observations and tag observations, ambiguity can be resolved and the blob/BV object/tag device association may be confirmed with an association probability larger than a predefined probability threshold. Fig. 29 shows the relationship between the IBTF 1030, EBTF 1032, OTF 1034, Tag Observable File (TOF) 1036 for storing tag observations, the network arbitrator 148 and the tag devices 114.
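Purely for illustration, the sketch below shows one way weighted correlation metrics could be folded into an association probability that is then compared with a threshold; the weighting scheme, the prior blending and all numeric values are assumptions, not the specific computation described in this embodiment.

```python
def association_probability(correlations, weights, prior=0.5):
    """Combine weighted correlation metrics from the OTF into one association probability.

    correlations: dict of metric name -> correlation value in [-1, 1]
    weights:      dict of metric name -> non-negative weight (importance of that metric)
    """
    total_w = sum(weights.get(k, 0.0) for k in correlations)
    if total_w == 0.0:
        return prior
    score = sum(weights.get(k, 0.0) * correlations[k] for k in correlations) / total_w
    # Map the weighted score in [-1, 1] to a probability in [0, 1], blended with the prior.
    return 0.5 * prior + 0.5 * (score + 1.0) / 2.0

# Illustrative metrics for one candidate blob/BV object/tag device association.
metrics = {"walking_pace": 0.92, "rss_trend": 0.67, "rate_gyro": 0.81}
weights = {"walking_pace": 1.0, "rss_trend": 0.5, "rate_gyro": 0.8}
p = association_probability(metrics, weights, prior=0.7)
print(round(p, 3))   # about 0.81; compared against the predefined probability threshold
```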
With the above description, those skilled in the art appreciate that the camera view processing submodule 1002A processes image frames captured by the imaging devices to detect blobs and to determine the attributes of detected blobs.
The birds-eye view processing submodule 1002B does not directly communicate with the tag devices. Rather, the birds-eye view processing submodule 1002B calculates activity correlations based on the object activity data provided by the network arbitrator 148. The network arbitrator 148 checks the partial OTF and, based on the calculated activity correlations, determines if the BV object can be associated with a tag device.


Those skilled in the art also appreciate that the network arbitrator 148 has an overall connection diagram of the various subareas, i.e., the locations of the subareas and the connections therebetween, but does not have the details of each of the subareas. The details of the subareas are stored in the site map and, if available, the magnetometer map and the RSS map. These maps are fed to the birds-eye view processing submodule 1002B.
When relevant magnetometer and/or RSS data is available from the tag devices, the network arbitrator 148 can relay these data as tag observations (stored in the TOF 1036) to the birds-eye view processing submodule 1002B. As the birds-eye view processing submodule 1002B knows the probability of the tag device being in a specific location, it can update the magnetometer and/or RSS map accordingly.

Generally, the system can employ many types of information for tracking mobile objects, including the image streams captured by the imaging devices in the site, tag observations, and initial conditions regarding mobile objects appearing in the FOV of each imaging device. In some embodiments, the system may further exploit additional constraints for establishing blob/BV object/tag device associations and tracking mobile objects. Such additional constraints include, but are not limited to, realistic object motion constraints. For example, the velocity and acceleration of a mobile object relative to a floor space cannot realistically exceed certain predetermined limits. A justifiable assumption of no object occlusion in the birds-eye view may also be established. In some embodiments, there may exist a plurality of imaging devices with overlapping FOVs, e.g., monitoring a common subarea; the image streams captured by these imaging devices thus may be collectively processed to detect and track mobile objects with higher accuracy. The site contains barriers or constraints, e.g., walls, at known locations that mobile objects cannot realistically cross, and the site contains ports or entrances/exits at known locations allowing mobile objects to move from one subarea to another.
The above-described constraints may be more conveniently processed in the birds-eye view than in the camera view. Therefore, as shown in Fig. 30, the birds-eye view 1042 may be used as a hub for combining data obtained from one or more imaging devices or camera views 104, observations from one or more tag devices 114, and the constraints 1044, for establishing blob/BV object/tag device associations. Some data, such as camera view observations of imaging devices 104 and tag observations of tag devices 114, may be sent to the birds-eye view 1042 via intermediate components such as the camera view processing submodule 1002A and the network arbitrator 148, respectively. However, such intermediate components are omitted in Fig. 30 for ease of illustration.
With the information flow shown in Fig. 30, in a scenario where the initial conditions indicate a tagged mobile object 112B entering the entrance 1024A with steady walking activity, no ambiguity arises. The camera view information, i.e., the blob 112B, and the tag device observations can be corroborated with each other directly, without the aid of the additional constraints. In other words, the camera view produces a single blob of very high probability and with no issue of blob association from one image frame to another. A trajectory of the corresponding mobile object is determined and mapped into the birds-eye view as an almost deterministic path with small trajectory uncertainty. The CV/BV module checks the mapped trajectory to ensure its correctness (e.g., that the trajectory does not cross a wall). After determining the correctness of the trajectory, a BV object is assigned to the blob, and a blob/BV object/tag device association is then established.
As there is no issue with the correctness and uniqueness of the established association, the CV/BV module then informs the network arbitrator to establish the probability of the blob/BV object association. The network arbitrator checks the initial conditions likely related to the blob, and calculates the probability of the blob/BV object/tag device association. If the calculated association probability is sufficiently high, e.g., higher than a predefined probability threshold, then the network arbitrator does not request any further tag observations from tag devices.
If, however, the calculated association probability is not sufficiently high, then the network arbitrator requests observations from the tag device. As described before, the requested observations are those most suitable for increasing the association probability with minimum energy expenditure incurred by the tag device.
In this example, the requested tag observations may be those suitable for confirming walking activity consistent with camera view observations (e.g., walking activity observed from the blob 112B).
After receiving the tag observations, the received tag observations are sent to the CV/BV module for re-establishing the blob/BV object/tag device association. The association probability is also re-calculated and compared with the probability threshold to determine whether the re-established association is sufficiently reliable. This process may be repeated until a sufficiently high association probability is obtained.
Fig. 31 is a more detailed version of Fig. 30, showing the function of the network arbitrator 148 in the information flow. As shown, initial conditions 1046 are made available to the camera views 104, the birds-eye view 1042 and the network arbitrator 148. The network arbitrator 148 handles all communications with the tag devices 114 based on the need of associating the tag devices 114 with BV objects. Tag information and decisions made by the network arbitrator 148 are sent to the camera views 104 and the birds-eye view 1042. The main output 1048 of the network arbitrator 148 is the summary information regarding the locations and main activities of mobile objects, i.e., the scenario stories, which may be used as initial conditions for further mobile object detection and tracking, e.g., for detecting and tracking mobile objects entering an adjacent subarea. The summary information is updated every time an object exits a subarea.

VIII-3. Camera view processing

It is common in practice that a composite blob of a mobile object may comprise a plurality of sub-blobs as a cluster. The graph in the IBTF thus may comprise a plurality of sub-blobs. Many suitable image processing technologies, such as morphological operations, erosion, dilation, flood-fill, and the like, can be used to generate such a composite blob from a set of sub-blobs, which, on the other hand, implies that the structure of a blob is dependent on the image processing technology being used. While under ideal conditions a blob may be decomposed into individual sub-blobs, such decomposition is often practically impossible unless other information, such as clothing color, face detection and recognition, and the like, is available. Thus, in this embodiment, sub-blobs are generally considered hidden, with inference only from the uniform motion of the feature points and optical flow.
In some situations, the camera view processing submodule 1002A may not have sufficient information from the camera view to determine that a cluster of sub-blobs is indeed associated with one mobile object. As there is no feedback from the birds-eye view processing module 1002B to the camera view processing submodule 1002A, the camera view processing submodule 1002A cannot use the initial conditions to combine a cluster of sub-blobs into a blob.
The birds-eye view processing module 1002B, on the other hand, may use initial conditions to determine if a cluster of sub-blobs shall be associated with one BV object. For example, the birds-eye view processing module 1002B may determine that the creation time of the sub-blobs is coincident with the timestamp of the initial conditions. Also, the initial conditions may indicate a single mobile object appearing in the FOV of the imaging device. Thus, the probability that the sub-blobs in the captured image frame are associated with the same mobile object or BV object is one (1).
In some embodiments, a classification system is used for classifying different types of blobs, with a classification probability indicating the reliability of the blob classification. The different types of blobs include, but are not limited to, the blobs corresponding to:

• Blob type 1: single adult human object, diffuse lighting, no obstruction;
• Blob type 2: single adult human object, diffuse lighting, with obstruction;
• Blob type 3: single adult human object, non-diffuse lighting, no obstruction;
• Blob type 4: single adult human object, diffuse lighting, partial occlusion but recoverable;
• Blob type 5: two adult humans in one object, diffuse lighting, ambiguous occlusion; and
• Blob type 6: two adult humans in one object, specular lighting, ambiguous occlusion.
Other types of blobs, e.g., those corresponding to child objects, may also be defined. Each of the above types of blobs may be processed using different rules. In some embodiments, the classification system may further identify non-human objects, such as robots, carts, wheelchairs and the like, based on differentiating the shapes thereof.
Fig. 32A shows an example of a blob 1100 of the above-described type 3, i.e., a blob of a single adult human object under non-diffuse lighting and with no obstruction. The type 3 blob 1100 comprises three (3) sub-blobs or bloblets, including the head 1102, the torso 1104 and the shadow 1106. Fig. 32B illustrates the relationship between the type 3 blob 1100 and its sub-blobs 1102 to 1106.

With the classification system, the camera view processing submodule can then combine a cluster of sub-blobs into a blob, which facilitates the camera view pruning of the graph in the IBTF.
The camera view processing submodule 1002A sends classified sub-blobs and their classification probabilities to the birds-eye view processing module 1002B for facilitating mobile object tracking.
For example, the initial conditions from the network arbitrator 148 indicate a single human object, and the birds-eye view processing submodule 1002B, upon reading the initial conditions, expects a human object to appear in the FOV of the imaging device at an expected time (determined from the initial conditions).
At the expected time, the camera view processing submodule 1002A detects a cluster of sub-blobs appearing at an entrance of a subarea. With the classification system, the camera view processing submodule 1002A combines the cluster of sub-blobs into a blob and determines that the blob may be a human object with a classification probability of 0.9, which is higher than a predefined classification probability threshold. The birds-eye view processing submodule then determines that the camera view processing submodule 1002A has correctly combined the cluster of sub-blobs in the camera view as one blob, and that the blob shall be associated with the human object indicated by the initial conditions.
On the other hand, if, in the above example, the initial conditions indicate two human objects, the birds-eye view processing submodule 1002B then determines that the camera view processing submodule has incorrectly combined the cluster of sub-blobs into one blob.
The birds-eye view processing submodule 1002B records its determination regarding the correctness of the combined cluster of sub-blobs in the OTF.
When the camera view processing submodule 1002A combines the cluster of sub-blobs into one blob, it also stores information it derived about the blob in the IBTF. If the camera view processing submodule has incorrectly combined the cluster of sub-blobs into one blob, the derived information may also be wrong. To prevent the incorrect information from propagating to subsequent calculations and decision making, the birds-eye view processing submodule 1002B applies uncertainty metrics to the data in the OTF to allow the network arbitrator 148 to use the uncertainty metrics for weighting the data in the OTF in object tracking. With proper weighting, the data obtained by the network arbitrator 148 from other sources, e.g., tag observations, may reduce the impact of OTF data that has less certainty (i.e., that is more likely to be wrong), and reduce the likelihood that the wrong information in the OTF data propagates to subsequent calculations and decision making.
In an alternative embodiment, feedback is provided from the birds-eye view processing submodule 1002B to the camera view processing submodule 1002A to facilitate the combination of sub-blobs. For example, if the birds-eye view processing submodule 1002B determines from the initial conditions that there is only one mobile object appearing at an entrance, it feeds back this information to the camera view processing submodule 1002A, such that the camera view processing submodule 1002A can combine the cluster of sub-blobs appearing at the entrance as one blob, even if, from the CV perspective, the cluster of sub-blobs appearing in the camera view would more likely be projected to be two or more blobs.

Multiple blobs may also merge into one blob due to mobile objects overlapping one another in the FOV of the imaging device, and a previously merged blob may be separated when previously overlapped mobile objects are separated.
Before describing blob merging and separating (also called fusion and fission), it is noted that each blob detected in an image stream comprises two basic blob events, i.e., blob creation and annihilation. A blob creation event corresponds to a blob emerging in the FOV of an imaging device, such as from a side of the FOV of the imaging device, from an entrance, or from behind an obstruction in the FOV of the imaging device, and the like.
A blob annihilation event corresponds to a blob disappearing in the FOV of an imaging device, such as exiting from a side of the FOV of the imaging device (implying moving into an adjacent subarea or leaving the site), disappearing behind an obstruction in the FOV of the imaging device, and the like.
Fig. 33 shows a timeline history diagram of the life span of a blob. As shown, the life span of the blob comprises a creation event 1062, indicating the first appearance of the blob in the captured image stream, and an annihilation event 1064, indicating the disappearance of the blob from the captured image stream, connected by an edge 1063 representing the life of the blob. During the life span of the blob, the PDF of the BBTP of the blob is updated at discrete time instants 1066, and the BBTP PDF updates 1068 are passed to the birds-eye view for updating a Dynamic Bayesian Network (DBN) 1070. The BTF comprises all blobs observed and tracked prior to any blob/BV object/tag device association. All attributes of the blobs generated by the camera view processing submodule are stored in the BTF.
When the blob annihilation event occurs, it implies (block 1072) that the corresponding mobile object has exited the current subarea and entered an adjacent subarea (or left the site).
A blob event occurs instantaneously in an image frame, and may be represented as a node in a timeline history diagram. A blob transition from one event to another generally spans a plurality of image frames, and is represented as an edge in the timeline history diagram.
A blob may have more events. For example, a blob may have one or more fusion events, occurring when the blob is merged into another blob, and one or more fission events, occurring when two or more previously merged blobs are separated.
For example, Fig. 34 shows a timeline history diagram of the blobs of Fig. 28, which shows that blobs 1 and 2 are created (events 1062A and 1062B, respectively) at entrances 1024A and 1024B, respectively, to the room 1022 in the FOV of the imaging device 104. After a while, a fusion event 1082 of blobs 1 and 2 occurs, resulting in blob 3. A while later, blob 3 fissions into blobs 4 and 5 (fission event 1084). At the end of the timeline, blobs 4 and 5 are annihilated (annihilation events 1064A and 1064B, respectively) as they exit the FOV of the imaging device 104 through entrances 1024B and 1024A, respectively. The camera view processing submodule 1002A produces the blob-related event nodes and edges, including the position and attributes of the blobs generated in the edge frames, which are passed to a DBN. The DBN puts the most likely story together in the birds-eye view.
Fig. 35A shows an example of a type 6 blob 1110 corresponding to two persons standing close to each other. The blob 1110 comprises three sub-blobs, including two partially overlapping sub-blobs 1112 and 1114 corresponding to the two persons, and a shadow blob 1116. Fig. 35B illustrates the relationship between the type 6 blob 1110 and its sub-blobs 1112 to 1116. Similar to the example of Fig. 32A, the blob 1110 may be decomposed into individual sub-blobs of two human blobs and a shadow blob under ideal conditions.
The type 6 blob 1110 and other types of blobs, e.g., type 5 blobs, that are merged from individual blobs may be separated in a fission event. On the other hand, blobs of individual mobile objects may be merged into a merged blob, e.g., a type 5 or type 6 blob, in a fusion event. Generally, fusion and fission events may occur depending on the background, mobile object activities, occlusion, and the like.
Blob fusion and fission may cause ambiguity in object tracking.
Fig. 36A shows an example of such an ambiguity. As shown, two tagged objects 112B and 112C simultaneously enter the entrance 1024A of room 1022, move in the FOV of imaging device 104 across the room 1022, and exit from the entrance 1024B. As the mobile objects 112B and 112C are tagged objects, the initial conditions from the network arbitrator 148 indicate two objects entering room 1022.

On the other hand, the camera view processing submodule 1002A only detects one blob from the image frames captured by the imaging device 104. Therefore, ambiguity occurs.
As the ambiguity is not immediately resolvable when the mobile objects 112B and 112C enter the room 1022, the camera view processing submodule 1002A combines the detected cluster of sub-blobs into one blob.
If mobile objects 112B and 112C are moving in room 1022 at the same speed, then they still appear in the camera view as a single blob, and the ambiguity cannot be resolved. The IBTF then indicates a blob track graph that appears to be moving at a constant walking rate. Primitive blob tracking would not classify the blob as two humans. The birds-eye view processing submodule analyzes the IBTF based on the initial conditions, and maps the blob cluster graph from the IBTF to the EBTF. As the ambiguity cannot be resolved, the blob cluster is thus mapped as a single BV object and stored in the OTF. In this case, the network arbitrator 148 would not request any tag measurements, as the data in the OTF does not indicate any possibility of disambiguation, with only the initial conditions indicating ambiguity.
When the mobile objects 112B and 112C exit room 1022 into an adjacent, next subarea, the network arbitrator 148 assembles data thereof as initial conditions for passing to the next subarea. As will be described later, if the mobile objects 112B and 112C are separated in the next subarea, they may be successfully identified, and their traces in room 1022 may be "back-tracked". In other words, the system may delay ambiguity resolution until the identification of the mobile objects is successful.
If, however, the mobile objects 112B and 112C are moving in room 1022 at different speeds, the single blob eventually separates into two blobs.
The single blob is separated when the mobile object traces separate, wherein one trace extends ahead of the other. It is possible that there exists a transition period of separation, in which the single blob may be separated into more than two sub-blobs, which, together with the inaccuracy of the BBTP of the single blob, causes the camera view processing submodule 1002A to fail to group the sub-blobs into two blobs. However, such a transition period is temporary and can be omitted.
With the detection of two blobs, the IBTF now comprises three blob tracks, i.e., blob track 1 corresponding to the previous single blob, and blob tracks 2 and 3 corresponding to the current two blobs, as shown in the timeline history diagram of Fig. 36B.
The initial conditions indicate the two ambiguous objects 112B and 112C at the entrance 1024A of room 1022, and the birds-eye view processing submodule 1002B processes the IBTF to generate the floor view for blob tracks or edges 1, 2 and 3. Based on the graph and the floor grid, as blob tracks 2 and 3 start at a location in room 1022 in proximity to the end location of blob track 1, the birds-eye view processing submodule 1002B associates blob track 1 with blob track 2 to form a first blob track graph, and also associates blob track 1 with blob track 3 to form a second blob track graph, both associations being consistent with the initial conditions and having high likelihoods.
It is worth noting that, if one or both of blob tracks 2 and 3 started at a location in room 1022 far from the end location of blob track 1, the association of blob tracks 1 and 2 and that of blob tracks 1 and 3 would have low likelihood.
Returning to the example, with the information from the camera view processing submodule 1002A, the birds-eye view processing submodule 1002B determines walking activities associated with the first and second blob track graphs, which are compared with tag observations for resolving ambiguity.
The network arbitrator 148 requests the tag devices to report tag observations, e.g., the mobile object velocities, when the mobile objects 112B and 112C are in blob tracks 2 and 3, and uses the velocity observations for resolving ambiguity. The paces of the mobile objects may also be observed in the camera view and by the tag devices, and are used for resolving ambiguity. The obtained tag observations, such as velocities and paces, are stored in the OTF.
In some embodiments, the network arbitrator 148 may request tag devices to provide RSS measurements and/or magnetic field measurements. The obtained RSS and/or magnetic field measurements are sent to the birds-eye view processing submodule 1002B. As the birds-eye view processing submodule 1002B has the knowledge of the traces of mobile objects 112B and 112C, it can correlate the RSS and magnetic measurements with the RSS and magnetic maps, respectively. As the tagged objects are going through the same path with one behind the other, the RSS and/or magnetic correlations for the two objects 112B and 112C exhibit similar patterns with a delay therebetween. The ambiguity can then be resolved and the blobs can be correctly associated with their respective BV objects and tag devices.
The power spectrum of the RSS can also be used for resolving ambiguity. The RSS has a bandwidth roughly proportional to the velocity of the tag device (and thus the associated mobile object). As the velocity is accurately known from the camera view (calculated based on, e.g., optical flow and/or feature point tracking), the RSS spectral power bandwidths may be compared with the object velocity for resolving ambiguity.
As the mobile object moves, the magnetic field strength will fluctuate and the power spectral bandwidth will change. Thus, the magnetic field strength may also be used for resolving ambiguity in a similar manner. All of these correlations and discriminatory attributes are processed by the birds-eye view processing submodule 1002B and sent to the network arbitrator 148.
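A hedged sketch of this spectral check is given below: it computes a power-weighted spectral spread of the RSS samples and compares it with the bandwidth expected from the camera-view velocity. The calibration constant hz_per_mps, the tolerance and the simulated RSS trace are illustrative assumptions rather than system parameters.

```python
import numpy as np

def spectral_bandwidth(signal, fs):
    """Power-weighted spectral spread (Hz) of a zero-mean signal."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power[0] = 0.0                                        # ignore any residual DC component
    return float(np.sqrt(np.sum(power * freqs ** 2) / np.sum(power)))

def consistent_with_velocity(rss_samples, fs, cv_velocity_mps,
                             hz_per_mps=6.0, tolerance=0.5):
    """Check whether the RSS fading bandwidth matches the camera-view velocity.

    hz_per_mps is an assumed site-specific calibration constant relating fading
    bandwidth to walking speed; it would have to be measured during deployment.
    """
    bw = spectral_bandwidth(rss_samples, fs)
    expected = hz_per_mps * cv_velocity_mps
    return abs(bw - expected) <= tolerance * expected

# Tag moving at 1.2 m/s as measured from the camera view; RSS sampled at 20 Hz (simulated).
rng = np.random.default_rng(0)
rss = -70 + 3 * np.sin(2 * np.pi * 7.0 * np.arange(0, 10, 1 / 20.0)) + rng.normal(0, 0.5, 200)
print(consistent_with_velocity(rss, fs=20.0, cv_velocity_mps=1.2))
```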
As described above, the camera view processing submodule 1002A tries to combine sub-blobs that belong to the same mobile object by using background/foreground processing, morphological operations and/or other suitable image processing techniques. The blobs and/or sub-blobs are pruned, e.g., by eliminating some sub-blobs that likely do not belong to any blob, to facilitate blob detection and sub-blob combination. The camera view processing submodule also uses optical flow methods to combine a cluster of sub-blobs into one blob. However, sub-blobs may not be combined if there is potential ambiguity, and thus the BTF (IBTF and EBTF) may comprise multiple blob tracks for the same object.
Fig. 37A illustrates an example in which a blob 112B is detected by the imaging device 104 appearing at entrance 1024A of room 1022 and moving towards entrance 1024B along the path 1028, but splitting (fission) into two sub-blobs that move along slightly different paths and both exit the room 1022 from entrance 1024B.
In this example, three tracks are detected and included in the BTF, with one track from the entrance 1024A to the fission point, and two tracks from the fission point to entrance 1024B.
Initial conditions play an important role in this example in solving the ambiguity. If the initial conditions indicate two mobile objects appearing at entrance 1024A, the two tracks after the fission point are then associated with the two mobile objects.

However, if, in this example, the initial conditions indicate a single object appearing at entrance 1024A, then, as objects cannot be spontaneously created within the FOV of the imaging device 104, the birds-eye view processing submodule interprets the blob appearing at the entrance 1024A as a single mobile object.
The first blob track from the entrance 1024A to the fission point is analyzed in the BV frame. The bounding box size, which should correspond to the physical size of the object, is calculated and verified for plausibility. In this example, diffuse lighting is assumed for simplicity such that shadows are not an issue, and the processing of shadows is omitted, as shadows can be treated as described above.

Immediately after the fission point, there appear two bounding boxes (i.e., two CV objects or two FFCs). If the two bounding boxes are then moving at different velocities or along two paths significantly apart from each other, the two CV objects are then associated with two mobile objects. Tag observations may be used to determine which one of the two mobile objects is the tagged object. However, if the two CV objects are moving at substantially the same velocity along two paths close to each other, the ambiguity cannot be solved. In other words, the two CV objects may indeed be a single mobile object that appears as two CV objects due to the inaccuracy of the image processing, or the two CV objects may be two mobile objects that are close to each other and cannot be distinguished with sufficient confidence. The system thus considers them as one (tagged) mobile object. If, after the exit from the entrance 1024B, the system observes two significantly different movements, the above-described ambiguity that occurred in room 1022 can then be solved.
With the above examples, those skilled in the art appreciate that ambiguity in most situations can be resolved by using camera view observations and the initial conditions. If the initial conditions are affirmative, the ambiguity may be resolved with a probability of one (1). If, however, the initial conditions are probabilistic, the ambiguity is resolved with a probability less than one (1). The mobile object is then tracked with a probability less than one (1), conditioned on the probability of the initial conditions. For example, mobile object tracking may be associated with the following Bayesian probabilities:

Pr(blob tracks 1, 2 and 3 being associated) = Pr(initial conditions indicating one person),

where Pr(A) represents the probability that A is correct; or

Pr(blob tracks 2 and 3 being separately associated with blob track 1) = Pr(initial conditions indicating two persons).
During object tracking, a blob may change in size or shape, an example of which is shown in Fig. 37B.
In this example, there is a cart 1092 in room 1022 that has been stationary for a long time and has therefore become part of the background in the camera view. A tagged person 112B enters from the left entrance 1024A and moves across the room 1022 along the path 1028. Upon reaching the cart 1092, the person 112B pushes the cart 1092 to the right entrance 1024B and exits therefrom.
During tracking of the person 112B, the camera view processing submodule 1002A determines a bounding box for the person's blob, which, however, suddenly becomes much larger when the person 112B starts to push the cart 1092 therewith.
Accordingly, the information carried in the edge of the blob track graph is characterized by a sudden increase in the size of the blob bounding box, which causes a blob track abnormality in birds-eye view processing. A blob track abnormality may be considered a pseudo-event not detected in the camera view processing but rather in the subsequent birds-eye view processing.
In the example of Fig. 37B, the initial conditions indicate a single person entering entrance 1024A. Although the camera view processing indicates a single blob crossing the room 1022, the birds-eye view processing analyzes the bounding box of the blob and determines that the bounding box size of the blob at the first portion of the trace 1028 (between the entrance 1024A and the cart 1092) does not match that at the second portion of the trace 1028 (between the cart 1092 and the entrance 1024B). A blob track abnormality is then detected.
Without further information, the birds-eye view processing/network arbitrator can determine that the mobile object 112B is likely associated with an additional object that was previously part of the background in captured image frames.
The association of the person 112B and the cart 1092 can be further confirmed if the cart 1092 comprises a tag device that wakes up as it is being moved by the person (via the accelerometer measuring a sudden change). The tag device of the cart 1092 immediately registers itself with the network arbitrator 148, and then the network arbitrator 148 starts to locate this tag device. Due to the coincidence of the tag device waking up and the occurrence of the blob track abnormality, the network arbitrator 148 can determine that the mobile object 112B is now associated with the cart 1092 with a moderate level of probability.
Furthermore, the tag device of the cart 1092 can further detect that it is being translated in position (via magnetic field measurement, RSS measurement, accelerometer and rate gyro data indicating vibrations due to moving, and the like), and thus the cart can be associated with the mobile object 112B during the second portion of the trace 1028.
If feedback can be provided to the camera view processing submodule 1002A, the camera view processing submodule 1002A may analyze the background of captured images and compare the background in the images captured after the cart 1092 is pushed with that in the images captured before the cart 1092 is pushed. The difference can show that it is the cart 1092 that has moved.
Fig. 37C shows another example, in which a tagged person 112B enters from the left entrance 1024A and moves across the room 1022 along the path 1028. While moving, the person 112B sits down for a while at location 1094, and then stands up and walks out from entrance 1024B.

Accordingly, in the camera view, the person 112B appears as a moving blob from the entrance 1024A, where a new track of blob 112B is initiated. Periodic oscillation of the bounding box confirms that the object is walking. Then, the walking stops and the blob 112B becomes stationary (e.g., for a second). After that, the blob 112B remains stationary but its height shrinks. When the person stands up, the corresponding blob 112B increases to its previous height. After a short period, e.g., a second, the blob again exhibits walking motion (periodic undulations) and moves at a constant rate towards the entrance 1024B.
While in this embodiment the change of the height of the blob in Fig. 37C does not cause ambiguity, in some alternative embodiments the system may need to confirm the above-described camera observation using tag observations.
IMU tag observations, e.g., accelerometer and rate gyro outputs, exhibit a motion pattern consistent with the camera view observation. In particular, the tag observations reveal a walking motion, and then slight motion activity (when the person 112B is sitting down and when the person 112B is standing up). Then, the IMU tag observations again reveal a walking motion. Such a motion pattern can be used to confirm the camera view observation.
In some embodiments wherein the tag device comprises other sensors such as a barometer, the output of the barometer can detect the change in altitude between standing and sitting (unless the tag device is coupled to the person at an elevation close to the floor, or the tag device is carried in a handbag that is put on a table when the person 112B sits down). As the person 112B will usually sit down for at least several seconds or even much longer, the barometer output, while noisy, can be filtered with a time constant of, e.g., several seconds, to remove noise and detect an altitude change of, e.g., about half a meter. Thus, the barometer output can be used for detecting object elevation changes, such as a person sitting down, and for confirming the camera view observation.
RSS measurement can also be used for indicating that an object is stationary, by determining that the RSS measurement does not change in a previously detected manner or does not change at all. Note that the RSS measurement does not change when the tagged person is walking along an arc and maintaining a constant distance to the wireless signal transceiver. However, this rarely occurs, and even if it occurs, alternative tag observations can be used.
In the example of Fig. 37C, the site map may contain information regarding the location 1094, e.g., a chair pre-deployed and fixed at location 1094. Such information may also be used for confirming the camera view observation.
Fig. 37D shows yet another example. Similar to Fig. 37C, a tagged person 112B enters from the left entrance 1024A and moves across the room 1022 along the path 1028. Accordingly, in the camera view, the person 112B appears as a moving blob from the entrance 1024A, where a new track of blob 112B is initiated. Periodic oscillation of the bounding box confirms that the object is walking.
When the person 112B arrives at location 1094, the person 112B sits down. Unlike the situation of Fig. 37C, in Fig. 37D two untagged persons 112C and 112D are also sitting at location 1094 (not yet merged into the background). Therefore, the blob of person 112B merges with those of persons 112C and 112D.
After a short while, person 112B stands up and walks out from entrance 1024B. The camera view processing submodule detects the fission of the merged blob, and the birds-eye view processing submodule can successfully detect the movement of person 112B by combining camera view observations and tag observations.
However, if an untagged person, e.g., person 112C, also stands up and walks with person 112B, unresolvable ambiguity occurs as the system cannot detect the motion of the untagged person 112C. Only the motion of the tagged person 112B can be confirmed. This example shows the limitations in tracking untagged mobile objects.
Fig. 38 shows a table listing the object activities and the operations of the network arbitrator, camera view processing and tag devices that may be triggered by the corresponding object activities.


VIII-4. Tracking blobs in image frames

Tracking blobs in image frames may be straightforward in some situations, such as Fig. 27A, in which the association of the blob, the BV object and the tag device based on likelihood is obvious as there is only one mobile object 112B in the FOV of the imaging device 104A. During the movement of the mobile object 112B, each image frame captured by the imaging device 104A has a blob that is "matched" with the blob of the previous image frame, with only a slight position displacement. As in this scenario blobs cannot spontaneously appear or disappear, the only likely explanation of such a matched blob is that the blobs in the two frames are associated, i.e., represent the same mobile object, with a probability of 1.
However, in many practical scenarios, some blobs in consecutive frames may be relatively displaced by a large amount, or may be significantly different in character. As described earlier, blobs typically are not a clean single blob outlining the mobile object. Due to ambiguities in distinguishing foreground from background, image processing techniques such as background differencing, binary image mapping and morphological operations may typically result in more than one sub-blob. Moreover, sub-blobs are dependent on the background, i.e., the sub-blob region becomes modulated by the background. Therefore, while a mobile object cannot suddenly disappear or appear, the corresponding blob can blend ambiguously with the background, disappear capriciously, and subsequently appear again.


A practical approach for handling blobs is to "divide and conquer". More particularly, the sub-blobs are tracked individually and associated to a blob cluster if some predefined criteria are met. Often, sub-blobs originate from a fission process. After a few image frames, the sub-blobs undergo a fusion process and become one blob. When the system determines such fission-fusion, the sub-blobs involved are combined as one blob. Test results show that, by considering the structure of the graph of the sub-blobs, this approach is effective in combining sub-blobs.
Some image processing techniques such as the binary and morphological operations may destroy much of the information regarding a blob. Therefore, an alternative is to calculate the optical flow from one image frame to the next. The blob associated with a moving object exhibits a nonzero optical flow while the background has a zero flow. However, this requires the imaging device to be stationary and constant, without zooming or panning. Also, the frame rate must be sufficiently high such that the object motion during the frame interval is small compared to the typical feature length of the object. A drawback of the optical flow approach is that, when a human is walking, the captured images show that parts of the human are stationary while other parts are moving. Swinging arms can even exhibit an optical flow in the opposite direction.
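The sketch below shows one common way of separating moving pixels from a static background with dense optical flow. It uses OpenCV's Farneback implementation purely as an illustration; the parameter values and the one-pixel speed threshold are assumptions and are not taken from this disclosure.

import cv2
import numpy as np

def moving_mask_from_flow(prev_gray, curr_gray, min_speed_px=1.0):
    """Label pixels whose optical-flow magnitude exceeds a small threshold.

    Assumes a stationary camera and two consecutive grayscale frames.
    """
    # dense Farneback flow: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    mask = (magnitude > min_speed_px).astype(np.uint8) * 255
    return mask, flow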
Although initial conditions may reveal that the object is a walking human, and may allow determination of parts of the human based on the optical flow, such algorithms are complex and may not be robust. An alternative method is to use feature point tracking, i.e., to track feature points, e.g., corners of a blob. Depending on the contrast of the human's clothing against the background, suitable feature points can be found and used.
Another alternative method is to determine the boundary of the object, which may be applied to a binary image after morphological operations. To avoid merely obtaining boundaries around sub-blobs, snakes or active contours based on a mixture of penalty terms may be used to generate the outline of the human, from which the legs, arms and head can be identified. As the active contour has to be placed about the desired blob, the system avoids forming too large a blob with limited convergence and errors in background/foreground separation that may result in capricious active contours.
Other suitable, advanced algorithms may alternatively be used to track the sub-blob of a person's head, and attempt to place a smaller bounding box about each detected head sub-blob. After determining the bounding box of a head, and knowing that the human object is walking or standing, the nominal distance from the head to the ground is thus approximately known. Then the BBTP of the blob can be determined. A drawback of this algorithm is that it may not work well if the human face is not exposed to the imaging device. Of course, this algorithm will fail if the mobile object is not a human.
In this embodiment, the VAILS uses the standard method of morphological operations on a binary image after background differencing. This method is generally fast and robust even though it may omit much of the blob information. This method is further combined with a method of determining the graph of all of the related sub-blobs for combining same. When ambiguities arise, the blob or sub-blob track, e.g., the trajectory being recorded, is terminated, and, if needed, a new track may be started and maintained after becoming stable. Then the birds-eye view processing connects the two tracks to obtain the most likely mobile object trajectory.
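One plausible form of the background differencing and morphological pipeline referred to above is sketched below; the threshold, kernel size and minimum sub-blob area are illustrative values, and OpenCV is used only as a convenient implementation.

import cv2
import numpy as np

def extract_sub_blobs(frame_gray, background_gray, diff_threshold=25,
                      kernel_size=5, min_area=50):
    """Background differencing, binary thresholding and morphological clean-up."""
    diff = cv2.absdiff(frame_gray, background_gray)
    _, binary = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # remove speckle
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # bridge small gaps
    # connected components give the candidate sub-blobs and their bounding boxes
    count, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    sub_blobs = []
    for i in range(1, count):                # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            sub_blobs.append({"bbox": (x, y, w, h), "centroid": tuple(centroids[i])})
    return binary, sub_blobs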
In forming the blob tracks, it is important to note that the system has to maximize the likelihood of association. For example, Figs. 39A and 39B show two consecutive image frames 1122A and 1122B, having detected blobs 1124A and 1124B, respectively. Assuming that the system does not know any information about the mobile object(s) corresponding to the blobs 1124A and 1124B, to determine whether or not the blobs 1124A and 1124B correspond to the same mobile object, the system uses a likelihood overlap integral method. With this method, the system correlates the two blobs 1124A and 1124B in the consecutive frames 1122A and 1122B to determine an association likelihood. In particular, the system incrementally displaces the blob 1124A in the first frame 1122A, and correlates the displaced blob 1124A with the blob 1124B in the second frame 1122B until a maximum correlation or "match" is obtained. The correlation is essentially a normalized overlap integral (see Fig. 39C) in which the equivalence of the correlation coefficient emerges.

The system determines a measurement of the likelihood based on the numerical calculation of the cross-correlation coefficient at the location of the maximum blob correlation. Practically, the calculated cross-correlation coefficient is a positive number smaller than or equal to one (1).

In calculating the maximum correlation of the two blobs 1124A and 1124B, the system actually treats the blobs as spatial random processes, as the system does not know any information about the mobile object(s) corresponding to the blobs 1124A and 1124B. A numerical calculation of correlation is thus used in this embodiment for determining the maximum correlation. In this embodiment, images 1122A and 1122B are binary images, and the blob correlation is calculated using data of these binary images. Alternatively, images 1122A and 1122B may be color images, and the system may calculate blob correlation using data of each color channel of the images 1122A and 1122B (each color channel thus being considered an independent random field).
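A small numerical sketch of this likelihood overlap integral is given below: the first blob is incrementally displaced and the normalized overlap with the second blob is maximized. The search range is an assumption, and the wrap-around behaviour of np.roll at the patch edges is ignored for simplicity.

import numpy as np

def max_overlap_correlation(blob_a, blob_b, max_shift=10):
    """Return the best normalized overlap integral between two binary patches
    of identical shape, together with the displacement (dx, dy) achieving it."""
    a = np.asarray(blob_a, dtype=float)
    b = np.asarray(blob_b, dtype=float)
    best_coeff, best_shift = -1.0, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(a, dy, axis=0), dx, axis=1)
            denom = np.sqrt((shifted ** 2).sum() * (b ** 2).sum())
            if denom == 0:
                continue
            coeff = (shifted * b).sum() / denom   # normalized overlap integral
            if coeff > best_coeff:
                best_coeff, best_shift = coeff, (dx, dy)
    return best_coeff, best_shift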
In another embodiment, the system may correlate derived attributes of the blobs, e.g., feature points. In particular, the system first uses the well-known Lucas-Kanade method to establish the association of the feature points, and then establishes the object correlation from frame to frame.
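A brief sketch of such feature point association is shown below, using OpenCV's corner detector and pyramidal Lucas-Kanade tracker as one common implementation; all parameter values are illustrative defaults rather than values from this disclosure.

import cv2
import numpy as np

def track_feature_points(prev_gray, curr_gray, mask=None):
    """Detect corners in the previous frame and follow them into the current frame."""
    points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                     qualityLevel=0.01, minDistance=7, mask=mask)
    if points is None:
        return np.empty((0, 2)), np.empty((0, 2))
    next_points, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, points, None, winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1                  # keep only successfully tracked points
    return points[ok].reshape(-1, 2), next_points[ok].reshape(-1, 2)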
The above-described methods are somewhat heuristic, guided by the notion of correlation of random signals but after modification and selection of the signal (i.e., blob content) in heuristic ways. Each of the methods has its own limitations, and a system designer selects a method suitable for meeting the design goals.
The above-described likelihood overlap integral method as illustrated in Figs. 39A to 39C has an implied assumption that the blob is time invariant, or at least changes slowly with time. While this assumption is generally practical, in some situations where the blob is finely textured, the changes in the blob can be large in every frame interval, and the method may fail. For example, if the object is a human with finely pitched checkered clothing, then a direct correlation over the typical frame interval of milliseconds will result in a relatively small overlap integral. A solution is for the system to pre-process the textured blob with a low-pass spatial filter, or even a conversion to binary with morphological steps, such that the overlap integral will be more invariant. However, as the system does not know ahead of time what texture or persistence the blob has, there is a trade-off in blob preprocessing before establishing the correlation or overlap integral.
While difficulties and drawbacks exist, a system designer can still choose a suitable method such that some correlation can be determined over some vector of object attributes. The outcome of the correlation provides a quantitative measure of the association, and also provides a measure of how the attributes change from one frame to the next. An obvious example in correlating the binary image is the basic incremental displacement of the blob centroid. If color channels are used, then the system can additionally track the hue of the object color, which varies as the lighting changes with time. The change in displacement is directly useful. After obtaining, together with the correlation, a measurement of how much the mobile object has moved, the system can then determine how reliable the measurement is, and use this measurement with the numerical correlation to determine a measurement of the association likelihood.
If the camera view processing submodule does not have any knowledge of the blob motion from frame to frame, an appropriate motion model may simply be a first-order Markov process. Then, blobs that have small displacements between frames would have a higher likelihood factor, and whether the blob completely changes direction from frame to frame is irrelevant. On the other hand, if initial conditions indicate that the mobile object is a human walking steadily perpendicular to the axis of the imaging device, then the system can exploit incremental displacement in a specific direction. Moreover, if the mobile object velocity is limited and does not vary instantaneously, a second-order Markov model can be used, which tracks the mobile object velocity as a state variable. Such a second-order Markov model is useful in blob tracking through regions in which the blob is corrupted by, e.g., background clutter. A Kalman filter may be used in this situation.
The birds-eye view processing (described later) benefits from the blob velocity estimate. The system passes the BBTP and the estimate of velocity from the camera view to the birds-eye view.
The system resolves potential ambiguity of blobs to obtain the most likely BV object trajectory in the birds-eye view. The system considers the initial conditions as having high reliability. Consequently, in an image frame such as the image frame 1130 of Fig. 40, potential ambiguity can be readily resolved as each car 1132, 1134 has its own trajectory. More particularly, ambiguity is resolved based on the Euclidean distance of the differential displacement and, if needed, based on the tracking of the car velocities, as the car trajectories are smooth.

A problem in using the likelihood overlap integral method that the system has to deal with is that some attributes, e.g., size, orientation and color mix, of blobs in consecutive frames may not be constant, causing the overlap or correlation integral to degrade. The system deals with this problem by allowing these attributes to change within a predefined or adaptively determined range to tolerate correlation integral degradation.
In some embodiments, tolerating correlation integral degradation is acceptable if the variation of the blob attributes is small. In some alternative embodiments, the system correlates the binary images of the blobs that have been treated with a sequence of morphological operations to minimize the variation caused by changes in blob attributes.
Other methods are also readily available. For example, in some embodiments, the system does not use background differencing for extracting foreground blobs. Rather, the system purposely blurs captured images and then uses optical flow technology to obtain blob flow relative to the background. Optical flow technology, in particular, works well for the interior of the foreground blob that is not modulated by the variation of the clutter in the background. In some alternative embodiments, feature point tracking is used for tracking objects with determined feature points.
The above-described methods, including the likelihood overlap integral method (calculating blob correlation), optical flow and feature point tracking, allow the system to estimate the displacement increment over one image frame interval. In practical use, mobile objects are generally moving slowly, and the imaging devices have a sufficiently high frame rate. Therefore, a smaller displacement increment in calculating blob correlations gives rise to higher reliability in resolving ambiguity. Moreover, the system in some embodiments can infer a measurement of the blob velocity, and track the blob velocity as a state variable of a higher-order Markov process of random walk, driven by white (i.e., Gaussian) acceleration components. For example, a Kalman filter can be used for tracking the blob velocity, as most mobile objects inevitably have some inertia and thus the displacement increments are correlated from frame to frame. Such a statistical-model-based estimation method is also useful in tracking mobile objects that are temporarily occluded and produce no camera view observation.
Generally, blob tracking may be significantly simplified if some information of the mobile object being tracked can be omitted. One of the simplest blob tracking methods, with the most mobile object information omitted, is the method of tracking blobs using binary differenced, morphologically processed images. If more details of the mobile objects are desired, more or all attributes of the mobile objects and their corresponding blobs have to be retained and used with deliberate modelling.

VIII-5. Interrupted blob trajectories

Mobile objects may be occluded by obstructions in a subarea, causing fragments of the trajectory of the corresponding blob. Figs. 41A and 41B show an example. As shown, a room 1142 is equipped with an imaging device 104, and has an obstruction 1150 in the FOV of the imaging device 104. A mobile object 112 is moving in the room 1142 from entrance 1144A towards entrance 1144B along a path 1148. A portion of the path 1148 is occluded by the obstruction 1150.

With the initial conditions of the mobile object 112 at the entrance 1144A, the system tracks the object's trajectory (coinciding with the path 1148) until the mobile object is occluded by the obstruction 1150, at which moment the blob corresponding to the mobile object 112 disappears from the images captured by the imaging device 104, and the mobile object tracking is interrupted.
When the mobile object 112 comes out from behind the obstruction 1150 and re-appears in the captured images, the mobile object tracking is resumed. As a consequence, the system records two trajectory segments in the blob-track file.
The system then maps the two trajectory segments into the birds-eye view, and uses a statistical-model-based estimation and, if needed, tag observations to determine whether the two trajectory segments shall be connected. As the obstruction is clearly defined in the site map, processing the two trajectory segments in the birds-eye view is easier and more straightforward. As shown in Fig. 41B, the two trajectory segments or blob tracks are stored in the blob-track file as a graph of events and edges.
Fig. 42 is the timeline history diagram of Fig. 41A, showing how the two trajectory segments are connected. As shown, when blob 1 (the blob observed before the mobile object 112 is occluded by the obstruction 1150) is annihilated and blob 2 (the blob observed after the mobile object 112 came out from behind the obstruction 1150) is created, the system determines whether or not blobs 1 and 2 shall be associated by calculating an expected region of blob re-emergence, and checking if blob 2 appears in the expected region. If blob 2 appears in the expected region, the system then associates blobs 1 and 2, and connects the two trajectory segments.
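A simple way to express such an expected re-emergence region is sketched below: the last known position is advanced by the last known velocity, and the acceptance radius grows with the occlusion duration. The base radius and growth rate are illustrative assumptions rather than values from this disclosure.

import numpy as np

def in_expected_reemergence_region(last_position, last_velocity,
                                   occlusion_time_s, candidate_position,
                                   base_radius_m=0.5, growth_mps=1.0):
    """Check whether a newly created blob lies inside the region where the
    occluded blob is expected to re-emerge."""
    predicted = np.asarray(last_position, float) + \
        np.asarray(last_velocity, float) * occlusion_time_s
    radius = base_radius_m + growth_mps * occlusion_time_s   # grows with occlusion time
    distance = np.linalg.norm(np.asarray(candidate_position, float) - predicted)
    return distance <= radius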
In determining whether or not blobs 1 and 2 shall be associated, the system, if needed, may also request tag device(s) to provide tag observations for resolving ambiguity. For example, Fig. 43 shows an alternative possibility that may give rise to the same camera view observations. The system can correctly decide between Figs. 41A and 43 by using tag observations.

VIII-6. Birds-eye view processing

In the VAILS, a blob in a camera view is mapped into the birds-eye view for establishing the blob/BV object/tag device association. The BBTP is used for mapping the blob into the birds-eye view. However, the uncertainty of the BBTP impacts the mapping.
As described above, the BBTP, or bounding box track point, of a blob is a point in the captured images that the system estimates as the point at which the object contacts the floor surface. Due to errors introduced in calculation, the calculated BBTP is inaccurate, and the system thus determines an ambiguity region, or a probability region, associated with the BBTP for describing the PDF of the BBTP location distribution. In the ideal case where the BBTP position has no uncertainty, the ambiguity region is reduced to a point.
Fig. 44 shows an example of a blob 1100 with a BBTP ambiguity region 1162 determined by the system. The ambiguity region 1162 in this embodiment is determined as a polygon in the camera view with a uniformly distributed BBTP position probability therewithin. Therefore, the ambiguity region may be expressed as an array of N vertices.
The vertex array of the ambiguity region is mapped into the birds-eye view floor space using the above-described perspective mapping. As the system only needs to calculate the mapping of the vertices, mapping such a polygonal ambiguity region can be done efficiently, resulting in an N-point polygon in the birds-eye view.
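A compact sketch of mapping the N polygon vertices onto the floor plane is given below. It assumes a 3x3 perspective (homography) matrix H from image coordinates to birds-eye floor coordinates has been obtained elsewhere from the camera calibration; only the vertices are transformed, as noted above.

import numpy as np

def map_polygon_to_birds_eye(vertices_px, H):
    """Map the N vertices of a camera-view ambiguity polygon onto the floor plane."""
    pts = np.asarray(vertices_px, dtype=float)            # shape (N, 2), pixel coordinates
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ np.asarray(H, dtype=float).T   # homogeneous floor coordinates
    return mapped[:, :2] / mapped[:, 2:3]                 # re-normalize to (N, 2) floor points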
Figs. 45A and 45B show a BBTP 1172 in the camera view and mapped into the birds-eye view, respectively, wherein the dash-dot line 1174 in Fig. 45B represents the room perimeter.
Figs. 46A and 46B show an example of an ambiguity region of a BBTP identified in the camera view and mapped into the birds-eye view, respectively. In this example, the imaging device is located at the corner of a 3D coordinate system at xW = 0 and yW = 0 with a height of zW = 12 m. The imaging device has an azimuth rotation of azrot = pi/4 and a down tilt angle of downtilt = pi/3. For example, the object monitored by the imaging device could have a height of z0 = 5 m. The ambiguity mapped into the BV, based on the outline contour of the blob, results from the 3D box object. The slight displacement shown is a result of the single erosion step taken of the blob. One would decompose/analyze the blob to get a smaller BBTP polygon uncertainty region.
The PDF of the BBTP location is used for the Bayesian update. In this embodiment, the PDF of the BBTP location is uniformly distributed within the ambiguity region, and is zero (0) outside the ambiguity region. Alternatively, the PDF of the BBTP location may be defined as Gaussian or another suitable distribution to take into account random factors such as the camera orientation, lens distortion and other random factors. These random factors may also be mapped into the birds-eye view as a Gaussian process by determining the mean and covariance matrix thereof.

In this embodiment, the VAILS uses a statistical-model-based estimation method to track the BBTP of a BV object. The statistical-model-based estimation, such as a Bayesian estimation, used in this embodiment is similar to that described above. The Bayesian object prediction is a prediction of the movement of the BBTP of a BV object for the next frame time (i.e., the time instant at which the next image frame is to be captured) based on information of the current and historical image frames as well as available tag observations. The Bayesian object prediction works well even if nothing is known regarding the motion of the mobile object (except the positions of the blob in captured images). However, if a measurement of the object's velocity is available, the Bayesian object prediction may use the object's velocity in predicting the movement of the BBTP of a BV object. The object's velocity may be estimated by the blob Kalman filter tracking of the velocity state variable, based on the optical flow and feature point motion of the camera view bounding box. Other mobile object attributes may also be used, such as inertia, maximum speed, object behavior (e.g., a child likely behaving differently than an attendant pushing someone in a wheelchair), and the like. As described above, after object prediction, the blob/BV object/tag device association is established, and the prediction result is fed back to the computer vision process. The details of the birds-eye view Bayesian processing are described later.

VIII-7. Updating posterior probability of object location

Updating the posterior probability of the object location is based on the blob track table in the computer cloud 108, and is conducted after the blob/BV object/tag device association is established. The posterior object location PDF is obtained by multiplying the current object location PDF by the blurred polygon camera view observation PDF. Other observations such as tag observations and RSS measurements may also be used for updating the posterior probability of the object location.
VIII-8. Association table update

The blob/BV object/tag device association is important to mobile object tracking. An established blob/BV object/tag device association is the association of a tagged mobile object with a set of blobs through the timeline or history. Based on such an association, the approximate BV object location can be estimated based on the mean of the posterior PDF. The system records the sequential activities of the tagged mobile object, e.g., "entered door X of room Y at time T, walked through the central part of the room and left at time T2 through entrance Z". Established blob/BV object/tag device associations are stored in an association table. The update of the association table and the Bayesian object prediction update are conducted in parallel and are co-dependent. In one alternative embodiment, the system may establish multiple blob/BV object/tag device associations as candidate associations for a mobile object, track the candidate associations, and eventually select the most likely one as the true blob/BV object/tag device association for the mobile object.

VIII-9. DBN update

The VAILS in this embodiment uses a dynamic Bayesian network (DBN) for calculating and predicting the locations of BV objects. Initially, the camera view processing submodule operates independently to generate a blob-track file. The DBN then starts with this blob-track file, transforms the blobs therein into BV objects and tracks the trajectory probability. The blob-track file contains the sequence of likelihood metrics based on the blob correlation coefficient.

As described before, each blob/BV object/tag device association is associated with an association probability. If the association probability is smaller than a predefined threshold, object tracking is interrupted. To prevent object tracking interruption due to a temporarily lowered association probability, a state machine with suitable intermediate states may be used to allow an association probability to temporarily drop for a short period of time, e.g., for several frames, and then increase back above the predefined threshold.
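One possible form of such a state machine is sketched below; the TRACKING/COASTING/LOST state names, the probability threshold and the allowed number of low-probability frames are illustrative assumptions.

class AssociationStateMachine:
    """Tolerate a briefly lowered association probability before interrupting tracking."""

    def __init__(self, threshold=0.5, max_low_frames=5):
        self.threshold = threshold
        self.max_low_frames = max_low_frames
        self.low_frames = 0
        self.state = "TRACKING"

    def step(self, association_probability):
        if self.state == "LOST":
            return self.state
        if association_probability >= self.threshold:
            self.low_frames = 0
            self.state = "TRACKING"
        else:
            self.low_frames += 1
            # coast for a few frames before declaring the track lost
            self.state = "COASTING" if self.low_frames <= self.max_low_frames else "LOST"
        return self.state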
Fig. 47 shows a simulation configuration having an imaging device 104 and an obstruction 1202 in the FOV of the imaging device 104. A mobile object moves along the path 1204. Fig. 48 shows the results of the DBN prediction.

Tracking of a first mobile object may be interrupted when the first mobile object is occluded by an obstruction in the FOV. During the occlusion period, the probability diffuses outward. Mobile object tracking may be resumed after the first mobile object comes out from behind the obstruction and re-appears in the FOV.
However, if there is an interfering source, such as a second mobile object also emerging from a possible location at which the first mobile object may re-appear, the tracking of the first mobile object may mistakenly resume as tracking of the second mobile object. Such a problem is due to the fact that, during occlusion, the probability flow essentially stops and then diffuses outward, becoming weak when tracking is resumed. Fig. 49 shows the prediction likelihood over time in tracking the mobile object of Fig. 47. As shown, the prediction likelihood drops to zero during occlusion, and only restores to a low level after tracking is resumed.
If velocity feedback is available, it may be used to improve the prediction. Fig. 50 shows the results of the DBN prediction in tracking the mobile object of Fig. 47. The prediction likelihood is shown in Fig. 51, wherein the circles indicate that camera view observations are made, i.e., images are captured, at the corresponding time instants. As can be seen, after using velocity feedback in the DBN prediction, the likelihood after resuming tracking exhibits only a small drop. On the other hand, if the prediction likelihood after resuming tracking drops significantly below a predefined threshold, a new tracking is started.
Figs. 52A to 52C show another example of a simulation configuration, the simulated prediction likelihood without velocity feedback, and the simulated prediction likelihood with velocity feedback, respectively.
To determine whether it is the same object when the blob re-emerges, or a different object, the system calculates the probabilities of the following two possibilities:

A - assuming the same object: considering the drop in association likelihood and considering querying the tag device to determine if a common tag device corresponds to both blobs.

B - assuming different objects: what is the likelihood that a new object can be spontaneously generated at the start location of the trajectory after the tracking is resumed? What is the likelihood that the original object vanished?

The blob-track table stores multiple tracks, and the DBN selects the most likely one.
Fig. 53A shows a simulation configuration for simulating the tracking of a first mobile object (not shown) with an interference object 1212 near the trajectory 1214 of the first mobile object and an obstruction 1216 between the imaging device 104 and the trajectory 1214. The camera view processing submodule produces a bounding box around each of the first (moving) object and the stationary interference object 1212, and the likelihoods of the two bounding boxes are processed.

The obstruction 1216 limits the camera view measurements, and the nearby stationary interference 1212 appears attractive as the belief will be spread out when the obstruction ends. The likelihood is calculated based on the overlap integration shown in Fig. 42. The calculated likelihood is shown in Fig. 53B.
At first, the likelihood of the first object builds up quickly but then starts dropping as the camera view measurements stop due to the obstruction. However, the velocity is known and therefore the likelihood of the first object does not decay rapidly. Then the camera view observations resume after the obstruction, and the likelihood of the first object jumps back up.
Figs. 54A and 54B show another simulation example.

VIII-10. Network arbitrator

Consider the simple scenario of Fig. 25. The initial conditions originate from the network arbitrator, which evaluates the most likely trajectory of the mobile object 112A as it goes through the site covered by multiple imaging devices 104B, 104A and 104C. The network arbitrator attempts to output the most likely trajectory of the mobile object from the time the mobile object enters the site to the time the mobile object exits the site, which may last for hours. The mobile object moves from the FOV of one imaging device to that of the next. As the mobile object enters the FOV of an imaging device, the network arbitrator collects initial conditions relevant to the CV/BV processing module and sends the collected initial conditions thereto. The CV/BV processing module is then responsible for object tracking. When the mobile object leaves the FOV of the current imaging device, the network arbitrator again collects relevant initial conditions for the next imaging device and sends them to the CV/BV processing module. This procedure repeats until the mobile object eventually leaves the site.
In the simple scenario of Fig. 25, the object trajectory is simple and unambiguous such that the object's tag device does not have to be queried. However, if an ambiguity regarding the trajectory or regarding the blob/BV object/tag device association arises, then the tag device will be queried. In other words, if the object trajectory seems dubious or confused with another tag device, the network arbitrator handles requests for tag observations to resolve the ambiguity. The network arbitrator has the objective of minimizing the energy consumed by the tag device subject to the constraint of an acceptable likelihood of the overall estimated object trajectory.
The network arbitrator determines the likely trajectory based on a conditional Bayesian probability graph, which may have high computational complexity.
Fig. 55 shows the initial condition flow and the output of the network arbitrator. As shown, the initial conditions come from the network arbitrator and are used in the camera view to acquire and track the incoming mobile object as a blob. The blob trajectory is stored in the blob-track file and is passed to the birds-eye view. The birds-eye view performs a perspective transformation of the blob track and a sanity check on the mapped object trajectory to ensure that all constraints are satisfied. Such constraints include, e.g., that the trajectory cannot pass through building walls or pillars, propagate at enormous velocities, and the like. If constraints are violated, then the birds-eye view will distort the trajectory as required, which is conducted as a constrained optimization of likelihood. Once the birds-eye view constraints are satisfied, the birds-eye view reports to the network arbitrator, and the network arbitrator puts the trajectory into the higher-level site trajectory likelihood.
The network arbitrator is robust in handling errors to avoid failures, such as a prediction having no agreement with the camera view or with tag observations, camera view observations and/or tag observations stopping for various reasons, a blob being misconstrued as a different object and the misconstruction being propagated into another subarea of the site, invalid tag observations, and the like.
The network arbitrator resolves ambiguities. Fig. 56 shows an example, wherein the imaging device reports that a mobile object exits from an entrance on the right-hand side of the room. However, there are two entrances on the right-hand side, and ambiguity arises in that it is uncertain which of the two entrances the mobile object takes to exit from the room.

The CV/BV processing module reports both possible room-leaving paths to the network arbitrator. The network arbitrator processes both paths using camera view and tag observations until the likelihood of one of the paths attains a negligibly low probability, and that path is excluded.
Fig. 57 shows another example, wherein the network arbitrator may delay the choice among candidate routes (e.g., when the mobile object leaves the left-hand side room) if the likelihoods of the candidate routes are still high, and make a choice when one candidate route exhibits a sufficiently high likelihood. In Fig. 57, the upper route is eventually selected.
Those skilled in the art appreciate that many graph theory algorithms, such as the Viterbi algorithm, are readily available for selecting the most likely route from a plurality of candidate routes.
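A generic dynamic-programming sketch of the Viterbi selection mentioned above is given below. The route graph, the log-prior, the log-transition matrix and the per-step log-likelihoods from camera/tag evidence are assumed inputs supplied by the arbitrator; this is not code from this disclosure.

import numpy as np

def most_likely_route(log_prior, log_transition, log_observation):
    """Viterbi selection of the most likely route over candidate nodes.

    log_prior: (S,) log-probabilities of the starting nodes;
    log_transition: (S, S) log-probabilities of moving between nodes;
    log_observation: (T, S) per-step evidence log-likelihoods.
    """
    T, S = log_observation.shape
    score = log_prior + log_observation[0]
    backpointer = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        candidate = score[:, None] + log_transition        # (from, to) scores
        backpointer[t] = np.argmax(candidate, axis=0)
        score = candidate[backpointer[t], np.arange(S)] + log_observation[t]
    route = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                          # trace the best path backwards
        route.append(int(backpointer[t][route[-1]]))
    return list(reversed(route))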
If a tag device reports RSS measurements of a new set of WiFi access point transmissions, then a new approximate location can be determined, and the network arbitrator may request the CV/BV processing module to look for a corresponding blob among the detected blobs in the subarea of the WiFi access point.

VIII-11. Tag device

Tag devices are designed to reduce power consumption. For example, if a tag device is stationary for a predefined period of time, the tag device automatically shuts down, with a timing clock and the accelerometer remaining in operation. When the accelerometer senses sustained motion, i.e., not merely a single impulse disturbance, the tag device is automatically turned on and establishes communication with the network arbitrator. The network arbitrator may use the last known location of the tag device as its current location, and later updates its location with incoming information, e.g., new camera view observations, new tag observations and location prediction.
With suitable sensors therein, tag devices may obtain a variety of observations. For example:

• RSS of wireless signals: the tag device can measure the RSS of one or more wireless signals, indicate if the RSS measurements are increasing or decreasing, and determine the short-term variation thereof;

• walking step rate: which can be measured and compared directly with the bounding box in the camera view;

• magnetic abnormalities: the tag device may comprise a magnetometer for detecting a magnetic field with a magnitude, e.g., significantly above 40 µT;

• temperature, for obtaining additional inferences: for example, if the measured temperature is below a first predefined threshold, e.g., 37°C, then the tag device is away from the human body, and if the measured temperature is about 37°C, then the tag device is on the human body. Moreover, if the measured temperature is below a second predefined threshold, e.g., 20°C, then it may indicate that the associated mobile object is outdoors (a short classification sketch follows this list); and

• other measurements, e.g., the rms sound level.
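The temperature inference above can be expressed as a simple threshold test; the sketch below uses the 37°C and 20°C example thresholds from this list, while the one-degree tolerance band and the returned labels are illustrative assumptions.

def classify_tag_context(temperature_c, body_threshold_c=37.0, outdoor_threshold_c=20.0):
    """Coarse inference from the tag temperature: 'on-body', 'off-body indoor' or 'possibly outdoor'."""
    if temperature_c >= body_threshold_c - 1.0:   # about body temperature
        return "on-body"
    if temperature_c < outdoor_threshold_c:
        return "possibly outdoor"
    return "off-body indoor"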
Fig. 58B shows the initial condition flow and the output of the network arbitrator in the mobile object tracking example of Fig. 58A. A single mobile object moves across a room. The network arbitrator provides the birds-eye view with a set of initial conditions of the mobile object entering the current subarea. The birds-eye view maps the initial conditions into the location at which the new blob is expected. After a few image frames, the camera view affirms to the birds-eye view that it has detected the blob, and the blob-track file is initiated. The birds-eye view tracks the blob and updates the object-track file. The network arbitrator has access to the object-track file and can provide an estimate of the tagged object at any time. When the blob finally vanishes at an exit point, this event is logged in the blob-track file and the birds-eye view computes the end of the object track. The network arbitrator then assembles initial conditions for the next subarea. In this simple example, there is no query to the tag device as the identity of the blob was never in question.
A tagged object may be occluded by an untagged object. Fig. 59 shows an example, and the initial condition flow and the output of the network arbitrator are the same as in Fig. 58B. In this example, the initial conditions are such that the tagged object is known when it walks through the left-hand side entrance, and the untagged object is also approximately tracked. As the tracking progresses, the tagged object occasionally becomes occluded by the untagged object. The camera view will give multiple tracks for the tagged object. The untagged object is continuously trackable with feature points and optical flow. That is, the blob events of fusion and fission are sortable for the untagged object. In the birds-eye view, the computation from the blob-track file to the object-track file will request a sample of activity from the tag through the network arbitrator. In this scenario, the tag will reveal continuous walking activity, which, combined with the prior existence of only one tagged and one untagged object, forces the association of the segmented tracks of the object-track file with high probability. When the tagged object leaves the current subarea, the network arbitrator assembles initial conditions for the next subarea.
In this example, for additional confirmation, the tag device can be asked if it is undergoing a rotational motion. The camera view senses that the untagged object has gone through about 400 degrees of turning while the tagged object accumulated only 45 degrees. However, as the rate gyros require significantly more power than other sensors, such a request will not be sent to the tag device if the ambiguity can be resolved using other tag observations, e.g., observations from the accelerometer.

Fig. 60 shows the relationship between the camera view processing submodule, the birds-eye view processing submodule, and the network arbitrator/tag devices.

VIII-12. Birds-eye view (BV) Bayesian processing

In the following, the Bayesian update of the BV is described. The Bayesian update is basically a two-step process. The first step is a prediction of the object movement for the next frame time, followed by an update based on a general measurement. The prediction would be basic diffusion if nothing is known of the motion of the object. However, if an estimate of the blob velocity is available, and the association of the blob and the object is assured, then the estimate of the blob velocity is used. This velocity estimate is obtained from the blob Kalman filter tracking of the velocity state variable, based on the optical flow and feature point motion of the camera view bounding box with known information of the mobile object.
(i) Diffuse prediction probability based on arbitrary building wall constraints

In this embodiment, the site map has constraints of walls with predefined wall lengths and directions. Fig. 61 shows a 3D simulation of a room 1400 having an indentation 1402 representing a portion of the room that is inaccessible to any mobile objects. The room is partitioned into a plurality of grid points.

The iteration update steps are as follows (a simplified sketch of these steps is given after step S4):

S1. Let the input PDF be P0. Then Gaussian smearing or diffusion is applied by 2D convolution, resulting in P1. P1 represents the increase in the uncertainty of the object position based on underlying random motion.
S2. The Gaussian kernel has a half width such that P1 is larger than P0 by a border of that width. The system considers the walls to be reflecting walls such that the probability content in these borders is swept back inside the walls of P0.
S3. In the inaccessible region, the probability content of each grid point in the inaccessible region is set to that of the closest (in terms of Euclidean distance) wall grid point. The correspondence between the inaccessible grid points and the closest wall points is determined as part of the initialization process of the system, and thus is only done once. To save calculations in each iteration, every inaccessible grid point is pre-defined with a correction, forming an array of corrections. The structure of this matrix is [correction index, isource, jsource, isink, jsink].

S4. Finally, the probability density is normalized such that it has an integrated value of one (1). This is necessary as the corner fringe regions are not swept and hence there is a loss of probability.
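A simplified sketch of one such prediction iteration is shown below. Instead of the pre-computed source/sink correction array of step S3, probability that leaks into inaccessible cells is simply zeroed and restored by the final normalization; the kernel width is an assumed parameter.

import numpy as np
from scipy.signal import convolve2d

def diffuse_prediction(P0, accessible_mask, sigma_cells=1.0, half_width=3):
    """One prediction step of the grid pdf: Gaussian diffusion, handling of
    inaccessible cells, and re-normalization (a simplification of S1-S4)."""
    # S1: 2D Gaussian smearing of the input pdf
    ax = np.arange(-half_width, half_width + 1)
    g = np.exp(-ax ** 2 / (2.0 * sigma_cells ** 2))
    kernel = np.outer(g, g)
    kernel /= kernel.sum()
    # S2: 'symm' boundary reflects probability at the array border (reflecting walls)
    P1 = convolve2d(P0, kernel, mode="same", boundary="symm")
    # S3 (simplified): remove probability that diffused into inaccessible cells
    P1 = P1 * accessible_mask
    # S4: normalize so the pdf integrates to one
    total = P1.sum()
    return P1 / total if total > 0 else P1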
The probability after a sufficient number of iterations to approximate a steady state is given in Fig. 62 for the room example of Fig. 61. In this example, the process starts with a uniform density throughout the accessible portion of the room, implying no knowledge of where the mobile object is. Note that the probability is higher in the vicinity of the walls, as the probability impinging on the walls is swept back to the wall position. On the other hand, the probability in the interior is smaller but non-zero, and appears fairly uniform. Of course, this result is a product of the heuristic assumptions of appropriating probability mass that penetrates into inaccessible regions. Actually, when measurements are applied, the probability ridge at the wall contour becomes insignificant.
Figs. 63A and 63B show a portion of the MATLAB code used in the simulation.

(ii) Update based on a general measurement

Below, based on standard notation, x is used as the general state variable and z is used as a generic measurement related to the state variable. The Bayes rule is then applied as

    p(x|z) = p(z|x) p(x) / p(z) = η p(z|x) p(x),     (31)

where p(x) can be taken as the pdf prior to the measurement of z and p(x|z) is conditioned on the measurement. Note then that p(z|x) is the probability of the measurement given x. In other words, given the location x, p(z|x) is the likelihood of receiving a measurement z. Note that z is not a variable; rather it is a given measurement. Hence, as z is somewhat random in every iteration, then so is p(z|x), which can be a source of confusion.
Putting this into the evolving notation, the calculation of the pdf after the first measurement can be expressed as

    p^1_u,j,i = η p_z1,j,i p_u0,j,i,     (32)

where p_z1,j,i is the probability or likelihood of the observation z given that the object is located at the grid point {jΔg, iΔg}. The prior probability p_u0,j,i is initially modified based on the grid transition to generate the pdf with update as p^1_u,j,i. This is subsequently updated with the observation likelihood p_z1,j,i, resulting in the posterior probability p^1_u,j,i for the first update cycle. η is the universal normalization constant that is implied to normalize the pdf such that it always sums to 1 over the entire grid.
Consider the simplest example of an initially uniform PDF such that p_u0,j,i is constant and positive in the feasibility region, where the probability in the inaccessible regions is set to 0. Furthermore, assume that the object is known to be completely static such that there is no diffusion probability, or equivalently the Gaussian kernel of the transition probability is a delta function. We can then solve for the location pdf recursively as

    p^t_u,j,i = η p_zt,j,i p^(t-1)_u,j,i,     (33)

which, applied over t measurements, gives

    p^t_u,j,i = η_t ( p_z1,j,i p_z2,j,i ... p_zt,j,i ) p_u0,j,i.     (34)
Finally, assume that the observation likelihood is constant with respect to time such that p_zk,j,i = p_z,j,i. This implies that the same observation is made at each iteration but with different noise or uncertainty. For large t, the probability p^t_u,j,i will converge to a single delta function at the grid point where p_z,j,i is maximum (provided that p_u0,j,i is not zero at that point). It is also implicitly assumed that the measurements are statistically independent. Note that p_u0,j,i can actually be anything provided that there is a finite value at the grid point where p_z,j,i is maximum.
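The measurement half of this cycle, corresponding to equations (31) and (32), amounts to a pointwise multiply-and-normalize over the grid, as sketched below; the prior and the observation likelihood arrays are assumed to cover the same grid.

import numpy as np

def bayes_measurement_update(prior_pdf, observation_likelihood):
    """Multiply the gridded prior by the observation likelihood p(z|x) and
    re-normalize; the normalization plays the role of the constant eta."""
    posterior = prior_pdf * observation_likelihood
    total = posterior.sum()
    return posterior / total if total > 0 else posterior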
Next, consider the case where the update kernel has a finite deviation, which implies that there will be some diffusion of the location probability after each iteration. The measurement will reverse the diffusion. Hence we have two opposing processes, like the analogy of a sand pile where one process spreads the pile (the update probability kernel) and another builds up the pile (the observations). Eventually a steady-state equilibrium will result that is tantamount to the uncertainty of the location of the object.
As an example, consider a camera view observation, which is described as a Gaussian-shaped likelihood kernel (a PDF), and may be the BBTP estimate from the camera view. The Gaussian-shaped likelihood kernel may be a simple 2D Gaussian kernel shape represented by the mean and deviation. Fig. 64 shows a portion of the MATLAB code for generating such a PDF. Figs. 65A to 65C show the plots of p_u,j,i (the initial probability subject to the site map wall regions), p_zk,j,i (the measurement probability kernel, which has a constant shape in every iteration but with a "random" offset equivalent to the actual measurement z, and is the variable D in the MATLAB code of Fig. 64), and p (the probability after the measurement likelihood has been applied).
After a few iterations, a steady-state distribution is reached, an example of which is illustrated in Fig. 66. The steady state is essentially a weighting between the kernels of the diffusion and the observation likelihood. Note that in the example of Fig. 66, z is a constant such that p_zk,j,i is always the same. On the other hand, in practical cases there is no "steady state" distribution, as z is random.
Consider the above example where the camera view is tracking a blob for which the association of the blob and the mobile object is considered to be uninterrupted. In other words, there are no events causing ambiguity with regards to the one-to-one association between the moving blob and the moving mobile object. If nothing is known regarding the mobile object and the camera view does not track it with a Kalman filter velocity state variable, then the object probability merely diffuses in each prediction or update phase of the Bayesian cycle. This is tantamount to the object undergoing a two-dimensional random walk. The deviation of this random walk model is applied in the birds-eye view as it directly relates to the physical dimensions. Hence the camera view provides observations of the BBTP of a blob where nothing of the motion is assumed.
In the birds-eye view, the random walk deviation is made large enough such that the frame-by-frame excursions of the BBTP are accommodated. Note that if the deviation is made too small, then the tracking will become sluggish. Likewise, if the deviation is too large, then tracking will merely follow the measurement z and the birds-eye view will not provide any useful filtering or measurement averaging. Even if the object associated with the blob is unknown, the system is in an indoor environment tracking objects that generally do not exceed human walking agility. Hence, practical limits of the deviation can be placed.
A problem occurs when the camera view observations are interrupted by an obstruction of some sort, such as the object propagating behind an opaque wall.

Now there will be an interruption in the blob tracks, and the birds-eye view then has to consider whether these paths should be connected, i.e., whether they should be associated with the same object. If we calculate p without camera view observations, based on probability diffusion alone, we find that the probability "gets stuck" with p centered at the end point of the first path and an ever-expanding deviation representing the diffusion. The association to the beginning of the second path is then based on a likelihood that initially grows but only reaches a small level. Hence the association is weak and dubious. The camera view cannot directly assist with the association of the two path segments, as it makes no assumptions about the underlying object dynamics. However, the camera view does know the velocity of the blob just prior to the end of path 1, where camera view observations were lost.
Blob velocity can in principle be determined from the optical flow and the movement of feature points of the blob, resulting in a vector in the image plane. From this, a mean velocity of the BBTP can be inferred by the camera view processing submodule alone. The BBTP resides (approximately) on the floor surface, so we can map this velocity to the birds-eye view with the same routine that was used for mapping the BBTP uncertainty probability polygon onto the floor space. If the velocity vector is perfectly known, then the diffusion probability is a delta function offset by a displacement vector equal to the velocity vector times the frame update time. Practically, however, the velocity vector will have uncertainty associated with it, and the diffusion probability will include this with a deviation. It is reasonable that the velocity uncertainty grows with time, and therefore so should this deviation. This is of course heuristic, but a bias towards drifting the velocity towards zero is reasonable.
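A hedged sketch of this heuristic follows (the function name and the decay and growth parameters are illustrative assumptions, not values from the patent): during an occlusion the diffusion step is shifted by the last known BBTP velocity, with the deviation growing and the velocity decaying towards zero over time.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def occlusion_predict(p, velocity, dt, k, base_sigma_cells, cell_size,
                      decay=0.9, growth=0.2):
    """Prediction step while camera observations of the blob are lost.

    p: current position probability grid (rows = y, cols = x).
    velocity: last estimated BBTP velocity (vx, vy) in metres per second.
    dt: frame update time in seconds; k: frames elapsed since the occlusion began.
    base_sigma_cells: random-walk deviation in grid cells at the moment of occlusion.
    cell_size: grid spacing in metres.
    """
    v = np.asarray(velocity, dtype=float) * (decay ** k)   # drift the velocity towards zero
    offset_cells = v * dt / cell_size                      # displacement = velocity * frame time
    sigma = base_sigma_cells * (1.0 + growth * k)          # uncertainty grows with time
    moved = shift(p, (offset_cells[1], offset_cells[0]), order=1, mode="constant")
    diffused = gaussian_filter(moved, sigma)
    return diffused / diffused.sum()
```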

VIII-13. H matrix Processing

The following describes the H matrix processing necessary for the perspective transformations between the camera and world coordinate systems. The meaning of the variables in this section can be found in the tables of subsection "(vi) Data structures" below.

(i) Definition of rotation angles and translation

Blobs in a captured image may be mapped to a 3D coordinate system using perspective mapping. However, such a 3D coordinate system, denoted as a camera coordinate system, is defined from the view of the imaging device or camera that captures the image. As the site may comprise a plurality of imaging devices, there may exist a plurality of camera coordinate systems, each of which may only be useful for the respective subarea of the site.

On the other hand, the site has an overall 3D coordinate system, denoted as a world coordinate system, used for the site map and for tracking mobile objects therein. Therefore there may need to be a mapping between the world coordinate system and a camera coordinate system. The world and camera coordinate systems are right-hand systems.
Fig. 67A shows the orientation of the world and camera coordinate systems, with the translation vector T = [0 0 -h]^T. First rotate about X_c by (-\pi/2) as in Fig. 67B. The rotation matrix is

R_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}. (35)

Next, rotate in azimuth about Y_c in the positive direction by az as in Fig. 67C. The rotation matrix is given as

R_2 = \begin{bmatrix} C & 0 & -S \\ 0 & 1 & 0 \\ S & 0 & C \end{bmatrix}, (36)

where C = cos(az) and S = sin(az). Finally we apply the down tilt of atilt as shown in Fig. 67D. The rotation is given by

R_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & C & S \\ 0 & -S & C \end{bmatrix}, (37)

where C = cos(atilt) and S = sin(atilt). The overall rotation matrix is R = R_3 R_2 R_1, wherein the order of the matrix multiplication is important.

After the translation and rotation, the camera scaling (physical distance to pixels) and the offset in pixels are applied:

x = s \frac{x_c}{z_c} + o_x, (38)

y = s \frac{y_c}{z_c} + o_y, (39)

where x and y are the focal image plane coordinates in terms of pixels.
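As an illustrative numerical sketch (not code from the patent; it assumes the matrix entries reconstructed above), the overall rotation and the pixel mapping of equations (38) and (39) can be written as:

```python
import numpy as np

def build_rotation(az, atilt):
    """Compose R = R3 * R2 * R1 from the azimuth and downtilt angles (radians)."""
    R1 = np.array([[1, 0, 0],
                   [0, 0, -1],
                   [0, 1, 0]], dtype=float)      # rotation about Xc by -pi/2
    C, S = np.cos(az), np.sin(az)
    R2 = np.array([[C, 0, -S],
                   [0, 1, 0],
                   [S, 0, C]], dtype=float)      # azimuth rotation about Yc
    C, S = np.cos(atilt), np.sin(atilt)
    R3 = np.array([[1, 0, 0],
                   [0, C, S],
                   [0, -S, C]], dtype=float)     # downtilt rotation
    return R3 @ R2 @ R1                          # the order of multiplication matters

def camera_to_pixels(pc, s, ox, oy):
    """Apply the scaling and offsets of equations (38) and (39) to a camera-frame point."""
    xc, yc, zc = pc
    return s * xc / zc + ox, s * yc / zc + oy
```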

(ii) Direct generation of the H matrix

The projective mapping matrix is given as H = [R  -RT], with the mapping of a world point to a camera point as

\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = H \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}. (40)

Note that we still have to apply the offset and the scaling to map into the focal-plane pixels.
7 (iii) Determining the H matrix directly from the image frame.
8 Instead of using the angles and camera height from the floor plane to 9 get R and T and subsequently H, we can compute H directly from an image frame if we have a set of points on the floor and image that correspond. These are called 11 control points. This is very useful procedure as it allows us to map from the set of 12 control points to H to R and T. To illustrate this, suppose we have a picture that is 13 viewed with the camera from which we can determine the four vertex points as 14 shown in Figs. 68A and 68B.
We can easily look at the camera frame and pick out the 4 vertex 16 points of the picture unambiguously. Suppose that the vertex points of Pout are 17 given by (-90, -100), (90, -100), (90, 100) and (-90, 100). The corresponding vertex 18 points in the camera image are given as (0.5388, 1.2497), (195.7611, 39.3345), 19 (195.7611, 212.3656) and (0.8387, 251.3501). We can then run a suitable function, e.g., the cp2tform() MATLABO function, to determine the inverse projective 21 transform. The MATLAB code is shown in Fig. 69.

1 In Fig. 69, [g1,g2] is the set of input points of the orthographic view, 2 which is the corner vertex points of the image. [x,y] is the set of output points, which 3 are the vertex points of the image picked off the perspective image.
These are used 4 to construct the transformation matrix H. H can be used in, e.g., the MATLABO
imtransform() function to "correct" the distorted perspective image Fig. 68B
back to 6 the orthographic view resulting in Fig. 70.
7 Note that here we have used 4 vertex points. We may alternatively 8 use more points and then H will be solved in a least-square sense.
9 The algorithm contained in cp2tform() and imtransform() is based on selecting control points that are contained in a common plane in the world reference 11 frame. In the current case, the control points reside on the Zw = 0 plane. We will 12 use the constraint of [ff:xyl F[R1211 R2 1 [[R112 =R

R T ffww yx = H xy (41) [R3]1 [R3]2 1 1 13 to first determine H and then extract the coefficients of {R, T}. The elements of H
14 are denoted as H = {H11 H12 H13 H21 H22 H23I. (42) Note that the first two columns of H are the first two columns of R and 16 the third column of H is -RT. The object then is to determine the 9 components of H
17 from the pin hole image components. We have =

f 1 = f fc cx =Hid' . + Hi2f,õõ + H13 . Hawx + H32fwy + H33' fcyH2ifwx + _____________________ H22fwy + H23 f Y fcz Hawx + H32fwy + H33' (43) = =
1 which is rearranged as i H3ifxfwx + H32fxfwy + H33fx = Hilfwx + Hi2fwy + H13, (44) (H31fyfwx + H32fyfvvy + H33fy = H21fivx + H22fwy + H23' 2 This results in a pair of constraints expressed as (uxb= 0, t (45) u b = O.
Y
3 where b = [H11 { H12 H13 I-121 H22 H23 H31 H32 H33f, ux = [¨fw, ¨fwy ¨1 0 0 0 fxfwx fxfwy fx], (46) u, = [0 0 0 .f.wx -fWy -1 fy fwx fy fwy fy]=
4 Note that we have a set of 4 points in 2D giving us 8 constraints but 9 coefficients of H. This is consistent with the solution of the homogeneous equation 6 given to within a scaling constant as .
ux,i uy,i i b= [1.
ux,4 uy,4 0 (47) 7 Defining the matrix ux,1 =
u = uy.,11 (48) ux,4 tly,4 8 we have Ub = 08.
179 =

1 As stated above, any arbitrary line in the world reference frame is 2 mapped into a line on the image plane. Hence the four lines of a quadrilateral in the 3 world plane of Z, = 0 are mapped into a quadrilateral in the image plane.
Each 4 quadrilateral is defined uniquely by the four vertices, hence 8 parameters. We have 8 conditions which is sufficient to evaluate the perspective transformation including 6 any scaling. The extra coefficient in H is due to a constraint that we have not 7 explicitly imposed due to the desire to minimize complexity. This constraint is that 8 the determinant of R is unity. The mapping in Equation (41) does not include this 9 constraint and therefore we have two knobs that both result in the same scaling of the image. For example we can scale R by a factor of 2 and reduce the magnitude 11 of T and leave the scaling of the image unchanged. Including a condition that IRI=1 12 or fixing T to a constant magnitude ruins the linear formulation of Equation (41).
13 Hence we opt for finding the homogeneous solution to Equation (41) to within a 14 scaling factor and then determining the appropriate scaling afterwards.
Using the singular value decomposition method (SVD), we have U = xv=wH. (49) 16 As U is an 8x9 matrix the matrix, x is an 8x8 matrix of left singular vectors and w is 17 a 9x9 matrix of right singular vectors. If there is no degeneracy in the vertex points 18 of the two quadrilaterals (i.e., no three points are on a line) then the matrix v of 19 singular values will be an 8x9 matrix where the singular values will be along the diagonal of the left 8x8 component of v with the 9th column as all zeros. Now let the 21 9th column of w be Wo, which is a unit vector orthogonal to the first 8 column 22 vectors of w. Hence we can write =

Uwo = xvwHwo = xv o = x09,1 [
= 9xl= .
(50) 1 Hence Wo is the desired vector that is the solution of the 2 homogeneous equation to within a scaling factor. That is, b = Wo. The SVD
method 3 is more robust in terms of the problem indicated above that H33 could potentially be 4 zero. However, the main motivation for using the SVD is that the vertices of the imaged quadrilaterals will generally be slightly noisy with lost resolution due to the 6 spatial quantization. However, the 2D template pattern may have significantly many 7 more feature points than the minimum four assumed. The advantage of using the 8 SVD method is that it provides a convenient method of incorporating any number of 9 feature point observations greater or equal to the minimum requirement of 4.
Suppose n feature points are used. Then the v matrix will be 2nx9 with the form of a 11 9x9 diagonal matrix with the 9 singular values and the bottom block matrix of size 12 (2n-9)x9 of all zeros. The singular values will be nonzero due to the noise and 13 hence there will not be a right singular vector that corresponds to the null space of 14 U. However, if the noise or distortion is minor then one of the singular values will be much smaller than the other 8. The right singular vector that corresponds to this 16 singular value is the one that will result in smallest magnitude of the residual 17 IlAwo 112. This can be shown as follows IlAw0112 = 4AHAw0 = wwvx H xvwwo =
(51) = "-smallest =

= .
where \lambda_{smallest}^2 denotes the square of the smallest singular value and w_0 is the corresponding right singular vector.

Once w_0 is determined by the SVD of U, we equate b = w_0 and H is extracted from b. We then need to determine the scaling of H.
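The following sketch (illustrative Python rather than the patent's MATLAB®) builds the 2n x 9 matrix U of equation (48) from n >= 4 control points and takes the right singular vector associated with the smallest singular value as b, as described above; fixing the overall scale of H is left as the separate step noted in the text:

```python
import numpy as np

def homography_from_control_points(world_pts, image_pts):
    """Estimate H up to scale from n >= 4 coplanar control points on the Zw = 0 plane.

    world_pts: (n, 2) array of (fwx, fwy); image_pts: (n, 2) array of (fx, fy).
    """
    rows = []
    for (fwx, fwy), (fx, fy) in zip(world_pts, image_pts):
        rows.append([-fwx, -fwy, -1, 0, 0, 0, fx * fwx, fx * fwy, fx])   # u_x of equation (46)
        rows.append([0, 0, 0, -fwx, -fwy, -1, fy * fwx, fy * fwy, fy])   # u_y of equation (46)
    U = np.asarray(rows, dtype=float)
    _, _, Wt = np.linalg.svd(U)        # U = x v w^H
    b = Wt[-1]                         # right singular vector of the smallest singular value
    return b.reshape(3, 3)             # H, still subject to an arbitrary scaling
```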
Once H is determined, we can map any point in the Z_w = 0 plane to the image plane based on

f_x = \frac{H_{11} f_{wx} + H_{12} f_{wy} + H_{13}}{H_{31} f_{wx} + H_{32} f_{wy} + H_{33}}, \qquad f_y = \frac{H_{21} f_{wx} + H_{22} f_{wy} + H_{23}}{H_{31} f_{wx} + H_{32} f_{wy} + H_{33}}. (52)

(iv) Obtaining R and T from H
From H we can determine the angles associated with the rotation and the translation vector. The details of this depend on the set of variables used. One possibility is the Euler angles {ax, ay, az}, a translation {xr, yr, zr} and scaling values {sx, sy, sz}. The additional variable s is a scaling factor that is necessary as H will generally have an arbitrary scaling associated with it. Additionally, there are scaling coefficients {sx, sy} that account for the pixel dimensions in x and y. We have left out the offset parameters {ox, oy}; these can be assumed to be part of the translation T. Furthermore, the parameters {ox, oy, sx, sy} are generally assumed to be known as part of the camera calibration.
The finalized model for H is then

H = s \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} [R_1]_1 & [R_1]_2 & -[RT]_1 \\ [R_2]_1 & [R_2]_2 & -[RT]_2 \\ [R_3]_1 & [R_3]_2 & -[RT]_3 \end{bmatrix}. (53)

(v) Mapping from the image pixel to the floor plane

The mapping from the camera image to the floor surface is nonlinear and implicit. Hence we use the MATLAB® fsolve() function to determine the solution of the set of equations for {xw, yw}. For this example we assume that H is known from the calibration, as well as s, ox and oy.
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = H \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix}, (54)

x = s \frac{x_c}{z_c} + o_x, (55)

y = s \frac{y_c}{z_c} + o_y. (56)

Note that z_w has been set to zero, as we are assuming the point is on the floor surface.
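A minimal numerical sketch of this step (assuming H, s, ox and oy are known from the calibration; the function name is illustrative) mirrors the fsolve() approach with SciPy:

```python
import numpy as np
from scipy.optimize import fsolve

def pixel_to_floor(x_pix, y_pix, H, s, ox, oy, guess=(0.0, 0.0)):
    """Map an image pixel back to floor coordinates (xw, yw) with zw = 0.

    Numerically solves equations (54) to (56) for {xw, yw}.
    """
    def residual(p):
        xw, yw = p
        xc, yc, zc = H @ np.array([xw, yw, 1.0])
        return [s * xc / zc + ox - x_pix,
                s * yc / zc + oy - y_pix]

    xw, yw = fsolve(residual, guess)
    return xw, yw
```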

(vi) Data structures

Structures are used to group the data and pass it to functions as global variables. These are given as follows.

buildmap - describes the map of the site, including the structure of all building dimensions and the birds-eye floor plan map. Its members are as follows:

member    description
XD        Overall x dimension of the floor in meters
YD        Overall y dimension of the floor in meters
dl        Increment between grid points
Nx, Ny    Number of grid points in x and y

scam - structure of parameters related to the security camera. We are assuming the camera to be located at x = y = 0 and a height of h in meters.

member    description
h         Height of the camera in meters
az        Azimuth angle in radians
atilt     Downtilt angle of the camera in radians
s         Scaling factor
ox        Offset in x in pixels
oy        Offset in y in pixels
T         3D translation vector from world center to camera center in world coordinates
H         Projective mapping matrix from world to camera coordinates

obj - structure of parameters related to each object (multiple objects can be accommodated).

member    description
xo, yo    Initial position of the object
H, w, d   Height, width and depth of the object
-         Homogeneous color of the object in [R, G, B]
vx, vy    Initial velocity of the object

misc - miscellaneous parameters.

member    description
Nf        Number of video frames
-         Index of the current video frame
Vd        Video frame array
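For illustration only, these structures could be mirrored as Python dataclasses (member names that are not legible in the source, such as the colour field, are assumptions):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BuildMap:              # buildmap: map of the site
    XD: float                # overall x dimension of the floor in meters
    YD: float                # overall y dimension of the floor in meters
    dl: float                # increment between grid points
    Nx: int                  # number of grid points in x
    Ny: int                  # number of grid points in y

@dataclass
class SCam:                  # scam: security camera located at x = y = 0
    h: float                 # height of the camera in meters
    az: float                # azimuth angle in radians
    atilt: float             # downtilt angle in radians
    s: float                 # scaling factor
    ox: float                # offset in x in pixels
    oy: float                # offset in y in pixels
    T: np.ndarray = None     # 3D translation vector, world center to camera center
    H: np.ndarray = None     # projective mapping matrix, world to camera

@dataclass
class Obj:                   # obj: one tracked object
    xo: float                # initial x position
    yo: float                # initial y position
    h: float                 # height
    w: float                 # width
    d: float                 # depth
    color: tuple = (0, 0, 0) # homogeneous colour in [R, G, B] (name assumed)
    vx: float = 0.0          # initial x velocity
    vy: float = 0.0          # initial y velocity
```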
As those skilled in the art appreciate, in some embodiments a site may be divided into a number of subareas, with each subarea having one or more "virtual" entrances and/or exits. For example, a hallway may have a plurality of pillars or posts blocking the FOVs of one or more imaging devices. The hallway may then be divided into a plurality of subareas defined by the pillars, and the space between pillars for entering a subarea may be considered a "virtual" entrance for the purposes of the system described herein.
Moreover, in some other embodiments, a "virtual" entrance may be the boundary of the FOV of an imaging device, and the site may be divided into a plurality of subareas based on the FOVs of the imaging devices deployed in the site.
The system provides initial conditions for objects entering the FOV of the imaging device, as described above. In these embodiments, the site may or may not have any obstructions, such as walls and/or pillars, for defining each subarea.
As those skilled in the art appreciate, the processes and methods described above may be implemented as computer executable code, in the form of software applications and modules, firmware modules, and combinations thereof, which may be stored in one or more non-transitory, computer readable storage devices or media such as hard drives, solid state drives, floppy drives, Compact Disc Read-Only Memory (CD-ROM) discs, DVD-ROM discs, Blu-ray discs, Flash drives, Read-Only Memory chips such as erasable programmable read-only memory (EPROM), and the like.
Although embodiments have been described above with reference to the accompanying drawings, those of skill in the art will appreciate that variations and modifications may be made without departing from the scope thereof as defined by the appended claims.


Claims (45)

WHAT IS CLAIMED IS:
1. A system for tracking at least one mobile object in a site, the system comprising:
one or more imaging devices capturing images of at least a portion of the site, and one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith, and at least one processing structure combining the captured images with at least one of the one or more tag measurements for tracking the at least one mobile object.
2. The system of claim 1 wherein said one or more sensors comprise at least one of an Inertial Measurement Unit (IMU), a barometer, a thermometer, a magnetometer, a global navigation satellite system (GNSS) sensor, an audio frequency microphone, a light sensor, a camera, and a receiver signal strength (RSS) measurement sensor.
3. The system of claim 1 or 2 wherein the at least one processing structure analyzes images captured by the one or more imaging devices for determining a set of candidate tag devices for providing said at least one of the one or more tag measurements
4. The system of claim 3 wherein the at least one processing structure analyzes images captured by the one or more imaging devices for selecting said at least one of the one or more tag measurements.
5. The system of any one of claims 1 to 4 wherein each of the tag devices provides the at least one of the one or more tag measurements to the at least one processing structure only when said tag device receives from the at least one processing structure a request for providing the at least one of the one or more tag measurements.
6. The system of any one of claims 1 to 5 wherein the at least one processing structure identifies, from the captured images, one or more foreground feature clusters (FFCs) for tracking the at least one mobile object, and determines a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
7. The system of claim 6 wherein at least one processing structure associates each tag device with one of the FFCs.
8. The system of claim 7 wherein, when associating a tag device with a FFC, the at least one processing structure calculates an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
9. The system of claim 8 wherein said FFC-tag association probability is calculated based on a set of consecutively captured images.
10. The system of any one of claims 6 to 9 wherein, after detecting the one or more FFCs, the at least one processing structure determines the location of each of the one or more FFCs in the captured image, and maps each of the one or more FFCs to a three-dimensional (3D) coordinate system of the site by using perspective mapping.
11. The system of any one of claims 6 to 10 wherein each FFC
corresponds to a mobile object, and wherein the at least one processing structure tracks the FFCs using a first order Markov process
12. The system of claim 11 wherein the at least one processing structure tracks the FFCs using a Kalman filter with a first order Markov Gaussian process.
13. The system of any one of claims 6 to 12 wherein, when tracking each of the FFCs, the at least one processing structure uses the coordinates of the corresponding mobile object in a 3D coordinate system of the site as state variables, and the coordinates of the FFC in a two dimensional (2D) coordinate system of the captured images as observations for the state variables, and wherein the at least one processing structure maps the coordinates of the corresponding mobile object in a 3D coordinate system of the site to the 2D coordinate system of the captured images.
14. The system of any one of claims 1 to 13 wherein the at least one processing structure discretizes at least a portion of the site into a plurality of grid points, and wherein, when tracking a mobile object in said discretized portion of the site, the at least one processing structure uses said grid points for approximating the location of the mobile object
15. The system of claim 14 wherein, when tracking a mobile object in said discretized portion of the site, the at least one processing structure calculates a posterior position probability of the mobile object.
16. A method of tracking at least one mobile object in at least one visual field of view, comprising capturing at least one image of the at least one visual field of view, identifying at least one candidate mobile object in the at least one image, obtaining one or more tag measurements from at least one tag device, each of said at least one tag device being associated with a mobile object and moveable therewith, and tracking at least one mobile object using the at least one image and the one or more tag measurements
17. The method of claim 16 further comprising:
analyzing the at least one image for determining a set of candidate tag devices for providing said one or more tag measurements
18. The method of claim 16 or 17 further comprising:
analyzing the at least one image for selecting said at least one of the one or more tag measurements.
19. The method of any one of claims 16 to 18 further comprising:
identifying, from the at least one image, one or more foreground feature clusters (FFCs) for tracking the at least one mobile object, and determining a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
20. The method of claim 19 further comprising:
associating each tag device with one of the FFCs
21. The method of claim 20 further comprising calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC
22. The method of any one of claims 19 to 21 further comprising tracking the FFCs using a first order Markov process
23. The method of any one of claims 16 to 22 further comprising:
discretizing at least a portion of the site into a plurality of grid points, and tracking a mobile object in said discretized portion of the site by using said grid points for approximating the location of the mobile object
24. A non-transitory, computer readable storage device comprising computer-executable instructions for tracking at least one mobile object in a site, wherein the instructions, when executed, cause one or more processing structure to perform actions comprising' capturing at least one image of the at least one visual field of view;
identifying at least one candidate mobile object in the at least one image, obtaining one or more tag measurements from at least one tag device, each of said at least one tag device being associated with a mobile object and moveable therewith, and tracking at least one mobile object using the at least one image and the one or more tag measurements
25. The storage device of claim 24 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising:
calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC.
26. The storage device of claim 24 or 25 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising:
analyzing the at least one image for selecting said at least one of the one or more tag measurements
27. The storage device of any one of claims 24 to 26 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising identifying, from the at least one image, one or more foreground feature clusters (FFCs) for tracking the at least one mobile object, and determines a bounding box and a tracking point therefor, said tracking point being at a bottom edge of the bounding box.
28. The storage device of claim 27 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising:
associating each tag device with one of the FFCs
29. The storage device of claim 28 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising calculating an FFC-tag association probability indicating the reliability of the association between the tag device and the FFC
30. The storage device of any one of claims 24 to 29 further comprising computer-executable instructions, when executed, causing the one or more processing structure to perform actions comprising.
discretizing at least a portion of the site into a plurality of grid points, and tracking a mobile object in said discretized portion of the site by using said grid points for approximating the location of the mobile object.
31. A system for tracking at least one mobile object in a site, the system comprising:
at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site and capturing images of at least a portion of the first subarea, the first subarea having at least a first entrance; and one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith; and at least one processing structure for.
determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance, and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object.
32. The system of claim 31 wherein the at least one processing structure builds a birds-eye view based on a map of the site, for mapping the at least one mobile object therein
33. The system of claim 31 or 32 wherein said one or more initial conditions comprise data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance
34. The system of any one of claims 31 to 33 further comprising:
at least a second imaging device having an FOV overlapping a second subarea of the site and capturing images of at least a portion of the second subarea, the first and second subareas sharing the at least first entrance;
and wherein the one or more initial conditions comprise data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
35. The system of any one of claims 31 to 34 wherein the first subarea comprises at least one obstruction in the FOV of the at least first imaging device, and wherein the at least one processing structure uses a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.
36. A method for tracking at least one mobile object in a site, the method comprising obtaining a plurality of images captured by at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site, the first subarea having at least a first entrance, obtaining tag measurements from one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith, determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance; and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object.
37. The method of claim 36 further comprising:
building a birds-eye view based on a map of the site, for mapping the at least one mobile object therein.
38. The method of claim 36 or 37 further comprising assembling said one or more initial conditions using data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance
39. The method of any one of claims 36 to 38 further comprising:
obtaining images captured by at least a second imaging device having an FOV overlapping a second subarea of the site, the first and second subareas sharing the at least first entrance, and assembling the one or more initial conditions using data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance.
40. The method of any one of claims 36 to 39 wherein the first subarea comprises at least one obstruction in the FOV of the at least first imaging device; and the method further comprising:
using a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.
41. One or more non-transitory, computer readable media storing computer executable code for tracking at least one mobile object in a site, the computer executable code comprising computer executable instructions for.
obtaining a plurality of images captured by at least a first imaging device having a field of view (FOV) overlapping a first subarea of the site, the first subarea having walls and at least a first entrance;
obtaining tag measurements from one or more tag devices, each of the one or more tag devices being associated with one of the at least one mobile object and moveable therewith, each of the one or more tag devices having one or more sensors for obtaining one or more tag measurements related to the mobile object associated therewith;
determining one or more initial conditions of the at least one mobile object entering the first subarea from the at least first entrance; and combining the one or more initial conditions, the captured images, and at least one of the one or more tag measurements for tracking the at least one mobile object
42. The computer readable media of claim 41 wherein the computer executable code further comprises computer executable instructions for.
building a birds-eye view based on a map of the site, for mapping the at least one mobile object therein.
43. The computer readable media of claim 41 or 42 wherein the computer executable code further comprises computer executable instructions for assembling said one or more initial conditions using data determined from one or more tag measurements regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance
44. The computer readable media of any one of claims 41 to 43 wherein the computer executable code further comprises computer executable instructions for:
obtaining images captured by at least a second imaging device having an FOV overlapping a second subarea of the site, the first and second subareas sharing the at least first entrance; and assembling the one or more initial conditions using data determined from the at least second imaging device regarding the at least one mobile object before the at least one mobile object enters the first subarea from the at least first entrance
45. The computer readable media of any one of claims 41 to 44 wherein the first subarea comprises at least one obstruction in the FOV of the at least first imaging device; and wherein the computer executable code further comprises computer executable instructions for:
using a statistical model based estimation for resolving ambiguity during tracking when the at least one mobile object temporarily moves behind the obstruction.
CA2934102A 2015-06-25 2016-06-22 A system and a method for tracking mobile objects using cameras and tag devices Abandoned CA2934102A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US201562184726P 2015-06-25 2015-06-25
US62/184,726 2015-06-25
US14/866,499 US10088549B2 (en) 2015-06-25 2015-09-25 System and a method for tracking mobile objects using cameras and tag devices
US14/866,499 2015-09-25
US201562236412P 2015-10-02 2015-10-02
US62/236,412 2015-10-02
US14/997,977 US20160379074A1 (en) 2015-06-25 2016-01-18 System and a method for tracking mobile objects using cameras and tag devices
US14/997,977 2016-01-18

Publications (1)

Publication Number Publication Date
CA2934102A1 true CA2934102A1 (en) 2016-12-25

Family

ID=57584394

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2934102A Abandoned CA2934102A1 (en) 2015-06-25 2016-06-22 A system and a method for tracking mobile objects using cameras and tag devices

Country Status (2)

Country Link
CA (1) CA2934102A1 (en)
WO (1) WO2016205951A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610120A (en) * 2019-05-16 2019-12-24 宁波中国科学院信息技术应用研究院 Face track matching method
CN114567726A (en) * 2022-02-25 2022-05-31 苏州安智汽车零部件有限公司 Human-eye-like self-adaptive shake-eliminating front-view camera
CN116026276A (en) * 2023-03-28 2023-04-28 江苏集萃清联智控科技有限公司 Method and device for measuring rotation angle of external turntable of gantry crane for port
CN116128734A (en) * 2023-04-17 2023-05-16 湖南大学 Image stitching method, device, equipment and medium based on deep learning

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201604535D0 (en) * 2016-03-17 2016-05-04 Apical Ltd 2D to 3D
CN109121143A (en) * 2017-06-23 2019-01-01 联芯科技有限公司 A kind of position mark method, terminal and computer readable storage medium
WO2019012274A2 (en) * 2017-07-12 2019-01-17 Social Graffiti Limited Identity matching mobile electronic devices
US20190059002A1 (en) * 2017-08-21 2019-02-21 Pc-Tel, Inc. Systems and methods for verifying acceptable performance of a network in a building
EP3451013B1 (en) * 2017-08-29 2023-05-03 Tata Consultancy Services Limited Method and system for entry and exit monitoring using bluetooth low energy (ble) beacons
EP3486834A1 (en) * 2017-11-16 2019-05-22 Smart Eye AB Detection of a pose of an eye
US11213224B2 (en) * 2018-03-19 2022-01-04 Electronic Caregiver, Inc. Consumer application for mobile assessment of functional capacity and falls risk
EP3581956A1 (en) 2018-06-14 2019-12-18 Swiss Timing Ltd. Method for calculating a position of an athlete on a sports field
CN108769969B (en) * 2018-06-20 2021-10-15 吉林大学 RFID indoor positioning method based on deep belief network
TWI662514B (en) * 2018-09-13 2019-06-11 緯創資通股份有限公司 Falling detection method and electronic system using the same
CN109708632B (en) * 2019-01-31 2024-05-28 济南大学 Laser radar/INS/landmark-pine combined navigation system and method for mobile robot
CN110222704B (en) * 2019-06-12 2022-04-01 北京邮电大学 Weak supervision target detection method and device
CN111684457B (en) * 2019-06-27 2024-05-03 深圳市大疆创新科技有限公司 State detection method and device and movable platform
CN111654818A (en) * 2020-06-29 2020-09-11 青岛歌尔智能传感器有限公司 Bluetooth positioning method, mobile terminal and storage medium
CN111862157B (en) * 2020-07-20 2023-10-10 重庆大学 Multi-vehicle target tracking method integrating machine vision and millimeter wave radar
CN112307970B (en) * 2020-10-30 2024-04-12 阿波罗智联(北京)科技有限公司 Training data acquisition method and device, electronic equipment and storage medium
CN112491648B (en) * 2020-11-17 2022-03-08 重庆美沣秦安汽车驱动系统有限公司 Automobile communication data conversion method based on CAN communication matrix and storage medium
CN112329743B (en) * 2021-01-04 2021-04-27 华东交通大学 Abnormal body temperature monitoring method, device and medium in epidemic situation environment
CN113297905B (en) * 2021-04-19 2024-08-02 北京迈格威科技有限公司 Target tracking method, device and electronic system
CN114842458B (en) * 2022-06-29 2022-11-04 小米汽车科技有限公司 Obstacle detection method, obstacle detection device, vehicle, and storage medium
SE546129C2 (en) * 2022-10-17 2024-06-04 Topgolf Sweden Ab Method and system for optically tracking moving objects
CN116363090A (en) * 2023-03-21 2023-06-30 国网山东省电力公司莱芜供电公司 Method and system for generating missing of bolt pin of power transmission line

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6064749A (en) * 1996-08-02 2000-05-16 Hirota; Gentaro Hybrid tracking for augmented reality using both camera motion detection and landmark tracking
TWI416068B (en) * 2009-12-10 2013-11-21 Ind Tech Res Inst Object tracking method and apparatus for a non-overlapping-sensor network
US20110169917A1 (en) * 2010-01-11 2011-07-14 Shoppertrak Rct Corporation System And Process For Detecting, Tracking And Counting Human Objects of Interest

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610120A (en) * 2019-05-16 2019-12-24 宁波中国科学院信息技术应用研究院 Face track matching method
CN110610120B (en) * 2019-05-16 2024-04-26 宁波中科信息技术应用研究院(宁波人工智能产业研究院) Face track matching method
CN114567726A (en) * 2022-02-25 2022-05-31 苏州安智汽车零部件有限公司 Human-eye-like self-adaptive shake-eliminating front-view camera
CN116026276A (en) * 2023-03-28 2023-04-28 江苏集萃清联智控科技有限公司 Method and device for measuring rotation angle of external turntable of gantry crane for port
CN116128734A (en) * 2023-04-17 2023-05-16 湖南大学 Image stitching method, device, equipment and medium based on deep learning

Also Published As

Publication number Publication date
WO2016205951A1 (en) 2016-12-29

Similar Documents

Publication Publication Date Title
US20160379074A1 (en) System and a method for tracking mobile objects using cameras and tag devices
CA2934102A1 (en) A system and a method for tracking mobile objects using cameras and tag devices
US10088549B2 (en) System and a method for tracking mobile objects using cameras and tag devices
Kunhoth et al. Indoor positioning and wayfinding systems: a survey
Winterhalter et al. Accurate indoor localization for RGB-D smartphones and tablets given 2D floor plans
Germa et al. Vision and RFID data fusion for tracking people in crowds by a mobile robot
Fod et al. A laser-based people tracker
Leigh et al. Person tracking and following with 2d laser scanners
US9411037B2 (en) Calibration of Wi-Fi localization from video localization
Fusco et al. Indoor localization for visually impaired travelers using computer vision on a smartphone
Cui et al. Tracking multiple people using laser and vision
Ye et al. Co-robotic cane: A new robotic navigation aid for the visually impaired
Canedo-Rodríguez et al. Particle filter robot localisation through robust fusion of laser, WiFi, compass, and a network of external cameras
US20140226855A1 (en) Subject sensing in an environment
WO2007044044A2 (en) Method and apparatus for tracking objects over a wide area using a network of stereo sensors
Ye et al. 6-DOF pose estimation of a robotic navigation aid by tracking visual and geometric features
Papaioannou et al. Tracking people in highly dynamic industrial environments
Amanatiadis A multisensor indoor localization system for biped robots operating in industrial environments
Yan et al. 3-D passive-vision-aided pedestrian dead reckoning for indoor positioning
Nguyen et al. Confidence-aware pedestrian tracking using a stereo camera
Ruotsalainen et al. Improving computer vision-based perception for collaborative indoor navigation
Merino et al. Data fusion in ubiquitous networked robot systems for urban services
Nakamura et al. Human sensing in crowd using laser scanners
Yamaguchi et al. Towards Intelligent Environments: Human Sensing through Point Cloud
Schiff et al. Automated intruder tracking using particle filtering and a network of binary motion sensors

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20200831
