WO2007147171A2 - Scalable clustered camera system and method for multiple object tracking - Google Patents

Scalable clustered camera system and method for multiple object tracking

Info

Publication number
WO2007147171A2
Authority
WO
WIPO (PCT)
Prior art keywords
camera
cameras
tracking
label
message
Prior art date
Application number
PCT/US2007/071501
Other languages
French (fr)
Other versions
WO2007147171A3 (en)
Inventor
Senem Velipasalar
Jason Schlessman
Cheng-Yao Cheng
Wayne H. Wolf
Jaswinder P. Singh
Original Assignee
Verificon Corporation
Priority date
Filing date
Publication date
Application filed by Verificon Corporation filed Critical Verificon Corporation
Publication of WO2007147171A2 publication Critical patent/WO2007147171A2/en
Publication of WO2007147171A3 publication Critical patent/WO2007147171A3/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/188: Capturing isolated or intermittent images triggered by the occurrence of a predetermined event, e.g. an object reaching a predetermined position
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/90: Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/181: Closed-circuit television [CCTV] systems for receiving images from a plurality of remote sources
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/14: Picture signal circuitry for video frequency region
    • H04N 5/144: Movement detection

Definitions

  • server-based multi-camera systems have a bandwidth scaling problem, and are limited by the server capacity.
  • the nodes keep the server updated by sending it messages for each tracker in their FOV.
  • unlike the communication protocol used by SCCS, and due to the different processing rates of the distinct processors coupled with communication delays, a server keeps the received data buffered to provide consistent data transfer between the nodes.
  • the server does not need such a buffer, however, if the nodes are required to wait at each synchronization point until they receive an overall done message from the server.
  • each node needs to send a message for each tracker.
  • These messages also indicate if the node has a request from any of the other nodes or not.
  • the server handles all these messages, determines the replies for each request, if there were any, and sends the replies to the corresponding nodes.
  • the nodes update their trackers after receiving the replies, and acknowledge to the server that they are done.
  • the server sends an overall done message to the nodes so that nodes can move on. Based on this scenario, the total number of messages that go through the server can be determined by using:
  • N is the number of nodes/cameras;
  • E_i is the total number of events that will trigger requests in the view of camera C_i;
  • T_i is the total number of trackers in the view of C_i; in this formula, without loss of generality, it is assumed that, for camera C_i, T_i remains the same during the video. Whereas, for SCCS, this number is equal to:
  • Fig. 1 shows that the server-based system does not scale well.
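As a rough illustration of the two communication patterns compared above, the following Python sketch counts messages under the flow described in this section. The exact bookkeeping (for example, how many cameras receive each request, and that per-tracker status messages are sent once per synchronization point) is an assumption for illustration, since the original expressions are not reproduced here.

```python
def server_based_messages(n_nodes, n_sync_points, trackers_per_node, n_request_events):
    """Server-based flow sketched above: at every synchronization point each node sends
    one message per tracker to the server, acknowledges with a done message, and receives
    an overall done message; every request additionally triggers a reply from the server."""
    per_sync = n_nodes * trackers_per_node + n_nodes + n_nodes
    return n_sync_points * per_sync + n_request_events

def sccs_messages(n_nodes, n_sync_points, n_request_events, peers_per_request=1):
    """SCCS: request/reply traffic only for request events (sent to the cameras that can
    answer, located via the FOV lines), plus a ring done pass and an overall done
    broadcast at each synchronization point."""
    event_traffic = 2 * peers_per_request * n_request_events
    sync_traffic = n_sync_points * (n_nodes + (n_nodes - 1))
    return event_traffic + sync_traffic

# Example: message totals for 4 cameras, 300 synchronization points, 50 request events.
for trackers in (1, 5, 10, 20):
    print(trackers,
          server_based_messages(4, 300, trackers, 50),
          sccs_messages(4, 300, 50))
```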
  • FOV lines have been introduced by Khan and Shah [11] to solve the consistent labeling problem. They show that when the FOV lines are recovered, the consistent labeling problem can be solved without requiring full camera calibration.
  • the 3D FOV lines of camera C_i are denoted by L_i [11].
  • the image of the camera view whose FOV lines will be recovered on the other view is called the field image.
  • the system finds two points on one of the boundaries of the field image, so that each of them is in general position with the four input points. Then it checks with the user that these boundary points are coplanar with the four input points.
  • the two points found on the image boundary are denoted by (x_n, y_n), and their corresponding locations in the other view by (x'_n, y'_n), where n ∈ {1, 2}; the latter are computed by using the projective invariants, with each point expressed in homogeneous coordinates whose third entry is equal to 1.
  • Figures 3 to 5 show the recovered FOV lines for different video sequences and camera setups. Although there was no traffic along the right boundary of Fig. 3b, the FOV line corresponding to it is successfully recovered as shown in Fig. 3a.
  • FIG. 3. (a) -(b) and (c) -(d) show the recovered FOV lines for two different camera setups. The shaded regions are outside the FOV of the other camera.
  • Fig. 4. (a), (b) and (c) show the recovered FOV lines. The shaded regions are outside the FOV of the other cameras.
  • Fig. 5. (a),(b) and (c) show the recovered FOV lines. The shaded regions are outside the FOV of the other cameras.
  • |M_abc|, where a, b, c ∈ {1, ..., 5}, denotes the determinant of the matrix whose columns are the homogeneous coordinates of points a, b and c.
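The sketch below illustrates the determinant notation |M_abc| for homogeneous points, together with one standard projective invariant of five coplanar points built from such determinants [30], [31]. It is a minimal example of the general technique; the exact invariant expressions used for FOV-line recovery in [18] may differ.

```python
import numpy as np

def M(points, a, b, c):
    """Determinant |M_abc| of the 3x3 matrix whose columns are the homogeneous
    coordinates (x, y, 1) of points a, b, c (1-indexed, as in the text)."""
    cols = [np.array([points[i - 1][0], points[i - 1][1], 1.0]) for i in (a, b, c)]
    return np.linalg.det(np.column_stack(cols))

def five_point_invariant(points):
    """One standard projective invariant of five coplanar points, written as a ratio
    of |M_abc| determinants; the unknown scale factors of a homography cancel out."""
    return (M(points, 1, 2, 3) * M(points, 1, 4, 5)) / \
           (M(points, 1, 2, 4) * M(points, 1, 3, 5))

# The invariant computed from five coplanar points in one camera view equals the
# invariant of the corresponding points in another view (up to numerical error).
```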
  • foreground objects are segmented from the background in each camera view by using the BGS algorithm presented by Stauffer and Grimson [23], which employs adaptive background mixture models to model the background and to segment the foreground objects. Then, connected component analysis is performed, which results in foreground blobs.
  • when a new foreground blob is detected within the camera view, a new tracker is created, and a mask for the tracker is built, where the foreground pixels from this blob are set to 1 and the background pixels to 0.
  • the box surrounding the foreground pixels of the mask is called the bounding box.
  • the color histogram of the blob is learned from the input image, and is saved as the model histogram of the tracker.
  • the trackers are matched to detected foreground blobs by using a computationally efficient blob tracker, which uses a matching criterion based on bounding box intersection and the Bhattacharyya coefficient ρ(y) [22], defined by ρ(y) = ∫ sqrt( p_z(y) q_z ) dz,
  • where z is the feature representing the color of the target model and is assumed to have a density function q_z, while p_z(y) represents the color distribution of the candidate foreground blob centered at location y.
  • the Bhattacharyya coefficient is derived from the sample data by using the discrete form ρ(y) = Σ_{u=1}^{m} sqrt( p_u(y) q_u ), where the sum runs over the m histogram bins (equation (6)).
  • the Bhattacharyya coefficient between the model histogram of the tracker and the color histogram of the foreground blob is calculated by using (6).
  • the tracker is assigned to the foreground blob which results in the highest Bhattacharyya coefficient, and the mask, and thus the bounding box, of the tracker are updated.
  • the Bhattacharyya coefficient with which the tracker is matched to its object is called the similarity coefficient. If the similarity coefficient is greater than a predefined distribution update threshold, the model histogram of the tracker is updated to be the color histogram of the foreground blob to which it is matched.
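As an illustration of the matching step just described, the following minimal Python sketch computes the discrete Bhattacharyya coefficient between two normalized histograms and assigns a tracker to the intersecting blob with the highest coefficient. The histogram binning and the value of the distribution update threshold are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def bhattacharyya(p, q):
    """Discrete Bhattacharyya coefficient between two normalized color histograms."""
    return float(np.sum(np.sqrt(p * q)))

def boxes_intersect(a, b):
    """Axis-aligned bounding boxes given as (x0, y0, x1, y1)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def match_tracker(tracker_box, model_hist, blobs, update_threshold=0.8):
    """Assign the tracker to the intersecting blob with the highest Bhattacharyya
    coefficient (the similarity coefficient); update the model histogram when the
    similarity exceeds the update threshold. Each blob is a dict with a 'box' and a
    normalized 'hist'. The threshold value here is illustrative."""
    best, best_coeff = None, -1.0
    for blob in blobs:
        if not boxes_intersect(tracker_box, blob["box"]):
            continue
        coeff = bhattacharyya(model_hist, blob["hist"])
        if coeff > best_coeff:
            best, best_coeff = blob, coeff
    if best is not None and best_coeff > update_threshold:
        model_hist = best["hist"].copy()   # similarity coefficient above the update threshold
    return best, best_coeff, model_hist
```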
  • the coefficients B_{Ti,Oj} are calculated, where i, j ∈ {1, 2}, and B_{Ti,Oj} denotes the Bhattacharyya coefficient between the model histogram of tracker T_i and the color histogram of object O_j.
  • T_1 and T_2 can still be matched to O_2, for instance, and stay in the merge state.
  • S_{T1} denotes the similarity coefficient of T_1.
  • this can happen, for example, as a result of O_1 and O_2 having similar colors from the outset.
  • when the blobs split, O_1 is compared against the trackers which are in the merge state and intersect with the bounding box of O_1. That is, it is compared against T_1 and T_2, and B_{T1,O1} and B_{T2,O1} are calculated. Then, O_1 is assigned to the tracker that yields the higher Bhattacharyya coefficient.
  • this algorithm provides coarser object localization and decreases the message traffic by not sending a request message each time a merge or split occurs. If the exact location of an object in the blob formed after the merge is required, we propose another algorithm that can be used at the expense of more message traffic: When a tracker is in the merge state, other nodes that can see its most recent location can be determined as described in III-A.2, and a request message can be sent to these nodes to retrieve the location of the tracker in the merge state. If the current location of the tracker is not visible by any of the other cameras, then the mean-shift tracking [22] can be activated. The mean-shift tracking algorithm aims to minimize the distance between a given target distribution and the candidate distribution in the current frame.
  • the similarity between two distributions is expressed as a metric based on the Bhattacharyya coefficient. Given the distribution of the target model and the estimated location of the target in the previous frame, an optimization is performed to find a new location that increases the value of the Bhattacharyya coefficient.
  • the SCCS protocol utilizes point-to-point communication, as opposed to some previous approaches that require a central message processing server.
  • Our approach offers a latency advantage, and the nodes do not need to send the state of the trackers to a server at every single frame. This decreases the number of messages considerably as will be discussed in Section IV-D.1. Moreover, this design is more scalable, since for a central server implementation, the server quickly becomes overloaded with the aggregate sum of messages and requests from an increased number of nodes.
  • a communication protocol is introduced which can handle communication and processing delays and hence maintain consistent data transfer across multiple cameras. This protocol is designed by determining the answers to the following questions: (a) when to communicate, i.e. determining the events which will require the transfer of data from other cameras (these events will henceforth be referred to as request events); and (b) with whom to communicate, i.e. determining the cameras to which requests should be sent.
  • the protocol is designed so that the number of messages that are sent between the nodes is decreased, and the process synchronization issue is addressed.
  • FIG. 8 illustrates the concepts discussed in this section. It should be noted that, at some point during execution, each camera node can act as the requesting or replying node.
  • the implementation of the proposed system consists of a parallel computing cluster, with communication between the cameras performed by the Message Passing Interface (MPI) library [32].
  • MPI Message Passing Interface
  • the use of MPI is illustrative rather than mandatory, since other libraries provide similarly well-defined communication operations, including blocking and non-blocking send and receive, broadcast, and gathering.
  • MPI is also well-defined for inter-and intra-group communication and can be utilized to manage large camera groups. We take advantage of the proven usefulness of this library, and treat it as a transparent interface between the camera nodes. With reference to Fig. 8, communication between two cameras is shown.
  • a camera will need information from the other cameras when: a) a new object appears in its FOV, or b) a tracker cannot be matched to its target object. These events are called request events, and are referred to as new_label and lost_label events, respectively. If one of these events occurs within a camera's FOV, the processor processing that camera needs to communicate with the other processors.
  • the camera will issue a new_label request to those cameras to receive the existing label of this object, and to maintain consistent labeling.
  • Camera C could also need information from another node when a tracker in C cannot be matched to its target object, and this is called the lost_label case. This may occur, for instance, if the target object is occluded in the scene or cannot be detected as a foreground object at some frame due to the failure of the BGS algorithm. In this case, a lost_label request will be sent to the appropriate node to retrieve and update the object location.
  • Another scenario where communication between the cameras may become necessary is when trackers are merged and the location of each merged object is required.
  • the proposed protocol is designed such that rather than sending requests to every single node in the system, requests are sent to the processors who can provide the answers for them. This is achieved by employing the FOV lines.
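The following sketch illustrates how recovered FOV lines can be used to decide which cameras should receive a request: a camera is a candidate only if the point in question lies on the visible side of all of that camera's recovered FOV lines. Representing each line as coefficients (a, b, c) with a fixed sign convention for the visible side is a bookkeeping assumption, not spelled out in the text.

```python
def visible_side(line, x, y):
    """line = (a, b, c) for a*x + b*y + c = 0, with the sign convention that the
    visible side of the other camera's FOV gives a non-negative value (assumed)."""
    a, b, c = line
    return a * x + b * y + c >= 0

def cameras_seeing(point, fov_lines_by_camera):
    """Return ids of cameras whose recovered FOV lines all place `point` on the
    visible side; requests are sent only to these cameras."""
    x, y = point
    return [cam_id for cam_id, lines in fov_lines_by_camera.items()
            if all(visible_side(line, x, y) for line in lines)]
```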
  • the presented protocol sends minimal amounts of data between different nodes. Messages consist of 256-byte packets, with character command tags, integers and floats for track labels and coordinates, respectively, and integers for camera id numbers.
  • Messages that are sent between the processors, processing the camera inputs are classified into four categories: 1) New label request messages, 2) Lost label request messages, 3) New label reply messages, and 4) Lost label reply messages. As stated, all these messages consist of 256-byte packets.
  • 1) New label request case: If a foreground object viewed by camera C_i cannot be matched to any existing tracker, a new tracker is created for it, all the cameras that can see this object are found by using the FOV lines, and a list of cameras to communicate with is formed. A request message is created to be sent to the cameras in this list. The format of this message is: Cmd_tag Target_id Curr_id Side x y Curr_label.
  • Cmd_tag is a string that holds NEW_LABEL_REQ, indicating that this is a request message for the new_label case.
  • Target_id and Curr_id are integers.
  • Target_id is the id of the node to which this message is addressed
  • Curr_id is the id of the node that processes the input of the camera which needs the label information.
  • Curr_id is i in this case.
  • These id numbers are assigned to the nodes by MPI at the beginning of the execution.
  • Side is another string which holds information about the side of the image from which the object entered the scene; thus, it can be right, left, top, bottom, or middle.
  • the next two entities in the message, x and y, are doubles representing the image coordinates of the new object in the view of the requesting camera.
  • Curr_label is an integer holding the temporary label given to this object by C_i. The importance and benefit of using this temporary label will be clarified in Section IV-D and the sections that follow.
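A minimal sketch of packing the new_label request fields listed above into a fixed 256-byte packet follows. The packet size, field names and field types come from the text; the exact byte layout and field widths are assumptions for illustration.

```python
import struct

PACKET_SIZE = 256
# Assumed layout: 16-byte command tag, two ints (Target_id, Curr_id),
# an 8-byte Side string, two doubles (x, y), and one int (Curr_label).
NEW_LABEL_REQ_FMT = "<16sii8sddi"

def pack_new_label_req(target_id, curr_id, side, x, y, curr_label):
    payload = struct.pack(NEW_LABEL_REQ_FMT, b"NEW_LABEL_REQ", target_id, curr_id,
                          side.encode(), x, y, curr_label)
    return payload.ljust(PACKET_SIZE, b"\0")   # pad to the fixed 256-byte packet

def unpack_new_label_req(packet):
    size = struct.calcsize(NEW_LABEL_REQ_FMT)
    cmd, target_id, curr_id, side, x, y, curr_label = struct.unpack(NEW_LABEL_REQ_FMT, packet[:size])
    return (cmd.rstrip(b"\0").decode(), target_id, curr_id,
            side.rstrip(b"\0").decode(), x, y, curr_label)
```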
  • 2) Lost label request case: When a tracker in camera C_i's view cannot be matched to its target object, this is called the lost_label case. For every tracker that cannot find its match in the current frame, the cameras that can see the most recent location of its object are determined by using the FOV lines. Then, a lost_label request message is created to be sent to the appropriate nodes to retrieve the updated object location.
  • the format of a lost_label request message is: Cmd_tag Target_id Curr_id Lost_label x y.
  • Cmd_tag is a string that holds LOST_LABEL_REQ, indicating that this is a request message for the lost_label case.
  • Target_id and Curr_id are integers.
  • Target_id is the id of the node to which this message is addressed, and Curr_id is the id of the node that processes the input of the camera that needs the location information.
  • Lost_label is another integer which holds the label of the tracker that could not be matched to its target object.
  • x and y are doubles which are the coordinates of the most recent location of the lost object in the requesting camera's view.
  • 3) New label reply case: If node j receives a message from node i whose Cmd_tag holds NEW_LABEL_REQ, node j sends back a new_label reply message. The format of this message is: Cmd_tag Temp_label Answer_label Min_pnt_dist, where Cmd_tag is a string that holds NEW_LABEL_REP, indicating that this is a reply message to a new_label request.
  • Temp_label and Answer_label are integers.
  • Temp_label is the temporary label given to a new object by the requesting camera, and Answer_label is the existing label of the corresponding tracker in the replying camera's view.
  • Min_pnt_dist is the distance between the corresponding location of the sent point and the current location of the object.
  • the NEW_LABEL_REQ request message has information about the requester id, the entry side, and the object coordinates in the requester's view.
  • if the received Side information is middle, then it means that this object appeared in the middle of the scene, for instance from inside of a building.
  • in this case, the FOV lines cannot be used in the usual way to determine where the object entered the scene.
  • the proposed protocol also handles the case where the labels received from different cameras do not match.
  • the label is chosen so that Min_pnt_dist is the smallest among all the reply messages.
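A tiny sketch of the conflict-resolution rule stated above: when replies disagree, keep the label whose reported Min_pnt_dist is the smallest.

```python
def resolve_label(replies):
    """replies: list of (answer_label, min_pnt_dist) pairs taken from NEW_LABEL_REP
    messages. Return the label with the smallest reported point distance."""
    if not replies:
        return None
    return min(replies, key=lambda r: r[1])[0]

# Example: camera 2 reports label 51 at distance 3.1, camera 3 reports label 47 at 9.8.
assert resolve_label([(51, 3.1), (47, 9.8)]) == 51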
  • 4) Lost label reply case If node j receives a message from node i, and the Cmd_tag of this message holds LOST_LABEL_REQ, then node j needs to send back a lost_label reply message to node i.
  • the format of this message is: Cmd_tag Lost_label x_reply y_reply.
  • Cmd_tag is a string that holds LOST_LABEL_REP, indicating that this is a reply message to a lost_label request.
  • Lost_label is an integer which is the label of the tracker in C_i that could not be matched to its target object.
  • 4) Lost label reply case (continued): when node j receives a lost_label request, it sends back the coordinates of the current location of the tracker with the label Lost_label as x_reply and y_reply. These coordinates are floats.
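The sketch below shows how a replying node might dispatch on Cmd_tag and build the two reply messages among the four categories listed earlier. The find_closest_track helper is an illustrative stand-in: in the actual system, the answer to a new_label request is determined by using the FOV lines and the received Side information rather than a plain nearest-track search.

```python
import math

def find_closest_track(trackers, x, y):
    """Illustrative stand-in: pick the tracker whose current location is closest to the
    corresponding point sent by the requester, and report that distance (Min_pnt_dist)."""
    best_label, best_dist = None, float("inf")
    for label, trk in trackers.items():
        d = math.hypot(trk["x"] - x, trk["y"] - y)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

def handle_request(msg, trackers):
    """Dispatch an incoming request on its Cmd_tag and build the matching reply."""
    if msg["Cmd_tag"] == "NEW_LABEL_REQ":
        label, dist = find_closest_track(trackers, msg["x"], msg["y"])
        return {"Cmd_tag": "NEW_LABEL_REP", "Temp_label": msg["Curr_label"],
                "Answer_label": label, "Min_pnt_dist": dist}
    if msg["Cmd_tag"] == "LOST_LABEL_REQ":
        trk = trackers[msg["Lost_label"]]
        return {"Cmd_tag": "LOST_LABEL_REP", "Lost_label": msg["Lost_label"],
                "x_reply": trk["x"], "y_reply": trk["y"]}
    return None
```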
  • the SCCS protocol utilizes non-blocking send and receive primitives for message communication. This effectively allows a camera node to make its requests, note the requests it has made, and then continue its processing, with the expectation that the requestee will issue a reply message at some point later in the execution. This is in contrast to blocking communication, where execution is blocked until a reply is received for a request.
  • with blocking communication, the potential for parallel processing is reduced, as a camera node may be stuck waiting for its reply, and the processing program will likely require stochastic checks for messages; it is very difficult for each camera to predict when and how many messages will be received from other cameras. In the non-blocking case, checks for messages can take place in a deterministic fashion. Another possible problem with blocking communication is the increased potential for deadlocks.
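A minimal sketch of the non-blocking send/probe/receive pattern follows, written with mpi4py as a stand-in for the MPI C library referenced in the text. The message tags and loop structure are assumptions; handle_request refers to the dispatcher sketched above.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
REQ_TAG, REP_TAG = 1, 2          # illustrative tag values

def send_requests(pending_requests):
    """Fire off saved requests without blocking; keep the send handles so the node
    can continue its local tracking and check for replies deterministically later."""
    handles = []
    for dest, msg in pending_requests:
        handles.append(comm.isend(msg, dest=dest, tag=REQ_TAG))
    return handles

def poll_messages(trackers):
    """Deterministic check: answer any incoming requests and collect any replies
    that have arrived, then return to local tracking."""
    replies = []
    while comm.iprobe(source=MPI.ANY_SOURCE, tag=REQ_TAG):
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=REQ_TAG)
        reply = handle_request(msg, trackers)          # dispatcher from the sketch above
        comm.isend(reply, dest=msg["Curr_id"], tag=REP_TAG)
    while comm.iprobe(source=MPI.ANY_SOURCE, tag=REP_TAG):
        replies.append(comm.recv(source=MPI.ANY_SOURCE, tag=REP_TAG))
    return replies
```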
  • when camera C_j receives this message, it could be on a frame behind camera C_i, depending on the processing and communication delays.
  • our protocol provides synchronization points, where all nodes are required to wait until every node has reached the same point. These points are determined based on a synchronization rate which will henceforth be called synch rate. Synchronization points occur every synch rate frames.
  • each camera focuses on performing its local tracking tasks, saving the requests that it will make at the next synchronization point.
  • a new label request message is created for this object, and the object is assigned a temporary label. Since a camera node does not send the saved requests, and thus cannot receive a reply until the next synchronization point, the new object is tracked with this temporary label until receiving a reply back. Once a reply is received, the label of this object is updated.
  • typical units of synchronization rate are time-stamp information for live camera input, or a specific frame number for a recorded video. Henceforth, to be consistent, we refer to the number of video frames between synchronization points when we use the terms synchronization rate or synchronization interval.
  • Fig. 9 shows a diagram of the system synchronization mechanism, illustrating the camera states at the synchronization point.
  • in the first state, the camera finishes its local tracking, and the processor sends out all of its saved requests. Then, the camera enters the second state and begins to probe for a done message from the previous camera. If none has arrived, the node probes for incoming requests from the other nodes and replies to them while waiting for the replies to its own requests.
  • when the done message is received from the previous camera, the camera enters the third state. When all of its own requests are fulfilled, it sends out a done message to the next camera.
  • in this state, each camera node still processes requests from other cameras, and keeps probing for the overall done message. Once it is received, a new cycle starts and the node returns to the first state.
  • the done messages in our protocol are sent by using a ring type of message routing to reduce the number of messages.
  • each node receives a done message only from its previous neighbor node and passes that message to the next adjacent node when it finishes its own local operations and has received replies to all its requests for that cycle.
  • all the cameras need to make sure that all the others have already finished their tasks before starting the next interval. Thus, a single pass of the done message will be from C_0 to C_1, C_1 to C_2, and so on.
  • with a single pass alone, C_1 will not know whether C_{N-2} has finished its task, since it will only receive done messages from C_0.
  • a second ring pass or a broadcast of an overall done message will be needed.
  • the overall done message is broadcasted from the first camera in the ring since the message is the same for every camera.
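The following sketch illustrates the ring routing of done messages followed by the overall done broadcast from the first camera, again using mpi4py with illustrative tag values. Blocking calls are used here for brevity; in the protocol described above a node keeps probing for and answering requests while it waits.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
DONE_TAG = 3                                     # illustrative tag value

def synchronization_point(serve_pending_messages):
    """Ring pass of done messages plus the overall done broadcast from camera 0."""
    prev_node, next_node = (rank - 1) % size, (rank + 1) % size
    if rank != 0:
        comm.recv(source=prev_node, tag=DONE_TAG)   # previous neighbour has finished
    serve_pending_messages()                        # finish own replies/requests for this cycle
    comm.send(True, dest=next_node, tag=DONE_TAG)   # forward the done message along the ring
    if rank == 0:
        comm.recv(source=size - 1, tag=DONE_TAG)    # the single ring pass is complete
    comm.bcast(None, root=0)                        # overall done broadcast; new cycle starts
```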
  • the synchronization rate can be set by the end user depending on the system specification. Different synchronization rates are desirable in different system setups. For instance, for densely overlapped cameras, it is necessary to have a shorter synchronization interval because an object can be seen by several cameras at the same time, and each camera may need to communicate with other cameras frequently. On the other hand, for loosely overlapped cameras, the synchronization interval can be longer since the probability for communication is lower and as a result, excess communication due to superfluous synchronization points is eliminated.
  • each of the N nodes will send N - 1 request messages to the other nodes.
  • Figures 10(a) and 10(c) show the total number of messages sent in the server-based scenario and in SCCS.
  • SPIN is a powerful software tool used for the formal verification of distributed software systems. It can analyze the logical consistency of concurrent systems, specifically of data communication protocols.
  • a system is described in a modeling language called Promela (Process Meta Language). Communication via message channels can be defined to be synchronous or asynchronous.
  • SPIN can either perform random simulations of the system's execution or it can perform exhaustive verification of correctness properties [29]. It goes through all possible system states, enabling designers to discover potential flaws while developing protocols. This tool was used to analyze and verify the communication protocol used in SCCS and described in Section IV.
  • scenario (b) was modeled so that scenario (c) can be compared against it.
  • Fig. 11 shows the number of states reached for the three scenarios.
  • FIG. 12 shows the two different camera setups and two types of environment states used for the indoor experiments.
  • We formed different environment states by placing or removing occluding structures, for instance a large box in our case, into the environment.
  • as shown in Figures 12(a1) and 12(b1), we placed three cameras in two different configurations in a room.
  • Figures 12(a1) versus 12(a2) and 12(b1) versus 12(b2) illustrate the two different environment states, i.e. scenes with and without an occluding box.
  • as seen in Figures 12(a3), 12(b3) and 12(b4), three remotely controlled cars/trucks have been used to experiment with various occlusion, merge and split cases. We also captured different video sequences by operating one, two or three cars at a time.
  • FIG. 13 shows the speedup attained using our system relative to a uniprocessor implementation for two cases: processing input from two cameras and from three cameras.
  • processing times are normalized with respect to the uniprocessor case processing inputs from three cameras which takes the longest processing time.
  • the uniprocessor approach does not scale very well, as processing the input from three cameras takes substantially longer than processing the input from two.
  • Fig. 12(a1) and (b1) show the locations of the cameras for the first and second camera setups, respectively; (a2) and (b2) show the environment states for the lost label experiments. The photographs of the first and second camera setups are displayed in the remaining panels of Fig. 12.
  • with reference to Fig. 13, there is shown a comparison of the processing times required for processing inputs from two and three cameras by a uniprocessor system and by SCCS.
  • the time that elapses between a request event and the next synchronization point will be referred to as the waiting time. For instance, if the synch rate is 10, then the synchronization points will be located at frames 1, 11, 21, ..., 281, 291, 301, and so on. If a new object appears in a camera's FOV at frame 282, then the waiting time will be 9 frames, as the next synchronization point will be at frame 291.
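A tiny sketch of this bookkeeping, matching the worked example above (synchronization points at frames 1, 11, 21, ... for a synch rate of 10):

```python
def next_sync_point(frame, synch_rate):
    """Synchronization points fall on frames 1, 1 + synch_rate, 1 + 2*synch_rate, ...;
    return the first such point at or after `frame`."""
    k = (frame - 1 + synch_rate - 1) // synch_rate
    return 1 + k * synch_rate

def waiting_time(event_frame, synch_rate):
    """Number of frames a request saved at event_frame waits before it can be sent."""
    return next_sync_point(event_frame, synch_rate) - event_frame

assert waiting_time(282, 10) == 9      # matches the example: next point is frame 291
```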
  • Figure 14 shows the average waiting time for experiments performed with different video sequences and different synch rate values. As can be seen, even when the synch rate is 60 frames, the average waiting time remains small.
  • Fig. 14 Waiting times for different videos and environment setups; (a), (b) and (c) show the waiting times for the videos captured with indoor setup 1, indoor setup 2 and for the PETS video, respectively.
  • #correct_updates represents the number of times a new_label or lost_label request is correctly fulfilled and the corresponding tracker is correctly updated (its label or its location).
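A sketch of the accuracy measure implied above; using the total number of requests of that type as the denominator is an assumption, since the full expression is not reproduced here.

```python
def accuracy(correct_updates, total_requests):
    """Percentage of new_label / lost_label requests that were correctly fulfilled,
    i.e. the corresponding tracker's label or location was correctly updated."""
    return 100.0 * correct_updates / total_requests if total_requests else 100.0
```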
  • the determined accuracy values are shown in Fig. 15. As can be seen, for a synch_rate of 1, the system achieves a minimum of 94.2% accuracy for the new_label requests/updates on both indoor and outdoor videos. For the lost_label requests, a minimum of 90% accuracy is achieved for both indoor and outdoor videos with a synch_rate of 1. Further, even when the processors are allowed to operate up to 2 seconds without communication, a minimum of 90% accuracy is still attained for new_label requests with the indoor sequences, while 90.9% accuracy is obtained for the outdoor sequence. Likewise, when the processors are allowed to operate up to 2 seconds without communication, an accuracy of 80% or higher is attained for lost_label requests with the indoor sequences, while 60% accuracy is obtained for the outdoor sequence.
  • Figures 16 and 17 show examples of receiving the label of a new tracker from the other nodes, and updating the label of the tracker in the current view accordingly.
  • in this example the synch_rate is 10.
  • Fig. 15 shows the average accuracy (%) of the data transfer for the indoor [(a), (b)] and outdoor (c) sequences.
  • as can be seen in Fig. 16(b1), when the car first appears in the view of camera 2, it is given a temporary label of 52, and is tracked with this label until the next synchronization point. Then, the correct label is received from the other nodes in the system, and the label of the tracker in the view of camera 2 is updated to 51, as seen in Fig. 16(b3).
  • FIG. 17 is another example for a synch_rate of 60 for the second camera setup.
  • the label of the tracker created at frame 1468 and given a temporary label of 56, is updated successfully at frame 1501 from the other nodes in the system.
  • Figures 18, 19 and 20 show examples of updating the location of a tracker, whose target object is lost, from the other nodes.
  • the synch_rate is 5, and the views of the three cameras are as seen in Figures 16(a1), 16(b1) and 16(c1).
  • in Figures 18(a1) through (a10), the location of the car behind the box is updated every 5 frames from the other nodes, until it reappears.
  • FIG. 19 is another example for a synch_rate of 1 for the second camera setup.
  • the location of the tracker is updated from the other nodes at every frame.
  • Figures 19(a1) through (a5) show some example images.
  • Fig. 20 shows an example, where the location of people occluded in an outdoor sequence is updated.
  • Figures 21 and 22 show examples of SCCS dealing with the merge/split cases on a single camera view for indoor and outdoor videos, respectively. The accuracy of giving the correct labels to objects after they split is displayed in Fig. 23.
  • Figure 24 shows the number of new_label and lost_label requests for different synchronization rates for the video captured by the first camera setup with the box placed in the environment. As expected, with a synch_rate of 1, a lost_label request is sent at each frame as long as the car is occluded behind the box. Thus, the number of lost_label requests is highest for the synch_rate of 1, and decreases with increasing synch_rate.
  • VII. CONCLUSIONS
  • we presented SCCS, the Scalable Clustered Camera System, which is a peer-to-peer multi-camera system for multiple object tracking.
  • each camera is connected to a CPU, and individual nodes communicate with each other directly, eliminating the need for a centralized server. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system keeps its own tracks for each target object, which provides fault tolerance.
  • a fast and robust tracking algorithm was proposed to perform tracking on each camera view, while maintaining consistent labeling.
  • Peer-to-peer systems require sophisticated communication protocols that can handle communication and processing delays. These protocols need to be evaluated and verified against potential deadlocks, and their correctness properties need to be checked.
  • the protocol presented in this paper incorporates variable synchronization capabilities.

Abstract

Embodiments of the invention are directed to a Scalable Clustered Camera System (SCCS), which is a peer-to-peer multi-camera system for multiple object tracking. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system performs its own tracking, keeping its own tracks for each target object, which provides fault tolerance. A fast and robust tracking algorithm is described to perform tracking on each camera view, while maintaining consistent labeling. In addition, a novel communication protocol is introduced, which can handle the problems caused by communication delays and different processor loads and speeds, and incorporates variable synchronization capabilities, so as to allow flexibility with accuracy tradeoffs.

Description

SCALABLE CLUSTERED CAMERA SYSTEM AND METHOD FOR MULTIPLE
OBJECT TRACKING
CROSS REFERENCE TO RELATED APPLICATION
[001] This application claims the benefit of U.S. Provisional Patent Application No.
60/814,446, the contents of which are hereby incorporated by reference herein.
FIELD OF THE INVENTION
[002] The invention relates to the tracking of objects and, more particularly, the tracking of objects with multiple cameras.
BACKGROUND OF THE INVENTION
[003] Reliable and efficient tracking of objects by multiple cameras is an important and challenging problem which finds wide-ranging application areas. Most existing systems assume that data from multiple cameras is processed on a single processing unit or by a centralized server. However, these approaches are neither scalable nor fault-tolerant. We propose multi-camera algorithms that operate on peer-to-peer computing systems where a different processing unit is used to process each camera, and the processing units communicate with each other directly, eliminating the need for a central server. Peer-to-peer vision systems require co-design of image processing and distributed computing algorithms as well as sophisticated communication protocols, which should be carefully designed and verified to avoid deadlocks and other problems.
SUMMARY OF THE INVENTION
[004] Embodiments of the invention are directed to a Scalable Clustered Camera
System (SCCS), which is a peer-to-peer multi-camera system for multiple object tracking. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system performs its own tracking, keeping its own tracks for each target object, which provides fault tolerance. A fast and robust tracking algorithm is proposed to perform tracking on each camera view, while maintaining consistent labeling. In addition, a novel communication protocol is introduced, which can handle the problems caused by communication delays and different processor loads and speeds, and incorporates variable synchronization capabilities, so as to allow flexibility with accuracy tradeoffs. This protocol was exhaustively verified by using the SPIN verification tool. The success of the proposed system is demonstrated on different scenarios captured by multiple cameras placed in different setups. Also simulation and verification results for the protocol are presented.
BRIEF DESCRIPTION OF THE DRAWINGS
[005] Embodiments of the invention will be more readily understood from the detailed description of exemplary embodiments presented below considered in conjunction with the attached drawings, of which:
[006] FIG. 1 shows the total number of messages that need to be sent, for
T = 1, 5, 10, 20, with the server-based scenario and SCCS, respectively; (b) is the rotated version of (a);
[007] FIG. 2 shows corresponding locations of recovered FOV lines;
[008] FIGs. 3-5 show recovered FOV lines for different video sequences and camera setups;
[009] FIGs. 6-7 show examples of successfully resolving a merge;
[0010] FIG. 8 shows communication between two cameras;
[0011] FIG. 9 shows camera states at the synchronization point;
[0012] FIG. 10 shows message totals needed for different scenarios;
[0013] FIG. 11 shows the different number of states reached for verification;
[0014] FIG. 12 shows different camera locations;
[0015] FIG. 13 shows processing times;
[0016] FIG. 14 shows waiting times;
[0017] FIG. 15 shows average accuracy;
[0018] FIGs. 16-17 show exemplary camera setups;
[0019] FIGs. 18-20 show lost label examples;
[0020] FIGs. 21-22 show examples of resolving merge/split cases;
[0021] FIG. 23 shows accuracy results; and
[0022] FIG. 24 shows requests for synch rates.
[0023] It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.
DETAILED DESCRIPTION OF THE INVENTION
[0024] The following references, some of which are referred to in this application, are hereby incorporated by reference herein.
[0025]
[1] N. Atsushi, K. Hirokazu, H. Shinsaku, and I. Seiji, "Tracking multiple people using distributed vision systems," Proc. of IEEE Int'l Conf. on Robotics and Automation, pp. 2974-2981, 2002.
[2] D. Beymer, P. McLauchlan, B. Coifman and J. Malik, "A real-time computer vision system for measuring traffic parameters," Proc. of IEEE CVPR, pp. 495-501, 1997.
[3] M. Bramberger, A. Doblander, A. Maier, B. Rinner and H. Schwabach, "Distributed embedded smart cameras for surveillance applications," IEEE Computer, vol. 39, no. 2, pp. 68-75, Feb. 2006.
[4] Q. Cai and J. K. Aggarwal, "Tracking human motion in structured environments using a distributed camera system," IEEE Trans. on PAMI, vol. 21, no. 11, pp. 1241-1247, Nov. 1999.
[5] T.-H. Chang and S. Gong, "Tracking multiple people with a multi-camera system," Proc. of IEEE Workshop on Multi-Object Tracking, pp. 19-26, 2001.
[6] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, "A system for video surveillance and monitoring: VSAM final report," Technical report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000.
[7] T. Ellis, "Multi-camera video surveillance," Proc. of Int'l Carnahan Conf. on Security Technology, pp. 228-233, 2002.
[8] O. Javed, S. Khan, Z. Rasheed and M. Shah, "Camera handoff: tracking in multiple uncalibrated stationary cameras," Proc. of IEEE Workshop on Human Motion, pp. 113-118, Dec. 2000.
[9] P. H. Kelly, A. Katkere, D. Y. Kuramura, S. Moezzi, S. Chatterjee and R. Jain, "An architecture for multiple perspective interactive video," Proc. of ACM Conf. on Multimedia, pp. 201-212, 1995.
[10] V. Kettnaker and R. Zabih, "Bayesian multi-camera surveillance," Proc. of IEEE Conf. on CVPR, pp. 253-259, 1999.
[11] S. Khan and M. Shah, "Consistent labeling of tracked objects in multiple cameras with overlapping fields of view," IEEE Trans. on PAMI, pp. 1355-1360, Oct. 2003.
[12] L. Lee, R. Romano and G. Stein, "Monitoring activities from multiple video streams: Establishing a common coordinate frame," IEEE Trans. on PAMI, pp. 758-768, Aug. 2000.
[13] K. Nguyen, G. Yeung, S. Ghiasi, and M. Sarrafzadeh, "A general framework for tracking objects in a multi-camera environment," Proc. of Int'l Workshop on Digital and Computational Video, pp. 200-204, 2002.
[14] H. Pasula, S. Russell, M. Ostland and Y. Ritov, "Tracking many objects with many sensors," Proc. of IJCAI, 1999.
[15] A. Utsumi, H. Mori, J. Ohya and M. Yachida, "Multiple-camera-based human tracking using non-synchronous observations," Proc. of Asian Conf. on Computer Vision, pp. 1034-1039, 2000.
[16] S. Velipasalar and W. Wolf, "Multiple object tracking and occlusion handling by information exchange between uncalibrated cameras," Proc. of IEEE ICIP, pp. 418-421, Sept. 2005.
[17] S. Velipasalar, J. Schlessman, C.-Y. Chen, W. Wolf, and J. P. Singh, "SCCS: a scalable clustered camera system for multiple object tracking communicating via message passing interface," IEEE ICME, 2006.
[18] S. Velipasalar and W. Wolf, "Recovering field of view lines by using projective invariants," Proc. of IEEE Int'l Conf. on Image Processing, pp. 3060-3072, Oct. 2004.
[19] S. Funiak, C. Guestrin, M. Paskin and R. Sukthankar, "Distributed localization of networked cameras," Proc. of IPSN, pp. 34-42, 2006.
[20] B. Ping Lai Lo, J. Sun and S. A. Velastin, "Fusing visual and audio information in a distributed intelligent surveillance system for public transport systems," Acta Automatica Sinica, pp. 393-407, May 2003.
[21] J. Watlington and V. M. Bove, Jr., "A system for parallel media processing," Parallel Computing, vol. 23(12), pp. 1793-1809, 1997.
[22] D. Comaniciu, V. Ramesh and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proc. of IEEE Conf. on CVPR, pp. 142-149, 2000.
[23] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," Proc. of IEEE Int'l Conf. on CVPR, vol. 2, June 1999.
[24] C. Karlof, N. Sastry and D. Wagner, "Cryptographic voting protocols: A systems perspective," USENIX Security Symp., 2005.
[25] N. Evans and S. Schneider, "Analysing time dependent security properties in CSP using PVS," ESORICS, 2000.
[26] V. Vanackvere, "The TRUST protocol analyser, automatic and efficient verification of cryptographic protocols," Verification Workshop, 2002.
[27] H. Bowman, G. Faconti and M. Massink, "Specification and verification of media constraints using UPPAAL," Eurographics Workshop, DSV-IS, 1998.
[28] T. Sun, K. Yasumoto, M. Mori and T. Higashino, "QoS functional testing for multimedia systems," IFIP FORTE, 2003.
[29] G. J. Holzmann, The Spin Model Checker: Primer and Reference Manual, Boston: Addison Wesley, 2004.
[30] R. Hartley and A. Zisserman, "Multiple view geometry in computer vision," Cambridge University Press, 2001.
[31] C. A. Rothwell, "Object recognition through invariant indexing," Oxford Science Publications, 1995.
[32] The MPI Standard, http://www-unix.mcs.anl.gov/mpi/.
[33] LAM/MPI Parallel Computing, http://www.lam-mpi.org/.
[0026] I. INTRODUCTION
[0027] Embodiments of the invention are directed to a distributed smart camera system that uses a peer-to-peer communication protocol to transfer data between multiple cameras. In a smart camera system, each camera is attached to a computing component, in this case different CPUs. Distributed camera systems communicating in peer-to-peer fashion can be scaled up to very large systems, unlike server-based designs that are limited by the server capacity.
[0028] Reliable and efficient tracking of objects by multiple cameras is an important and challenging problem which finds wide-ranging application areas such as video surveillance, indexing and compression, gathering statistics from videos, traffic flow monitoring, and smart rooms. Due to the inherent limitations of a single visual sensor, such as limited field of view and delays due to panning and tilting, collaboration among multiple cameras has become an inevitable trend. Multiple cameras can enhance the capability of vision applications, providing fault-tolerance and robustness for issues such as target occlusion. Examples of multi-camera systems can be found in [1]-[17].
[0029] Yet, using multiple cameras to track multiple objects poses additional challenges. One of these challenges is the consistent labeling problem, i.e. establishing correspondences between moving objects in different views. Multi-camera systems, rather than treating each camera individually, compare features and trajectories from different cameras in order to obtain a history of the object movements, and handle the loss of target objects which may be caused by occlusion or by errors in the background subtraction (BGS) algorithms.
[0030] Different approaches have been taken to solve the consistent labeling problem.
Kelly et al. [9] assume that all cameras are fully calibrated to construct a 3D environment and objects are tracked as a group of voxels, whereas Cai and Aggarwal [4] assume that only the neighboring cameras are calibrated; they switch cameras when one camera loses view of the target, and can handle limited occlusion. Bayesian networks are used by Chang and Gong [5] to do feature matching and establish correspondences. They use camera calibration information, landmark modalities, and apparent height and color to derive additional constraints. Utsumi et al. [15] employ feature matching as well. Yet, fully calibrating cameras is expensive and impractical for the end user to set up, as it requires some expert intervention. In addition, relying on feature matching can cause problems, as the features can be seen differently by different cameras due to lighting variations.
[0031] Lee et al. [12] assume that the intrinsic camera parameters are known and use the centroids of the tracked objects to estimate a homography and to align the scene's ground plane across multiple views.
[0032] Observation intervals and transition times of objects across cameras are used for tracking in [10]. Khan and Shah [11] presented a method which uses field of view (FOV) lines and does not require camera calibration. However, due to the way the lines are recovered, they may not be localized reliably, and if there is dense traffic around the FOV line, the method can result in inconsistent labels. Funiak et al. [19] proposed a distributed calibration algorithm for camera networks.
[0033] Although many groups have developed methods to combine data from multiple cameras, much less attention has been paid to the computational efficiency and scalability of these methods. Many existing systems assume that multiple cameras are processed on a single CPU or by a centralized server. However, these are not scalable approaches. Chang and Gong [5] propose a multi-camera system which is implemented on an SGI workstation. For a single CPU system, the amount of processing necessary to track multiple objects on multiple camera views can be excessive for real-time performance. Furthermore, scalability is debilitated as each additional camera imposes greater performance requirements.
[0034] In order to increase processing power, and handle multiple video streams, distributed systems have been employed instead of using a single CPU. In a distributed system, different CPUs are used to process inputs from different cameras. Yet, most existing distributed multi-camera systems use a centralized server/control center [2], [6], [13], [20]. Nguyen et al. [13] propose a system using multiple cameras where all the local processing results are sent to a main controller. Collins et al. [6] introduced the VSAM system, where all resulting object hypotheses from all sensors are transmitted at every frame back to a central operator control unit, which is responsible for integrating symbolic object trajectory information accumulated by each of the processing units together with a 3D geometric site model. Using a central server simplifies some problems, such as video synchronization and communication between the algorithms handling the various cameras. But, server-based multi-camera systems have a bandwidth scaling problem, since the central server can quickly become overloaded with the aggregate sum of requests from an increased number of nodes. In addition, server-based systems are not practical in many realistic environments, and have high installation costs. Besides the algorithm development, hardware design and resource management has also been considered for parallel processing. Watlington and Bove [21] proposed a data-flow model and use a distributed resource manager to support parallelism for media processing.
[0035] The aforementioned problems of server-based systems necessitate the use of peer-to-peer systems, where individual nodes communicate with each other without going through a centralized server. Several important issues need to be addressed when designing peer-to-peer systems. First, there may be significant communication delays between processing nodes. This necessitates the design of tracking algorithms requiring relatively little inter-process communication and a small number of messages between the nodes. This requires a careful design and choice of when to trigger the data transfer, what data to send in what fashion, and to whom to send this data. Another important issue and challenge is maintaining consistency for data across cameras as well as operations upon the data without use of a centralized server. Also, even if the cameras and input video sequences are synchronized, communication and processing delays pose a serious problem. Depending on the amount of processing each processor has to do, one processor can run faster/slower than the other. Thus, when a processor receives a request, it may be ahead/behind compared to the requester. These issues mandate efficient and sophisticated communication protocols for peer-to-peer systems.
[0036] These protocols find use in real-time systems, which tend to have stringent requirements for proper system functionality. Hence, the protocol design for these systems necessitates transcending typical qualitative analysis using simulation and instead, requires verification. The protocol must be checked to ensure it does not cause unacceptable issues such as deadlocks and process starvation, and has correctness properties such as the system eventually reaching specified operating states.
[0037] Atsushi et al. [1] use multiple cameras attached to different PCs connected to a network. They use calibrated cameras and track the objects in world coordinates, sending message packets between stations. Ellis [7] also uses a network of calibrated cameras. Bramberger et al. [3] introduce a distributed embedded smart camera system consisting of loosely coupled cameras. They use predefined migration regions to hand over the tracking process from one camera to the other. But these methods do not discuss the type and details of communication, or how to address the communication and processing delay issues.
[0038] Verification of communication protocols is a rich topic, particularly for security and cryptographic systems. Karlof et al. [24] analyzed the security properties of two cryptographic protocols and discovered several potential weaknesses in voting systems. Evans and Schneider [25] verified time-dependent authentication properties of security protocols. Vanackvere [26] modeled cryptographic protocols as a finite number of processes interacting with a hostile environment and proposed a protocol analyzer, TRUST, for verification. Finally, a burgeoning body of work exists pertaining to the formal verification of networked multimedia systems. Bowman et al. [27] described multimedia streams as timed automata, and verified the satisfaction of quality of service (QoS) properties including throughput and end-to-end latency. Sun et al. [28] proposed a testing method for verifying QoS functions in distributed multimedia systems where media streams are modeled as a set of timed automata.
[0039] Our previous work [16], [18] introduced some of the tools necessary towards building a peer-to-peer camera system. The work presented in [16] performs multi-camera tracking and information exchange between cameras. However, it was implemented on a single CPU in a sequential manner, and the tracking algorithm used required more data transfer. This paper introduces SCCS, the Scalable Clustered Camera System, together with its communication protocol and its verification results. SCCS is a scalable peer-to-peer multi-camera system for multi-object tracking. It is a smart camera system wherein each camera is attached to a computing component, in this case different CPUs. In this article, if not specified otherwise, a peer-to-peer smart camera system is meant when the term multi-camera system is used.
[0040] A computationally efficient and robust tracking algorithm is introduced to perform tracking on each camera view, while maintaining consistent labeling. Instead of transferring control of tracking jobs from one camera to another, each camera in SCCS performs its own tracking and keeps its own tracks for each target object, thus providing fault tolerance. Cameras can communicate with each other to resolve partial/complete occlusions, and to maintain consistent labeling. In addition, if the location of an object cannot be determined reliably at some frame due to errors resulting from BGS, the track of that object is robustly updated from other cameras. If, instead, tracking is performed independently on different camera views, there are two possibilities when an object in the view of a camera C^i is completely occluded by a static background item for a long time: 1) the current location can be estimated from the speed and direction of the previous movement, but this requires several assumptions, such as steady movement and no direction change; 2) this object is lost in C^i but continues to be tracked in the other views where it is visible. However, we do not want to lose the object in any of the views, since we cannot guarantee that the target will always be visible in multiple cameras. Thus, it is always more advantageous to keep all the tracks updated in all the views. SCCS updates trajectories without interruption and without any need for an estimation of the moving speed and direction, even if the object is totally invisible to that camera. Our tracking algorithm deals with the merge/split cases on a single camera view without sending requests to other nodes in the system. Thus, it provides coarse object localization with sparse message traffic.
[0041] In addition, a novel communication protocol is introduced, which coordinates multiple tracking components across the distributed system, and handles the processing and communication delay issues. The decisions about when and with whom to communicate are made such that the frequency and size of transmitted messages are kept small. This protocol incorporates variable synchronization capabilities, so as to allow flexibility with accuracy tradeoffs. Non-blocking sends and receives are used for message communication, since for each camera it is not possible to predict when and how many messages will be received from other cameras. Moreover, the type of data that is transferred between the nodes can be changed, depending on the application and what is available, and our protocol remains valid and can still be employed. For instance, when full calibration of all the cameras can be tolerated, the 3D world coordinates of the objects can be transferred between the nodes. We verified this communication protocol with success by using the SPIN verification tool. [0042] We present experimental results which demonstrate the success of the proposed peer-to-peer multi-camera tracking system, with a minimum accuracy of 94.2% and 90% for new_label and lost_label cases, respectively, with a high frequency of synchronization. We also present the results obtained after exhaustively verifying the presented communication protocol with different communication scenarios. [0043] The rest of the paper is organized as follows: Section II compares SCCS with a server-based scenario. Section III describes the computer vision algorithms in general. More specifically, recovery of field of view (FOV) lines is described in Section III-A.1, and the tracking algorithm is introduced in Section III-B. The communication protocol is explained in Section IV, and its verification and obtained results are described in Section V. Section VI presents the experimental results obtained with several different video sequences with varying difficulty, and Section VII concludes the paper. [0044] II. A SERVER-BASED SYSTEM SCENARIO VERSUS SCCS
[0045] As stated before, server-based multi-camera systems have a bandwidth scaling problem, and are limited by the server capacity. In order to illustrate the excessive number of messages and the load a server needs to handle, and compare these to SCCS, we will introduce a server-based system scenario in this section. In this server-based system, the nodes keep the server updated by sending it messages for each tracker in their FOV. To make a fair comparison between this scenario and the communication protocol used by SCCS, we assume that these messages are sent at the synchronization points, which will be defined in Section IV-D. Due to different processing rates of the distinct processors coupled with communication delays, a server keeps the received data buffered to provide consistent data transfer between the nodes. However, this is not a practical approach since the buffer size may need to be very large. Thus, we designed this scenario so that the server does not need a buffer. The nodes are required to wait at each synchronization point until they receive an overall done message from the server. At this point, each node needs to send a message for each tracker. These messages also indicate whether the node has a request for any of the other nodes or not. Then, the server handles all these messages, determines the replies for each request, if there were any, and sends the replies to the corresponding nodes. The nodes update their trackers after receiving the replies, and notify the server that they are done. After receiving a done message from all the nodes, the server sends an overall done message to the nodes so that the nodes can move on. Based on this scenario, the total number of messages that go through the server can be determined by using:
M_server = S × 2 × N + Σ_{i=1}^{N} E_i + S × Σ_{i=1}^{N} T_i    (1)
where S is the number of synchronization points, N is the number of nodes/cameras, and E_i is the total number of events that will trigger requests in the view of camera C^i. T_i is the total number of trackers in the view of C^i, and in this formula, without loss of generality, it is assumed that, for camera C^i, T_i remains the same during the video. Whereas, for SCCS this number is equal to:
M_SCCS = S × (2 × N − 1) + 2 × (N − 1) × Σ_{i=1}^{N} E_i    (2)
[0046] A more detailed discussion of how (2) is obtained will be given in Section IV-D. As seen from (2), the total number of messages sent around by SCCS is independent of the number of trackers in each camera view, since the communication is done in a peer-to-peer manner. This fact can also be seen in Fig. 1(a) and 1(c). These figures were obtained by setting E_i = 20 and T_i = T, ∀i, where T ∈ {1, 5, 10, 20}. In addition, equations (1) and (2), and Fig. 1 show that the server-based system does not scale well.
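To make the comparison in (1) and (2) concrete, the following minimal Python sketch evaluates both totals for a hypothetical deployment; the synchronization-point count, camera count, and per-camera event/tracker numbers below are illustrative values, not figures from the experiments.

```python
def messages_server(S, N, E, T):
    """Total messages through the server, eq. (1): done/overall-done traffic,
    one reply per request-triggering event, and one message per tracker at
    every synchronization point. E and T are per-camera lists."""
    return S * 2 * N + sum(E) + S * sum(T)

def messages_sccs(S, N, E):
    """Total messages in SCCS, eq. (2): ring done messages plus the broadcast
    overall done, and N-1 requests plus N-1 replies per event."""
    return S * (2 * N - 1) + 2 * (N - 1) * sum(E)

# Hypothetical setup: 4 cameras, 100 synchronization points,
# 20 request events and 5 trackers per camera view.
S, N = 100, 4
E, T = [20] * N, [5] * N
print(messages_server(S, N, E, T))   # 2880
print(messages_sccs(S, N, E))        # 1180
```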
[0047] III. MULTI-CAMERA MULTI-OBJECT TRACKING [0048] A. Field of View (FOV) Lines
[0049] FOV lines have been introduced by Khan and Shah [11] to solve the consistent labeling problem. They show that when FOV lines are recovered, the consistent labeling problem can be solved successfully. The 3D FOV lines of camera C^i are denoted by L_s^i [11], where s ∈ {r, l, t, b} corresponds to one of the sides of the image plane. The projections of the 3D FOV lines of camera C^i onto the image plane of camera C^j result in 2D lines denoted by L_s^{i,j}, and called the FOV lines.
[0050] 1) Recovery of Field of View Lines: The FOV lines are recovered by observing moving objects in different views and using entry/exit events by Khan and Shah [11]. However, there needs to be enough traffic across a particular FOV line to be able to recover it. In addition, the method in [11] relies on the performance of the BGS algorithm. Depending on the size of the objects, they may not be detected instantly or entirely, which affects the location of the FOV lines. Since FOV lines will play an important role in consistent labeling and in our communication protocol later, it is necessary to recover all of them in a robust way. Moreover, the precision in locating the FOV lines is very important, especially in crowded scenes, for consistent labeling.
[0051] In this application, embodiments are described that are directed to a robust and reliable way of recovering the FOV lines, which does not rely on the object movement in the scene or the performance of the BGS algorithm. This way, all visible FOV lines in a view can be recovered at the beginning, even if there is no traffic at the corresponding region. In addition, there is no need to know the intrinsic or extrinsic camera parameters. As in [11], it is assumed that the scene ground is planar, and a homography is estimated to recover the FOV lines. [0052] With reference to Figure 2, p_{b1}^{(j)} and p_{b2}^{(j)} are the corresponding locations of p_{b1}^{(i)} and p_{b2}^{(i)}, respectively, and the recovered FOV line passing through p_{b1}^{(j)} and p_{b2}^{(j)} is shown with a dashed line in the view of C^j.
[0053] The inputs to the proposed system are four pairs of corresponding points
(chosen off-line on the ground plane) in two different camera views.
[0054] These points in the views of C^i and C^j are denoted by P^{(i)} = { p_1^{(i)}, ..., p_4^{(i)} } and P^{(j)} = { p_1^{(j)}, ..., p_4^{(j)} }, respectively (see Fig. 2). Let p̃_k^{(i)} = ( x_k^{(i)}, y_k^{(i)}, 1 )^T, k ∈ {1, ..., 4}, denote the homogeneous coordinates of the input point p_k^{(i)} = ( x_k^{(i)}, y_k^{(i)} ). Then, a homography H is estimated from { p̃_1^{(i)}, ..., p̃_4^{(i)} } and { p̃_1^{(j)}, ..., p̃_4^{(j)} } by using a Direct Linear Transform (DLT), as described by Hartley and Zisserman [30].
[0055] The image of the camera view whose FOV lines will be recovered on the other view is called the field image. After the homography is estimated, the system finds two points on one of the boundaries of the field image, so that each of them is in general position with the four input points. Then it checks with the user that these boundary points are coplanar with the four input points. Let the two points found on the image boundary s of the camera C^i be denoted by p_{s1}^{(i)} = ( x_{s1}^{(i)}, y_{s1}^{(i)} ) and p_{s2}^{(i)} = ( x_{s2}^{(i)}, y_{s2}^{(i)} ), where s ∈ {r, l, t, b} corresponds to one of the sides of the image plane (see Fig. 2). The corresponding locations of ( x_{s1}^{(i)}, y_{s1}^{(i)} ) and ( x_{s2}^{(i)}, y_{s2}^{(i)} ) on the view of camera C^j are denoted by p_{s1}^{(j)} = ( x_{s1}^{(j)}, y_{s1}^{(j)} ) and p_{s2}^{(j)} = ( x_{s2}^{(j)}, y_{s2}^{(j)} ), and are computed by using:

p̃_{sn}^{(j)} = H p̃_{sn}^{(i)}    (3)

[0056] where n ∈ {1, 2}, and p̃_{sn}^{(i)} = ( x_{sn}^{(i)}, y_{sn}^{(i)}, 1 )^T denotes the homogeneous coordinates of p_{sn}^{(i)} = ( x_{sn}^{(i)}, y_{sn}^{(i)} ). x_{sn}^{(j)} and y_{sn}^{(j)} are obtained by normalizing p̃_{sn}^{(j)} so that its third entry is equal to 1.
[0057] Once p_{s1}^{(j)} and p_{s2}^{(j)} are obtained on the other view, the line going through these points defines the FOV line corresponding to the image boundary s of the camera C^i (see Fig. 2). Two points are found on each boundary of interest, and the FOV line corresponding to that boundary is recovered similarly.
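The following Python sketch illustrates this recovery step under the assumptions above (planar ground, four off-line point correspondences): the homography is estimated with the basic DLT, boundary points are mapped with (3), and the FOV line is the line through the mapped points. The point coordinates and image size in the usage example are hypothetical.

```python
import numpy as np

def homography_dlt(src_pts, dst_pts):
    """Estimate H with the basic DLT so that dst ~ H * [x, y, 1]^T.
    Four correspondences are enough; more are used in a least-squares sense."""
    rows = []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)            # null vector of the DLT matrix

def map_point(H, pt):
    """Apply (3) to a 2D point and normalize so the third entry equals 1."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

def recover_fov_line(H, boundary_pts):
    """Map two boundary points of side s into the other view and return the
    slope/intercept (S, C) of the line y = S*x + C through them."""
    (x1, y1), (x2, y2) = (map_point(H, p) for p in boundary_pts)
    S = (y2 - y1) / (x2 - x1)              # assumes the mapped line is not vertical
    return S, y1 - S * x1

# Hypothetical ground-plane correspondences between view i and view j,
# and the bottom boundary of a 640x480 view i.
P_i = [(120, 400), (510, 395), (480, 210), (150, 215)]
P_j = [(80, 380), (470, 390), (430, 180), (110, 170)]
H = homography_dlt(P_i, P_j)
fov_line_b = recover_fov_line(H, [(0, 479), (639, 479)])
```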
[0058] Figures 3 to 5 show the recovered FOV lines for different video sequences and camera setups. Although there was no traffic along the right boundary of Fig. 3b, the FOV line corresponding to it is successfully recovered as shown in Fig. 3a.
[0059] Fig. 3. (a) -(b) and (c) -(d) show the recovered FOV lines for two different camera setups. The shaded regions are outside the FOV of the other camera.
[0060] Fig. 4. (a), (b) and (c) show the recovered FOV lines. The shaded regions are outside the FOV of the other cameras.
[0061] Fig. 5. (a),(b) and (c) show the recovered FOV lines. The shaded regions are outside the FOV of the other cameras.
[0062] Another way of recovering the FOV lines is by using the projective invariants in P^2 [18]. On the projective plane, two independent invariants of five coplanar points can be written as ratios of products of determinants of the form |M_{a,b,c}^l| (equations (4) and (5)). [0063] Here |M_{a,b,c}^l|, {a, b, c} ∈ {1, ..., 5}, denotes the determinant of the matrix M_{a,b,c}^l for image l, whose columns are the homogeneous coordinates of the points p_a^{(l)}, p_b^{(l)} and p_c^{(l)}. [0064] With five coplanar points, one ( p_{s1}^{(i)} ) on the boundary s, and four that are input to the system ({ p_1^{(i)}, ..., p_4^{(i)} }), two independent P^2 invariants can be calculated by using equations (4) and (5). Then the points in { p_1^{(j)}, ..., p_4^{(j)} } are inserted into (4) and (5), and these equations are rewritten to solve for the corresponding point of p_{s1}^{(i)} on the view of C^j. The same steps are repeated for the boundary point p_{s2}^{(i)}, and the line passing through the calculated corresponding points on the view of C^j is recovered as the FOV line corresponding to the boundary s of the view of camera C^i.
[0065] 2) Checking Object Visibility: As stated previously, L_s^{i,j} denotes the projection of the 3D FOV line L_s^i onto the view of C^j and is represented by the equation of the line, which is written as y = Sx + C. Henceforth, a point p_m^{(j)} = ( x_m^{(j)}, y_m^{(j)} ) will be considered on the visible side of L_s^{i,j} if sign( y_m^{(j)} − S x_m^{(j)} − C ) = sign( y_a^{(j)} − S x_a^{(j)} − C ), where ( x_a^{(j)}, y_a^{(j)} ) are the coordinates of p_a^{(j)}, which is any one of the input points in P^{(j)}. [0066] When an object O^{(j)} enters the view of C^j, BGS is applied first and a bounding box around the foreground object is obtained. Then, its visibility in the view of C^i is checked by employing L_s^{i,j}. The midpoint ( p_m^{(j)} ) of the bottom line of the bounding box of the object is used as its location. If this point lies on the visible side of L_s^{i,j} for all s ∈ {r, l, t, b}, then it is deduced that O^{(j)} is visible by C^i (see Fig. X).
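A minimal sketch of this sign test, assuming each recovered FOV line is stored as the (S, C) pair of y = S·x + C and that one of the ground-plane input points is kept as the visible-side reference; the dictionary of per-side lines is an illustrative bookkeeping choice.

```python
import numpy as np

def on_visible_side(fov_line, pt, ref_pt):
    """pt is on the visible side of L^{i,j}_s when it falls on the same side
    of y = S*x + C as a reference point known to lie in the overlap region
    (e.g. one of the four ground-plane input points)."""
    S, C = fov_line
    side = lambda p: np.sign(p[1] - S * p[0] - C)
    return side(pt) == side(ref_pt)

def visible_by_other_camera(bbox, fov_lines, ref_pt):
    """bbox = (x_min, y_min, x_max, y_max) from BGS; the midpoint of its
    bottom edge is used as the object location, and it must lie on the
    visible side of every recovered FOV line of camera C^i in this view."""
    x_min, y_min, x_max, y_max = bbox
    foot = ((x_min + x_max) / 2.0, y_max)      # image y grows downward
    return all(on_visible_side(line, foot, ref_pt) for line in fov_lines.values())
```

Here fov_lines would map the sides in {r, l, t, b} that are actually visible in this view to their (S, C) parameters.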
[0067] B. The tracking algorithm: Coarse object localization with sparse message traffic
[0068] Our proposed tracking algorithm allows for sparse message traffic by handling merge and split cases within a single camera view without sending request messages to other cameras.
[0069] First, foreground objects are segmented from the background in each camera view by using the BGS algorithm presented by Stauffer and Grimson [23], which employs adaptive background mixture models to model the background and to segment the foreground objects. Then, connected component analysis is performed, which results in foreground blobs. When a new foreground blob is detected within the camera view, a new tracker is created, and a mask for the tracker is built where the foreground pixels from this blob and the background pixels are set to 1 and 0, respectively. The box surrounding the foreground pixels of the mask is called the bounding box. Then, the color histogram of the blob is learned from the input image, and is saved as the model histogram of the tracker.
[0070] At each frame, the trackers are matched to detected foreground blobs by using a computationally efficient blob tracker which uses a matching criterion based on bounding box intersection and the Bhattacharya coefficient ρ(y) [22], defined by

ρ(y) ≡ ρ[ p(y), q ] = ∫ √( p_z(y) q_z ) dz    (6)

[0071] In (6), z is the feature representing the color of the target model and is assumed to have a density function q_z, while p_z(y) represents the color distribution of the candidate foreground blob centered at location y. The Bhattacharya coefficient is derived from the sample data by using:
ρ(y) ≈ Σ_{u=1}^{m} √( p_u(y) q_u )    (7)

[0072] where q = { q_u }_{u=1,...,m} and p(y) = { p_u(y) }_{u=1,...,m} are the discrete densities
estimated from the m-bin histogram of the model and the candidate blobs respectively. These densities are estimated by using the color information at the nonzero pixel locations of the masks. If the bounding box of a foreground blob intersects with that of the current model mask of the tracker, the Bhattacharya coefficient between the model histogram of the tracker and the color histogram of the foreground blob is calculated by using (6). The tracker is assigned to the foreground blob which results in the highest Bhattacharya coefficient, and the mask, and thus the bounding box, of the tracker are updated. The Bhattacharya coefficient with which the tracker is matched to its object is called the similarity coefficient. If the similarity coefficient is greater than a predefined distribution update threshold, the model histogram of the tracker is updated to be the color histogram of the foreground blob to which it is matched.
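A minimal sketch of this matching step, computing the discrete coefficient in (7) and assigning a tracker to the overlapping blob with the highest value; the dictionary layout for trackers/blobs and the 0.8 distribution update threshold are illustrative assumptions, not values given in the text.

```python
import numpy as np

def bhattacharyya(p, q):
    """Discrete Bhattacharya coefficient of eq. (7) for two normalized
    m-bin color histograms."""
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))))

def boxes_intersect(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def match_tracker(tracker, blobs, update_threshold=0.8):
    """Assign the tracker to the intersecting blob with the highest
    Bhattacharya coefficient; update its model histogram only when the
    similarity coefficient exceeds the distribution update threshold."""
    candidates = [b for b in blobs if boxes_intersect(tracker['bbox'], b['bbox'])]
    if not candidates:
        return None, 0.0                     # no match: a lost_label candidate
    best = max(candidates, key=lambda b: bhattacharyya(tracker['hist'], b['hist']))
    similarity = bhattacharyya(tracker['hist'], best['hist'])
    tracker['bbox'] = best['bbox']
    if similarity > update_threshold:
        tracker['hist'] = best['hist']
    return best, similarity
```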
[0073] Based on this matching criterion, when objects merge, multiple trackers are matched to one foreground blob, and the labels of all matched trackers are displayed on this blob, as shown in Figures 6, 7, 21 and 22. The masks of the trackers are then updated in the previously discussed fashion. The trackers that are matched to the same foreground blob are put into a merge state, and in this state their model histograms are not updated. When objects split from each other, trackers are matched to their targets based on the bounding box intersection and Bhattacharya coefficient criteria mentioned above. [0074] With reference to Fig. 6, an example of successfully resolving a merge is shown: (a)(b)(c) and (a')(b')(c') show the original images, and the tracked objects with their labels, respectively.
[0075] There may be rare but unfavorable cases where a foreground object, appearing after the split of merged objects, may not be matched to its tracker. We deal with these cases as follows: denote two trackers by T1 and T2, and their target objects by O1 and O2, respectively. When these objects merge, O_{1∪2} is formed, and T1 and T2 are both matched to O_{1∪2}. After O1 and O2 split, B_{Ti,Oj} are calculated, where {i, j} ∈ {1, 2}, and B_{Ti,Oj} denotes the Bhattacharya coefficient calculated between the histograms of T_i and O_j. Based on B_{Ti,Oj}, both T1 and T2 can still be matched to O2, for instance, and stay in the merge state. Denote the similarity coefficient of T_i by S_{Ti}. Thus, in this case, S_{T1} = B_{T1,O2} and S_{T2} = B_{T2,O2}. This can happen because the model distributions of the trackers are not updated during the merge state, and there may be changes in the color of O1 during and after the merge. Another reason may be O1 and O2 having similar colors from the outset. When this occurs, O1 is compared against the trackers which are in the merge state and intersect with the bounding box of O1. That is, it is compared against T1 and T2, and B_{T1,O1} and B_{T2,O1} are calculated. Then, O1 is assigned to the tracker T_{i*}, where:

i* = arg max_{i} B_{Ti,O1}.    (8)
[0076] If a foreground blob cannot be matched to any of the trackers, and if there are trackers in the merge state, the unmatched object is compared against those trackers by using the logic in (8), which is also applicable if there are more than two trackers in the merge state as shown in Figures 7 and 21. [0077] An example of resolving the merge of multiple objects is shown in FIG. 7.
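A small sketch of the reassignment in (8), reusing bhattacharyya() and boxes_intersect() from the previous snippet; the 'merged' flag on a tracker is an illustrative bookkeeping choice, not part of the described algorithm.

```python
def reassign_after_split(orphan_blob, trackers):
    """Compare an unmatched blob against the merge-state trackers whose
    bounding boxes intersect it, and assign it to the one with the highest
    Bhattacharya coefficient (the logic in (8))."""
    merged = [t for t in trackers
              if t.get('merged') and boxes_intersect(t['bbox'], orphan_blob['bbox'])]
    if not merged:
        return None
    best = max(merged, key=lambda t: bhattacharyya(t['hist'], orphan_blob['hist']))
    best['bbox'], best['merged'] = orphan_blob['bbox'], False
    return best
```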
[0078] As stated previously, this algorithm provides coarser object localization and decreases the message traffic by not sending a request message each time a merge or split occurs. If the exact location of an object in the blob formed after the merge is required, we propose another algorithm that can be used at the expense of more message traffic: When a tracker is in the merge state, other nodes that can see its most recent location can be determined as described in III-A.2, and a request message can be sent to these nodes to retrieve the location of the tracker in the merge state. If the current location of the tracker is not visible by any of the other cameras, then the mean-shift tracking [22] can be activated. The mean-shift tracking algorithm aims to minimize the distance between a given target distribution and the candidate distribution in the current frame. The similarity between two distributions is expressed as a metric based on the Bhattacharya coefficient. Given the distribution of the target model and the estimated location of the target in the previous frame, an optimization is performed to find a new location and increase the value of the Bhattacharya coefficient.
[0079] The mean-shift tracking is error-prone since it can be distracted by the background. It is also computationally more expensive. Thus, when a tracker is in the merge state, it is preferable to send messages to other nodes, and request the location of this tracker, if its most recent location is in their FOV. Thus, this algorithm requires additional message traffic. We proposed this second algorithm as an alternative if the exact location of a tracker in the merge state is required. The experiments presented in Section VI were performed by using the first proposed algorithm as the tracking component of the SCCS. [0080] IV. INTER-CAMERA COMMUNICATION PROTOCOL
[0081] One issue that needs to be addressed when using peer-to-peer systems is that communication is expensive and there may be significant communication delays between processing nodes. Also, the number of messages that are sent between the nodes should be decreased to save power and increase speed. Another issue is maintaining consistency for data across cameras as well as operations upon the data without use of a centralized server. In addition, even if the cameras and input video sequences are synchronized, communication and processing delays pose a serious problem. The processors will have different amounts of processing to do, and may also run at different processing rates. This, coupled with potential network delays, causes one processor to be ahead of/behind the others during execution. Thus when a processor receives a request, it may be ahead/behind compared to the requester. Hence, system synchronization becomes very important to ensure the transfer of coherent vision data between cameras. These aforementioned issues mandate an efficient and sophisticated communication protocol.
[0082] As mentioned before, the SCCS protocol utilizes point-to-point communication, as opposed to some previous approaches that require a central message processing server. Our approach offers a latency advantage, and the nodes do not need to send the state of the trackers to a server at every single frame. This decreases the number of messages considerably as will be discussed in Section IV-D.1. Moreover, this design is more scalable, since for a central server implementation, the server quickly becomes overloaded with the aggregate sum of messages and requests from an increased number of nodes. [0083] In this section, a communication protocol is introduced which can handle communication and processing delays and hence maintain consistent data transfer across multiple cameras. This protocol is designed by determining the answers to these questions: [0084] a. When to communicate -determining the events which will require the transfer of data from other cameras. These events will henceforth be referred to as request events. [0085] b. With whom to communicate -determining the cameras to which requests should be sent.
[0086] c. What to communicate -choosing the data to be transferred between the cameras.
[0087] d. How to communicate -designing the manner in which the messages are sent, and determining the points during execution at which data transfers should be made.
The protocol is designed so that the number of messages that are sent between the nodes is decreased, and the process synchronization issue is addressed.
[0088] In the following, we refer to C^i and C^j as the requesting and the replying camera node, respectively. The block diagram in Fig. 8 illustrates the concepts discussed in this section. It should be noted that, at some point during execution, each camera node can act as the requesting or replying node. The implementation of the proposed system consists of a parallel computing cluster, with communication between the cameras performed by the Message Passing Interface (MPI) library [32]. In this work, the use of MPI is illustrative but not mandatory since it, like other libraries, provides well-defined communication operations including blocking and non-blocking send and receive, broadcast, and gathering. MPI is also well-defined for inter- and intra-group communication and can be utilized to manage large camera groups. We take advantage of the proven usefulness of this library, and treat it as a transparent interface between the camera nodes. With reference to Fig. 8, communication between two cameras is shown.
[0089] A. When to communicate -Request Events
[0090] A camera will need information from the other cameras when: a) a new object appears in its FOV, or b) a tracker cannot be matched to its target object. These events are called request events, and are referred to as new_label and lost_label events, respectively. If one of these events occurs within a camera's FOV, the processor processing that camera needs to communicate with the other processors.
[0091] In the new_label case, when a new object is detected in the current camera view, it is possible that this object was already being tracked by other cameras. If this is the case, the camera will issue a new_label request to those cameras to receive the existing label of this object, and to maintain consistent labeling.
[0092] Camera C could also need information from another node when a tracker in C cannot be matched to its target object, and this is called the lost_label case. This may occur, for instance, if the target object is occluded in the scene or cannot be detected as a foreground object at some frame due to the failure of the BGS algorithm. In this case, a lost_label request will be sent to the appropriate node to retrieve and update the object location. [0093] Another scenario where communication between the cameras may become necessary is when trackers are merged and the location of each merged object is required. However, if the exact location of the object is not required, and coarser localization is tolerated, then the tracking algorithm introduced in Section III-B can be used to handle the merge/split within single camera view without sending request messages to the other nodes. [0094] B. With whom to communicate
[0095] The proposed protocol is designed such that, rather than sending requests to every single node in the system, requests are sent to the processors that can provide the answers for them. This is achieved by employing the FOV lines.
[0096] When a request needs to be made for an object O in the view of the requesting camera C^i, the visibility of this object by another camera C^j is checked using the FOV lines as described in Section III-A.2. If it is deduced that the object is visible by C^j, a request message targeted for node j is created and the id of the target processor, which is j in this case, is inserted into this message. Similarly, a list of messages for all the cameras that can see this object is created. [0097] C. What to communicate
[0098] The presented protocol sends minimal amounts of data between different nodes. Messages consist of 256-byte packets, with character command tags, integers and floats for track labels and coordinates, respectively, and integers for camera id numbers.
Clearly, this is significantly less than the amount of data inherent in transferring streams of video or even image data and features.
[0099] Messages that are sent between the processors, processing the camera inputs, are classified into four categories: 1) New label request messages, 2) Lost label request messages, 3) New label reply messages, and 4) Lost label reply messages. As stated, all these messages consist of 256-byte packets.
[00100] 1) New label request case: If a foreground object viewed by camera C^i cannot be matched to any existing tracker, a new tracker is created for it, all the cameras that can see this object are found by using the FOV lines, and a list of cameras to communicate with is formed. A request message is created to be sent to the cameras in this list. The format of this message is:
[00101] Cmd_tag Target_id Curr_id Side x y Curr_label.
[00102] In this case, Cmd_tag is a string that holds NEW_LABEL_REQ, indicating that this is a request message for the new_label case. Target_id and Curr_id are integers. Target_id is the id of the node to which this message is addressed, and Curr_id is the id of the node that processes the input of the camera which needs the label information. For instance, Curr_id is i in this case. These id numbers are assigned to the nodes by MPI at the beginning of the execution. Side is another string which holds information about the side of the image from which the object entered the scene. Thus, it can be right, left, top, bottom, or middle. The next two entities in the message, x and y, are doubles representing the coordinates of the location ( p_m^{(i)} ) of the object in the coordinate system of C^i. Finally, Curr_label is an integer holding the temporary label given to this object by C^i. The importance and benefit of using this temporary label will be clarified in Sections IV-D and VI.
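As an illustration, the 256-byte packet for a new_label request could be packed as below; the exact byte layout and field widths are not specified in the text, so this fixed-size encoding is only a hypothetical sketch that respects the stated packet size and field types.

```python
import struct

PACKET_SIZE = 256
# Hypothetical layout: 16-byte command tag, two integer ids, an 8-byte side
# string, the (x, y) location as doubles, and the integer temporary label.
NEW_LABEL_REQ_FMT = '<16s i i 8s d d i'

def pack_new_label_request(target_id, curr_id, side, x, y, curr_label):
    payload = struct.pack(NEW_LABEL_REQ_FMT, b'NEW_LABEL_REQ', target_id,
                          curr_id, side.encode(), x, y, curr_label)
    return payload.ljust(PACKET_SIZE, b'\x00')   # pad to the 256-byte packet

def unpack_new_label_request(packet):
    cmd, target_id, curr_id, side, x, y, label = struct.unpack_from(
        NEW_LABEL_REQ_FMT, packet)
    return (cmd.rstrip(b'\x00').decode(), target_id, curr_id,
            side.rstrip(b'\x00').decode(), x, y, label)

# e.g. node 2 asks node 0 about an object that entered from the right
pkt = pack_new_label_request(target_id=0, curr_id=2, side='right',
                             x=312.0, y=455.0, curr_label=52)
```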
[00103] 2) Lost label request case: When a tracker in C^i cannot be matched to its target object, this is called the lost_label case. For every tracker that cannot find its match in the current frame, the cameras that can see the most recent location of its object are determined by using the FOV lines. Then, a lost_label request message is created to be sent to the appropriate nodes to retrieve the updated object location. The format of a lost_label message is:
[00104] Cmd_tag Target_id Curr_id Lost_label x y.
[00105] In this case, Cmd_tag is a string that holds LOST_LABEL_REQ, indicating that this is a request message for the lost_label case. Target_id and Curr_id are integers. Target_id is the id of the node to which this message is addressed, and Curr_id is the id of the node that processes the input of the camera that needs the location information. Lost_label is another integer which holds the label of the tracker which could not be matched to its target object. Finally, x and y are doubles which are the coordinates of the latest location ( p_m^{(i)} ) of the tracker in the coordinate system of C^i.
[00106] 3) New label reply case: If node j receives a message, and the Cmd_tag of this message holds NEW_LABEL_REQ, then node j needs to send back a reply message. The format of this message is:
[00107] Cmd_tag Temp_label Answer_label Min_pnt_dist.
[00108] In this case, Cmd_tag is a string that holds NEW_LABEL_REP, indicating that this is a reply message to a new_label request. Temp_label and Answer_label are integers. Temp_label is the temporary label given to a new object by the requesting camera, and Answer_label is the label given to the same object by the replying camera. Finally, Min_pnt_dist is the distance between the corresponding location of the sent point and the current location of the object. As stated in Section IV-C.1, the NEW_LABEL_REQ request message has information about the requester id, side, and object coordinates in the requester's coordinate system. Let p_m^{(i)} = (x, y) denote the point sent by node i. When camera node j receives this message from node i, the corresponding location of p_m^{(i)} in the view of C^j is calculated by using (3) as described in Section III-A.1, and this corresponding location is denoted by p_m^{(j)}. If the received Side information is not middle, the FOV line corresponding to this side of the requester camera view is found. Then, the label of the object that has crossed this line, and is closest to p_m^{(j)}, is sent back as the Answer_label. If, on the other hand, the received Side information is middle, then it means that this object appeared in the middle of the scene, for instance from inside of a building. In this case, the FOV lines cannot be used, and the label of the object that is closest to p_m^{(j)} is sent back as the Answer_label. The Min_pnt_dist that is included in the reply message is the distance between p_m^{(j)} and the location of the object that is closest to p_m^{(j)}.
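A sketch of the replier-side handling just described, under the earlier assumptions (map_point() from the FOV-line snippet, trackers stored as dictionaries); crossed_line() and the keying of fov_lines by (requester id, side) are placeholders for bookkeeping that the text does not spell out.

```python
import numpy as np

def answer_new_label_request(req, H_req_to_me, my_trackers, fov_lines):
    """Map the requester's point into this view with (3), restrict the search
    to objects that crossed the FOV line of the reported entry side (unless
    the side is 'middle'), and reply with the closest object's label."""
    pm = np.asarray(map_point(H_req_to_me, (req['x'], req['y'])))
    if req['side'] != 'middle':
        line = fov_lines[(req['curr_id'], req['side'])]
        candidates = [t for t in my_trackers if crossed_line(t, line)]
    else:
        candidates = list(my_trackers)
    if not candidates:
        return None
    dist = lambda t: float(np.linalg.norm(np.asarray(t['pos']) - pm))
    best = min(candidates, key=dist)
    return {'cmd': 'NEW_LABEL_REP', 'temp_label': req['curr_label'],
            'answer_label': best['label'], 'min_pnt_dist': dist(best)}
```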
[00109] The proposed protocol also handles the case where the labels received from different cameras do not match. In this case, the label is chosen so that Min_pnt_dist is the smallest among all the reply messages. 4) Lost label reply case: If node j receives a message from node i, and the Cmd_tag of this message holds LOST_LABEL_REQ, then node j needs to send back a lost_label reply message to node i. The format of this message is: [00110] Cmd_tag Lost_label x_reply y_reply.
[00111] In this case, Cmd_tag is a string that holds LOST_LABEL_REP, indicating that this is a reply message to a lost_label request. Lost_label is an integer which is the label of the tracker in C^i that could not be matched to its target object. When node j receives a lost_label request, it sends back the coordinates of the current location of the tracker with the label Lost_label as x_reply and y_reply. These coordinates are floats, and they are in the coordinate system of C^j. When a reply message is received by node i, the corresponding point of the received location is calculated on the view of C^i as described in Section III-A.1, and the location of the tracker is updated. [00112] D. How to communicate
[00113] The steps so far provide an efficient protocol by reducing both the number of times a message must be sent and the message size. This part of the protocol addresses the issue of handling the communication and processing delays without using a centralized server. This process will henceforth be called the system synchronization. It should be noted that system synchronization is different from camera or input video synchronization as mentioned above.
[00114] The SCCS protocol utilizes non-blocking send and receive primitives for message communication. This effectively allows for a camera node to make its requests, noting the requests it made, and then continuing its processing, with the expectation that the requestee will issue a reply message at some point later in execution. This is in contrast to blocking communication where the execution is blocked until a reply is received for a request. With blocking communication, the potential for parallel processing is reduced, as a camera node may be stuck waiting for its reply, while the processing program will likely require stochastic checks for messages. It is very difficult for each camera to predict when and how many messages will be received from other cameras. In the non-blocking case, checks for messages can take place in a deterministic fashion. Another possible problem with blocking communication is the increased potential for deadlocks. This can be seen by considering the situation where both cameras are making requests at or near simultaneous instances, as neither can process the other node's request while each waits for a reply. [00115] System synchronization ensures the transfer of coherent vision data between cameras. To the best of our knowledge, existing systems do not discuss how to handle communication and processing delays without using blocking communications. Even if the cameras are synchronized or time-stamp information is available, communication and processing delays pose a problem for peer-to-peer camera systems. For instance, if camera C^i sends a message to camera C^j asking for information, it incurs a communication delay. When camera C^j receives this message, it could be a frame behind camera C^i depending on the amount of processing its processor has to do, or it can be ahead of C^i due to the communication delay. As a result, the data received may not correspond to the data appropriate to the requesting camera's time frame. To alleviate this and achieve system synchronization, our protocol provides synchronization points, where all nodes are required to wait until every node has reached the same point. These points are determined based on a synchronization rate which will henceforth be called synch rate. Synchronization points occur every synch rate frames.
[00116] Between two synchronization points, each camera focuses on performing its local tracking tasks, saving the requests that it will make at the next synchronization point. When a new object appears in a camera view, a new label request message is created for this object, and the object is assigned a temporary label. Since a camera node does not send the saved requests, and thus cannot receive a reply until the next synchronization point, the new object is tracked with this temporary label until receiving a reply back. Once a reply is received, the label of this object is updated.
[00117] Typical units of synchronization rate are time-stamp information for live camera input, or a specific frame number for a recorded video. Henceforth, to be consistent, we refer to the number of video frames between each synchronization point when we use the terms synchronization rate or synchronization interval. There is no deterministic communication pattern for vision systems, so it is expected that the camera processors will frequently have to probe for incoming request messages. Although the penalty of probing is smaller than that of a send or receive operation, it is still necessary to decrease the number of probes because of power constraints. In order to decrease the amount of probing, we make each camera probe only when it finishes its local tasks and reaches a synchronization point. [00118] With reference to Fig. 9, there are shown camera states at the synchronization point. Fig. 9 shows a diagram of the system synchronization mechanism. This figure illustrates the camera states at the synchronization point. In the first state, the camera finishes its local tracking, and the processor sends out all of its saved requests. Then, the camera enters the second state and begins to probe to see if a done message has been received from the previous camera. If not, this node probes for incoming requests from the other nodes and replies to them while waiting for the replies to its own requests. When the done message is received from the previous camera, the camera enters the third state. When all of its own requests are fulfilled, it sends out a done message to the next camera. In the fourth state, each camera node still processes requests from other cameras, and keeps probing for the overall done message. Once it is received, a new cycle starts and the node returns back to the first state.
[00119] The done messages in our protocol are sent by using a ring type of message routing to reduce the number of messages. Thus, each node receives a done message only from its previous neighbor node and passes that message to the next adjacent node when it finishes its own local operations and has received replies to all its requests for that cycle. However, based on the protocol, all the cameras need to make sure that all the others have already finished their tasks before starting the next interval. Thus, a single pass of the done message is insufficient. If we have N cameras (C^i, i = 0, ..., N−1), a single pass of the done message will be from C^0 to C^1, C^1 to C^2, and so on. In this case, C^{i−1} will not know whether C^i has finished its task since it will only receive done messages from C^{i−2}. Thus, a second ring pass or a broadcast of an overall done message will be needed. In the current implementation, the overall done message is broadcast from the first camera in the ring since the message is the same for every camera.
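The per-node cycle can be sketched as follows. This is not the actual implementation: the node object with isend/iprobe/recv operations and ring neighbours prev/next is a hypothetical non-blocking transport standing in for the MPI calls, serve_pending() is a placeholder for answering incoming requests and collecting replies, and treating rank 0 as the source of the overall done mirrors the "first camera in the ring" convention above.

```python
def synchronization_cycle(node, saved_requests):
    """One cycle through the four states of Fig. 9 for a single camera node."""
    # State 1: local tracking for this interval is finished;
    # send out all requests saved since the previous synchronization point.
    for req in saved_requests:
        node.isend(req, dest=req['target_id'])
    pending = {req['curr_label'] for req in saved_requests}

    # State 2: answer other nodes' requests and collect replies
    # until the ring `done` arrives from the previous neighbour.
    while not node.iprobe(source=node.prev, tag='done'):
        serve_pending(node, pending)
    node.recv(source=node.prev, tag='done')

    # State 3: once all own requests are fulfilled, pass `done` along the ring.
    while pending:
        serve_pending(node, pending)
    node.isend('done', dest=node.next, tag='done')

    # State 4: keep serving others until the broadcast overall done arrives,
    # then start the next interval.
    while not node.iprobe(source=0, tag='overall_done'):
        serve_pending(node, pending)
    node.recv(source=0, tag='overall_done')
```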
[00120] In this protocol, the synchronization rate can be set by the end user depending on the system specification. Different synchronization rates are desirable in different system setups. For instance, for densely overlapped cameras, it is necessary to have a shorter synchronization interval because an object can be seen by several cameras at the same time, and each camera may need to communicate with other cameras frequently. On the other hand, for loosely overlapped cameras, the synchronization interval can be longer since the probability for communication is lower and as a result, excess communication due to superfluous synchronization points is eliminated.
[00121] 1) Comparison of the number of messages for a server-based scenario and for SCCS: We will continue the discussion in Section II, and compare the server-based system scenario, introduced in Section II, with SCCS in terms of the total number of messages, and the message loads on the server and the SCCS nodes.
[00122] Based on the SCCS protocol discussed in Section IV, the total number of messages that are sent around by SCCS can be determined by using (2). It should be noted that, when calculating M_SCCS, this equation considers the worst possible scenario, where it is assumed that all the cameras in the system have overlapping FOVs, and all the events happen in the overlapping region. In this case, in SCCS, N nodes will send N − 1 request messages to the other nodes for E_i events and will receive N − 1 reply messages, hence the 2 × (N − 1) × Σ_{i=1}^{N} E_i term. At each synchronization point, each node will send a done message to its next neighbor in the ring, and the first node will send an overall done message to N − 1 nodes, hence the S × (2 × N − 1) term.
[00123] Figures 10(a) and 10(c) show the total number of messages sent in the server-based system and SCCS for T = 3 and T = 15, respectively, where T_i is set to be T, ∀i ∈ {1, ..., N}. As can be seen, even with the worst case assumptions for SCCS, the total number of messages sent is less than that of the server-based system. Another very important point to note is that this is the total number of messages. For the server-based system all of these messages go through the server. Whereas, in SCCS, one ordinary node i has to send only S + 2 × (N − 1) × E_i messages and receive 2 × S + (N − 1) × E_i messages. The node j sending the overall done message has to send 2 × (N − 1) × E_j + S × N messages, and receive (N − 1) × E_j + S messages. These numbers are plotted in Figures 10(b) and 10(d) for T = 3 and T = 15, respectively. As can be seen, the number of messages sent or received by the server is much larger than the number of messages sent or received by any node of the SCCS. [00124] V. VERIFICATION OF THE PROTOCOL
[00125] Communicating between nodes in a peer-to-peer fashion and eliminating the use of a centralized server decreases the number of messages sent around, and provides scalability and latency advantages. However, this requires a sophisticated communication protocol which finds use in real-time systems having stringent requirements for proper system functionality. Hence, the protocol design for these systems necessitates transcending typical qualitative analysis using simulation; and instead, requires verification. The protocol must be checked to ensure that it does not cause unacceptable issues such as deadlocks and process starvation, and has correctness properties such as the system eventually reaching specified operating states. Formal verification methods of protocols can be derived from treating the individual nodes of a system as finite state automata. These then emulate communication with each other through the abstraction of a channel.
[00126] SPIN is a powerful software tool used for the formal verification of distributed software systems. It can analyze the logical consistency of concurrent systems, specifically of data communication protocols. A system is described in a modeling language called Promela (Process Meta Language). Communication via message channels can be defined to be synchronous or asynchronous. Given a Promela model, SPIN can either perform random simulations of the system's execution or it can perform exhaustive verification of correctness properties [29]. It goes through all possible system states, enabling designers to discover potential flaws while developing protocols. This tool was used to analyze and verify the communication protocol used in SCCS and described in Section IV.
[00127] To analyze and verify the communication protocol of the SCCS, we first described our system by using Promela. We modeled three different scenarios: (a) a 2-processor system with full communication, where full communication means every processor in the system can send requests and replies to each other, (b) a 3-processor system, where the first processor can communicate with the second and third, the second processor can only communicate with the first, and the third one only replies to incoming requests, and (c) a 3-processor system with full communication. The reason for modeling scenario (b) is clarified below.
[00128] After modeling different scenarios, we first performed random simulations.
With random simulation, every run may produce a different type of execution. In all the simulations of all three scenarios, all the processors of the model terminated properly. However, each random simulation goes through one possible set of states. Thus, an exhaustive search of the state space is needed to guarantee that the protocol is error-free. We performed exhaustive verification of the three different scenarios with different synchronization rates. We also inserted an assertion into the model to ensure that a processor starts a new synchronization interval only if every processor in the system has sent a done message at the synchronization point. All of our three scenarios have been verified exhaustively with no errors. Table I shows the results obtained, where the synch rate is 1 and there are 4 synchronization points; (a), (b) and (c) correspond to the scenarios described above. As can be seen in the table, when three processors are used with full communication, the number of states becomes very high compared to the other scenarios, thus the search requires more memory. Scenario (b) was modeled so that we can compare scenario (c) to (b), and see the increase in the number of states and memory requirement. The total memory usage in the table is the "total actual memory usage" output of the SPIN verification. This is the amount after the compression performed by SPIN, and includes the memory used for a hash table of states.

TABLE I. Comparison of exhaustive verification outputs for synch_rate = 1.
[00129] Fig. 11 shows the number of states reached with the three scenarios and with different numbers of synchronization points. For the 3-processor, full-communication scenario, the number of states increases very fast with an increasing number of synchronization points. Since the memory requirement increases with the number of states, scenario (c) requires the largest amount of memory for verification. In addition, when the synch rate is increased, the number of states increases for the same number of synchronization points, as the requests of the local trackers are saved until the next synchronization point, and then sent out.
[00130] Fig. 11. Number of states reached during the verification of different communication scenarios.
[00131] These results illustrate that, as is well known in the field of communications, verification of complicated protocols is not a straightforward task. Also, careful modeling of large systems having many possible states is very important for exhaustive verification.
[00132] VI. EXPERIMENTAL RESULTS
[00133] A. Camera Setups
[00134] We have implemented SCCS on Linux using PC platforms and Ethernet. This section describes the results of experiments on a 3-camera 3-CPU system. Different types of experiments with different camera setups and video sequences of varying difficulty have been performed by using SCCS and the proposed communication protocol.
[00135] Fig. 12 shows the two different camera setups and the two types of environment states used for the indoor experiments. We formed different environment states by placing or removing occluding structures, for instance a large box in our case, in the environment. As shown in Figures 12(a1) and 12(b1), we placed three cameras in two different configurations in a room. Figures 12(a1) versus 12(a2) and 12(b1) versus 12(b2) illustrate the two different environment states, i.e., scenes with or without an occluding box. As seen in Figures 12(a3), 12(b3) and 12(b4), three remotely controlled cars/trucks have been used to experiment with various occlusion, merge and split cases. We also captured different video sequences by operating one, two or three cars at a time.
[00136] First, the processing times of a single processor system and a distributed multi-camera system incorporating peer-to-peer communication were compared. Fig. 13 shows the speedup attained using our system relative to a uniprocessor implementation for two cases: processing input from two cameras and from three cameras. In the figure, processing times are normalized with respect to the uniprocessor case processing inputs from three cameras, which takes the longest processing time. As can be seen, the uniprocessor approach does not scale very well.
[00137] Fig. 12. (a1) and (b1) show the locations of the cameras for the first and second camera setups, respectively; (a2) and (b2) show the environment states for the lost label experiments. The photographs of the first and second camera setups are displayed in (a3), (b3) and (b4).
[00138] On the uniprocessor, processing the input from three cameras takes 3.57 times as long as processing the input from two cameras. Whereas, in our case, processing inputs from three cameras by using three CPUs takes only 1.18 times as long compared to processing inputs from two cameras by using two CPUs. Hence, it is demonstrated that the execution time required is maintained, without significant increase, while adding the beneficial functionality of an additional camera. In addition, our approach provides 3.37x and 10.2x speedups for processing inputs from two and three cameras, respectively, compared to a uniprocessor system.
[00139] With reference to Fig. 13, there is shown a comparison of the processing times required for processing inputs from two and three cameras by a uniprocessor system and by
SCCS. 2Proc-2Cam and 3Proc-3Cam denote the times required by SCCS.
[00140] B. Waiting time experiments
[00141] In this set of experiments, we measured the average elapsed time between the instance an event occurs and the next synchronization point, where the reply of the request corresponding to this event is received.
[00142] Henceforth, this elapsed time will be referred to as waiting time. For instance, if the synch rate is 10, then the synchronization points will be located at frames 1,11,21,..., 281,291,301... and so on. If a new object appears in a camera's FOV at frame 282, then the waiting time will be 9 frames, as the next synchronization point will be at frame 291.
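The waiting time in this example follows directly from the placement of the synchronization points; a small sketch:

```python
def waiting_time(event_frame, synch_rate):
    """Frames between an event and the next synchronization point, with the
    points located at frames 1, 1 + synch_rate, 1 + 2*synch_rate, ..."""
    offset = (event_frame - 1) % synch_rate
    return 0 if offset == 0 else synch_rate - offset

assert waiting_time(282, 10) == 9   # the example above: next point is frame 291
```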
[00143] Figure 14 shows the average waiting time for experiments performed with different video sequences and different synch_rate values. As can be seen, even when the synch_rate is 60 frames, the average waiting time remains small relative to the synchronization interval.
[00145] Fig. 14. Waiting times for different videos and environment setups; (a), (b) and (c) show the waiting times for the videos captured with indoor setup 1, indoor setup 2 and for the PETS video, respectively.
[00146] C. Accuracy of the data transfer
[00147] In this set of experiments, we measured the accuracy of the data transfer and data updates. This accuracy is determined by the following formula:
accuracy (%) = ( #correct_updates / total number of new_label and lost_label requests ) × 100
[00148] where #correct_updates represents the number of times a new_label or lost_label request is correctly fulfilled and the corresponding tracker is correctly updated (its label or its location). The determined accuracy values are shown in Fig. 15, which plots the average accuracy (%) of the data transfer for the indoor [(a), (b)] and outdoor (c) sequences. As can be seen, for a synch_rate of 1, the system achieves a minimum of 94.2% accuracy for the new_label requests/updates on both indoor and outdoor videos. For the lost_label requests, a minimum of 90% accuracy is achieved for both indoor and outdoor videos with a synch_rate of 1. Further, even when allowing the processors to operate up to 2 seconds without communication, a minimum of 90% accuracy is still attained for new_label requests with indoor sequences, while 90.9% accuracy is obtained for the outdoor sequence. Again, when allowing the processors to operate up to 2 seconds without communication, a level of 80% or higher accuracy is attained for lost_label requests with indoor sequences, while 60% accuracy is obtained for the outdoor sequence.
[00149] Figures 16 and 17 show examples of receiving the label of a new tracker from the other nodes, and updating the label of the tracker in the current view accordingly. For Fig. 16, the synch_rate is 10. As can be seen in Fig. 16(b1), when the car first appears in the view of camera 2, it is given a temporary label of 52, and is tracked with this label until the next synchronization point. Then, the correct label is received from the other nodes in the system and the label of the tracker in the view of camera 2 is updated to be 51, as seen in Fig. 16(b3). Fig. 17 is another example for a synch_rate of 60 for the second camera setup. Again, the label of the tracker, created at frame 1468 and given a temporary label of 56, is updated successfully at frame 1501 from the other nodes in the system. [00150] Figures 18, 19 and 20 show examples of updating the location of a tracker, whose target object is lost, from the other nodes. For Fig. 18, the synch_rate is 5, and the views of the three cameras are as seen in Figures 16(a1), 16(b1) and 16(c1). As seen in Figures 18(a1) through (a10), the location of the car behind the box is updated every 5 frames from the other nodes, until it reappears. Fig. 19 is another example for a synch_rate of 1 for the second camera setup. The location of the tracker is updated from the other nodes at every frame. Figures 19(a1) through (a5) show some example images. Fig. 20 shows an example where the location of people occluded in an outdoor sequence is updated. [00151] Figures 21 and 22 show examples of SCCS dealing with the merge/split cases on a single camera view for indoor and outdoor videos, respectively. The accuracy of giving the correct labels to objects after they split is displayed in Fig. 23.
[00152] Figure 24 shows the number of new_label and lost_label requests for different synchronization rates for the video captured by the first camera setup with the box placed in the environment. As expected, with a synch_rate of 1, a lost_label request is sent at each frame as long as the car is occluded behind the box. Thus, the number of lost_label requests is highest for the synch_rate of 1, and decreases with increasing synch_rate. [00153] VII. CONCLUSIONS
[00154] This paper has presented the Scalable Clustered Camera System, which is a peer-to-peer multi-camera system for multiple object tracking. Each camera is connected to a CPU, and individual nodes communicate with each other directly, eliminating the need for a centralized server. Instead of transferring control of tracking jobs from one camera to another, each camera in the presented system keeps its own tracks for each target object, which provides fault tolerance. A fast and robust tracking algorithm was proposed to perform tracking on each camera view, while maintaining consistent labeling.
[00155] Peer-to-peer systems require sophisticated communication protocols that can handle communication and processing delays. These protocols need to be evaluated and verified against potential deadlocks, and their correctness properties need to be checked. We introduced a novel communication protocol designed for peer-to-peer vision systems, which can handle the communication and processing delays. The reasons for processing delays include heterogeneous processors, different loads at different processors, and instruction and task scheduling within each node's processing unit. The protocol presented in this paper incorporates variable synchronization capabilities. Moreover, compared to server-based systems, it considerably decreases the number of messages that a single node has to handle as well as the total number of messages that need to be sent. We then analyzed and exhaustively verified this protocol, without any errors or redundancies, by using the SPIN verification tool.
[00156] Video sequences with varying levels of difficulty have been captured by using different camera setups and environment states. Different experiments were performed to obtain the speedup provided by SCCS, and to measure the average data transfer accuracy and the average waiting time. Experimental results demonstrate the success of the SCCS, with high data transfer accuracy rates.
[00157] It is to be understood that the exemplary embodiments are merely illustrative of the invention and that many variations of the above-described embodiments may be devised by one skilled in the art without departing from the scope of the invention.

Claims

CLAIMS

What is claimed is:
1. A method for tracking one or more objects with a plurality of cameras, the method comprising the steps of: finishing, by a first camera, its local tracking; sending, by a processor, all of its saved requests; probing, by the first camera, to determine if a done message has been received from a second camera; if no done message is received, probing for requests from other cameras; replying to the requests from the other cameras; upon receipt of a done message from a second camera, determining if the first camera's done requests are fulfilled; and sending a done message to a third camera.
PCT/US2007/071501 2006-06-16 2007-06-18 Scalable clustered camera system and method for multiple object tracking WO2007147171A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US81444606P 2006-06-16 2006-06-16
US60/814,446 2006-06-16

Publications (2)

Publication Number Publication Date
WO2007147171A2 true WO2007147171A2 (en) 2007-12-21
WO2007147171A3 WO2007147171A3 (en) 2008-11-06

Family

ID=38832943

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/071501 WO2007147171A2 (en) 2006-06-16 2007-06-18 Scalable clustered camera system and method for multiple object tracking

Country Status (1)

Country Link
WO (1) WO2007147171A2 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5659787A (en) * 1995-05-26 1997-08-19 Sensormatic Electronics Corporation Data communication network with highly efficient polling procedure
US20030179294A1 (en) * 2002-03-22 2003-09-25 Martins Fernando C.M. Method for simultaneous visual tracking of multiple bodies in a closed structured environment
US20050057653A1 (en) * 2002-05-07 2005-03-17 Matsushita Electric Industrial Co., Ltd. Surveillance system and a surveillance camera

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2008200965B2 (en) * 2008-02-28 2010-02-18 Canon Kabushiki Kaisha Network Surveillance Systems
US20110317011A1 (en) * 2010-06-23 2011-12-29 Canon Kabushiki Kaisha Transmission apparatus and transmission method
US9124779B2 (en) * 2010-06-23 2015-09-01 Canon Kabushiki Kaisha Transmission apparatus and transmission method
US8600157B2 (en) 2010-08-13 2013-12-03 Institute For Information Industry Method, system and computer program product for object color correction
EP2615830A1 (en) * 2012-01-11 2013-07-17 Adec Technologies AG Method to track an object in a defined space
US10482612B2 (en) * 2016-06-30 2019-11-19 Nissan Motor Co., Ltd. Object tracking method and object tracking apparatus
CN113313734A (en) * 2021-05-21 2021-08-27 武汉工程大学 Moving target tracking method based on linear extrapolation prediction
CN113313734B (en) * 2021-05-21 2022-07-19 武汉工程大学 Moving target tracking method based on linear extrapolation prediction
CN116168062A (en) * 2023-04-21 2023-05-26 深圳佑驾创新科技有限公司 3D target tracking method and device
CN116168062B (en) * 2023-04-21 2023-09-29 深圳佑驾创新科技股份有限公司 3D target tracking method and device

Also Published As

Publication number Publication date
WO2007147171A3 (en) 2008-11-06

Similar Documents

Publication Publication Date Title
Ermis et al. Activity based matching in distributed camera networks
Remagnino et al. Distributed intelligence for multi-camera visual surveillance
US7583815B2 (en) Wide-area site-based video surveillance system
WO2007147171A2 (en) Scalable clustered camera system and method for multiple object tracking
Ukita et al. Real-time cooperative multi-target tracking by communicating active vision agents
Devarajan et al. Calibrating distributed camera networks
Radke A survey of distributed computer vision algorithms
JP2008544705A (en) Detect and track surveillance objects from overhead video streams
Zhou et al. Optimization of wireless video surveillance system for smart campus based on internet of things
Jain et al. Panoptes: Servicing multiple applications simultaneously using steerable cameras
Karakaya et al. Distributed target localization using a progressive certainty map in visual sensor networks
CN114639032A (en) Vehicle detection tracking method, device and equipment for quasi-real-time digital twin display
US10713913B2 (en) Managing copies of media samples in a system having a plurality of interconnected network cameras
Rao et al. Real-time speed estimation of vehicles from uncalibrated view-independent traffic cameras
Karakaya et al. Collaborative localization in visual sensor networks
Velipasalar et al. A scalable clustered camera system for multiple object tracking
Fehr et al. Counting people in groups
Hayet et al. A modular multi-camera framework for team sports tracking
Wang et al. Distributed wide-area multi-object tracking with non-overlapping camera views
Gruenwedel et al. Decentralized tracking of humans using a camera network
Lin et al. System and software architectures of distributed smart cameras
Jabbar et al. VISTA: achieving cumulative VIsion through energy efficient Silhouette recognition of mobile Targets through collAboration of visual sensor nodes
Greenhill et al. Learning the semantic landscape: embedding scene knowledge in object tracking
Leotta et al. PLaTHEA: a marker‐less people localization and tracking system for home automation
Erdem et al. Event prediction in a hybrid camera network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07812190

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07812190

Country of ref document: EP

Kind code of ref document: A2