US20140254875A1  Method and system for automatic objects localization  Google Patents
 Publication number
 US20140254875A1 (application No. 14/267,598)
 Authority
 US
 United States
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Abandoned
Classifications

 G06T7/0044—

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scenespecific objects
 G06K9/00771—Recognising scenes under surveillance, e.g. with Markovian modelling of scene activity

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scenespecific objects

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scenespecific objects
 G06K9/00771—Recognising scenes under surveillance, e.g. with Markovian modelling of scene activity
 G06K9/00778—Recognition or static of dynamic crowd images, e.g. recognition of crowd congestion

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scenespecific objects
 G06K9/00785—Recognising traffic patterns acquired by static cameras

 G06T7/0079—

 H—ELECTRICITY
 H04—ELECTRIC COMMUNICATION TECHNIQUE
 H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
 H04N5/00—Details of television systems
 H04N5/222—Studio circuitry; Studio devices; Studio equipment ; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, TV cameras, video cameras, camcorders, webcams, camera modules for embedding in other devices, e.g. mobile phones, computers or vehicles
 H04N5/225—Television cameras ; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, camcorders, webcams, camera modules specially adapted for being embedded in other devices, e.g. mobile phones, computers or vehicles
 H04N5/247—Arrangements of television cameras

 H—ELECTRICITY
 H04—ELECTRIC COMMUNICATION TECHNIQUE
 H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
 H04N5/00—Details of television systems
 H04N5/30—Transforming light or analogous information into electric information
 H04N5/33—Transforming infrared radiation

 H—ELECTRICITY
 H04—ELECTRIC COMMUNICATION TECHNIQUE
 H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
 H04N7/00—Television systems
 H04N7/18—Closed circuit television systems, i.e. systems in which the signal is not broadcast

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
 G06T2207/00—Indexing scheme for image analysis or image enhancement
 G06T2207/10—Image acquisition modality
 G06T2207/10004—Still image; Photographic image

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
 G06T2207/00—Indexing scheme for image analysis or image enhancement
 G06T2207/10—Image acquisition modality
 G06T2207/10016—Video; Image sequence

 G06T2207/20144—

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
 G06T2207/00—Indexing scheme for image analysis or image enhancement
 G06T2207/30—Subject of image; Context of image processing
 G06T2207/30196—Human being; Person

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
 G06T2207/00—Indexing scheme for image analysis or image enhancement
 G06T2207/30—Subject of image; Context of image processing
 G06T2207/30232—Surveillance
Abstract
A method for automatic localization of objects in a mask. The method includes building a dictionary of atoms, wherein each atom models the presence of one object at one location, and iteratively determining the atom of said dictionary which is best correlated with said mask, until ending criteria are met. The invention also concerns a system for automatically detecting objects in a mask. At least one fixed camera is provided for acquiring video frames. A computation device is used for calibrating the at least one fixed camera, for extracting foreground silhouettes in each acquired video frame, for discretizing the ground plane into a non-regular grid of potential location points, for constructing a dictionary of atoms, and for finding object location points with the above method. A propagating device is provided to propagate the result in the view of the at least one fixed camera.
Description
 This application is a continuation of U.S. patent application Ser. No. 12/779,547, filed May 13, 2010, the disclosure of which is hereby incorporated by reference herein.
 The present invention concerns a method and a system for automatic localization and/or detection and/or identification and/or singulation and/or segmentation and/or computation and/or tracking of objects. More precisely, objects are automatically detected and localized on the ground and in the image planes of a network of cameras or even a single camera setup. Objects can be people in a crowded environment.
 Accurate vision-based people detection has been of interest for the past decades in many applications. A number of surveillance applications, for example, require the detection of people to ensure security, safety and site management. Examples include behavior analysis and the monitoring of entry points, bus terminals and train stations in order to know, for example, whether people stay in a forbidden area.
 Another example is safety in an urban environment: knowledge of pedestrians' positions can, for example, be used to prevent collisions with cars. For many applications it is also useful to automatically count pedestrians in a specific area.
 People detection can also be applied to sports game analysis: the detection of football or basketball players can be fed back to the coach in order to improve team play. Another possible application concerns the recognition of the numbers worn by athletes during a race.
 The automatic localization of objects other than people can be used for analyzing and classifying biomedical cells in a laboratory or for counting industrial objects, cars, etc. Any application requiring the localization of an object with a specific shape in an environment is imaginable.
 The main difficulty in the automatic localization of objects comes from the mutual occlusions of these objects in a group. For instance, in a sports game such as basketball, players can strongly occlude each other and change behavior abruptly.
 The most effective techniques in the related art for detecting moving people or objects occluding each other are highly complex and may thus require the use of costly hardware. Moreover, they may not be fast enough to perform real-time detection and localization of people or objects within a crowd. Often they use tracking information to deal with occlusions.
 EP2131328 concerns a method for automatic detection and tracking of multiple objects. The video data obtained from one or more cameras are processed in two steps: the first, fast step comprises an indexing process, which generates estimated hypotheses for the location and attributes of objects. The hypotheses are then refined to generate statistical models for the objects' appearance and geometry, and then applied for discriminative tracking, during which object locations and attributes are updated by using online uncertainty estimation. Each person is tracked by a combination of two kernels, one for the head and one for the torso. When there is an occlusion and the system predicts, for example, that the torso of a person will be occluded by another person, the system lowers the weight of this torso kernel and reduces its influence in the tracking algorithm. The method is fast only during the first step and requires complex computing, based on online uncertainty estimation and tracking information. Moreover, estimates can be improved by using multiple cameras.
 US2003/0107649 concerns a method for detecting and tracking groups of socially interrelated people at a checkout line in a store, in order to permit an additional checkout line to be opened when the number of groups in the observed checkout line exceeds a predetermined value. Groups are determined by analyzing inter-person distance over time. People silhouette detection using background subtraction and person segmentation then allows individual persons to be distinguished. The method is based on two kinds of information: temporal constraints (i.e., people belonging to the same shopping group enter and exit the scene together) and global motion (i.e., people in a group show similar movement patterns during checkout).
 US2009/0304265 concerns a method for modeling a three-dimensional object by using two-dimensional images of the object from multiple cameras. The surface of the object is computed from estimated slices of the object that lie in parallel planes cutting through the object. The greater the number of views, i.e. the number of cameras, the more accurate the reconstruction of the object is. The greater the number of parallel planes, the more robust the method is. This method does not work with a single camera with fixed orientation. Such an approach is computationally intensive.
 US2009/0002489 concerns a method for object tracking. For each of a plurality of human objects, a person's model comprising at least one feature of the person, e.g. a color feature, is created and dynamically updated. An occlusion disambiguation algorithm is performed by using the image of a blob, based on the previous tracks and the person's model. More precisely, a person's model is selected from the set of models and each selected person's model is matched to the blob and scored. The model with the best score is then removed from the list of the models, according to a greedy algorithm. This invention makes it possible to track objects efficiently through occlusions because it uses information from previous tracks, but it does not permit localization of the objects without temporal information.
 US2008/0123968 concerns a human detection and tracking system including a full body detector and a plurality of part detectors, each for a specific part of the human body (for example head, torso and legs). A combiner detector combines the detectors' responses and generates a combined detection response. It is further configured to model inter-occlusions of the humans in the image, and to implement a greedy matching algorithm to perform the match between the detectors' responses and the body parts of humans. The human detection problem is formulated as a maximum a posteriori (MAP) estimation problem. This invention requires at least four detectors. Moreover, the part detectors have to be learned.
 US2009/0034793 concerns a method for performing crowd segmentation by using an indexing step, which produces a quick approximate result followed, when desired, by an estimation step that further refines the approximate result. During the indexing step, the foreground silhouette shape is matched with a set of foreground silhouette shapes, for each of which the number and the position of human subjects is known. During the estimation step, an MCMC (Markov Chain Monte Carlo) method is used. The method requires the construction of a lookup table containing the set of foreground silhouette shapes and, for each shape, the number and position of the human subjects forming it. Moreover, the higher the accuracy of representation of the foreground silhouette shape to be matched, the more complex the calculations are.
 US2008/0118106 concerns a method and a system for crowd counting and monitoring. The invention uses a global description of a group of people: rather than looking for individuals, it uses the area of the entire group as the cue to estimate the number of people. The effect of occluded members is mitigated by treating the crowd as a whole group for each frame in an image sequence, and by maintaining a history of the estimates for each blob throughout the lifetime of the blob so that the most frequently occurring value is chosen at any given frame.
 Isolated people, in an uncluttered scene, can be detected with a single static or moving camera, based on pattern recognition techniques. A set of features can be extracted from a large number of training samples to train a classifier. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Int'l Conference on Computer Vision and Pattern Recognition, 2005, pp. I: 886-893, use histograms of oriented gradients as the set of features, and O. Tuzel, F. Porikli, and P. Meer, “Human detection via classification on Riemannian manifolds,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1713-1727, 2008, use covariance matrices as the set of features together with a boosting approach.
 Given a fixed camera, a moving object can also be detected by modeling the background; tracking then becomes simply an object correspondence across the frames. Typically, the work of Stauffer and Grimson, “Adaptive background mixture models for real-time tracking,” Proc. IEEE Int'l Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 246-252, 1999, can be used to extract the foreground pixels. Each pixel is modeled as a mixture of Gaussians with an online approximation for the update. Then, detected people can be tracked using standard approaches. Porikli, “Achieving real-time object detection and tracking under extreme conditions,” Journal of Real-Time Image Processing, vol. 7, no. 1, pp. 33-40, 2006, presents a survey of object detection and tracking methods given a single fixed camera.
 Current approaches fail to robustly isolate different persons in the image of a group of people taken by one camera due to their mutual occlusions.
 In order to deal with a dense spatial distribution of people and their mutual occlusions, the outputs of several cameras could be used to detect the objects of interest. However, current multi-view approaches do not robustly segment a crowd of people. S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 505-519, 2009, warp the foreground silhouettes of all the camera views into a common reference and segment the feet region of people.
 F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multi-camera people tracking with a probabilistic occupancy map,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267-282, 2008, take advantage of a multi-view infrastructure to accurately track people across cameras given foreground silhouettes. A mathematical framework is developed for estimating the probabilities of occupancy of a ground plane given the foreground silhouettes. The occupancy probabilities are approximated as the marginals of a product law minimizing the Kullback-Leibler divergence from the true conditional posterior distribution (referred to as the Fixed Point Probability Field algorithm). This mathematical framework leads to a potentially high false-positive rate. In addition, the computation cost of the algorithm depends on the number of ground plane points to be evaluated, which limits the area to be monitored. Moreover, this algorithm does not work in real time.
 According to the state of the art, there is a need for a real-time method for localizing and/or detecting and/or identifying and/or segmenting and/or computing and/or tracking objects that is robust to noise, to possible occlusions and to abrupt changes. In addition, to avoid buffering delay, a method that does not require tracking information to deal with occlusions is preferable. A method which allows single-frame localization is also needed.
 According to the state of the art, there is also a need for a simple, fast and low-cost method for detecting/localizing objects which does not impose any constraint on the scene surface to be monitored.
 The aim of the present invention is to provide a method and a system for automatic localization of objects that is free from the limitations of the prior art.
 One object of the present invention is to provide a method and a system for detecting/localizing objects, robust to noise, to possible occlusions of objects and to abrupt changes of the scene.
 Another object of the present invention is to provide a real-time method for detecting objects, which quickly identifies the objects' locations in the presence of noise and mutual occlusions.
 Another object of the present invention is to provide a method and a system for object detection/localization, versatile with respect to the shape of the objects to be detected/localized.
 Another object of the present invention is to provide a method and a system which does not impose any constraint on the scene surface to be monitored.
 Another object of the present invention is to provide a method and a system for detecting/localizing objects, which is robust even when objects have similar appearance.
 Another object of the present invention is to provide a method and a system which does not require large memory storage.
 Another object of the present invention is to provide a method and a system for detecting objects, which is scalable to any number of the monitoring cameras including a single camera setup.
 Another object of the present invention is to provide a method and a system for detecting objects that is generic to any scene of objects and sensing modality.
 Another object of the present invention is to provide a method and a system for detecting objects that is versatile with respect to heterogeneous camera networks, i.e. able to merge specific camera geometries such as planar and omnidirectional cameras, and to handle cameras with and/or without overlapping fields of view.
 According to the invention, these aims are achieved by means of a method for automatic localization of objects according to claims 1 and 38, by a computer-program product for signal processing according to claim 41 and by a system for automatically detecting objects on the ground plane and in the image plane of at least one fixed camera according to claim 42.
 According to one aspect, the method is based on the a priori assumption that in many practical images objects are sparse. By introducing this sparsity constraint, a new greedy algorithm can be developed for fast localization, detection, identification, singulation, segmentation, computation and tracking of objects. A greedy algorithm is an approximation algorithm for optimization problems which works iteratively. At each iteration, it locally optimizes the objective function of the problem, in the hope of finding the global optimum.
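 The greedy iteration described above can be illustrated by the following minimal sketch, written in the spirit of matching pursuit. It is not the claimed method: the function name `greedy_localize`, the normalized-overlap score, the threshold `min_score` used as ending criterion and the step that clears already explained pixels are all assumptions made for the example.

```python
import numpy as np

def greedy_localize(D, y, max_objects=10, min_score=0.5):
    """Greedily pick the dictionary columns (atoms) that best match mask y.

    D: (M, N) binary array, one atom per column (hypothetical layout).
    y: (M,) binary mask.
    Returns the list of selected atom indices, in selection order.
    """
    D = np.asarray(D, dtype=float)
    residual = np.asarray(y, dtype=float).copy()
    atom_sizes = np.maximum(D.sum(axis=0), 1.0)  # avoid division by zero
    used = np.zeros(D.shape[1], dtype=bool)
    selected = []
    for _ in range(max_objects):
        # Fraction of each atom's pixels still present in the residual mask.
        scores = (D.T @ residual) / atom_sizes
        scores[used] = -np.inf  # each location is selected at most once
        best = int(np.argmax(scores))
        if scores[best] < min_score:  # ending criterion (assumed)
            break
        selected.append(best)
        used[best] = True
        # Locally optimal step: drop the pixels explained by the chosen atom.
        residual[D[:, best] > 0] = 0.0
    return selected
```

 Each iteration only looks for the single best atom given the current residual, which is what makes the procedure greedy; nothing in this sketch guarantees that the global optimum is reached.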
 The advantages of the greedy algorithm, compared with the prior art, include in particular the possibility of reducing the time required for localizing objects. In one embodiment, objects can be localized on the ground plane and in the image plane of at least one camera. Moreover, this method allows a robust localization of these objects in the presence of noise and mutual occlusions.
 Advantageously, the method according to the invention is versatile with respect to the shape of the objects to be localized. It does not impose any constraint on the scene surface to be monitored.
 Advantageously the method for detecting/localizing objects according to the invention is simple, fast and robust to objects with similar appearance and to abrupt changes.
 In one possible embodiment, the method at each new frame takes into account localizations of objects detected and localized in previous frames (objects tracking).
 Advantageously the system for detecting objects according to the invention is scalable to any number of cameras, including a single camera setup, and versatile with respect to heterogeneous camera networks. It does not require large memory storage.
 The invention will be better understood with the aid of the description of an embodiment given by way of example and illustrated by the figures, in which:

FIG. 1 shows a view of one embodiment of a system for detecting people according to the invention, comprising three cameras and one image processing system.
FIG. 2 shows an image plane of a camera and three foreground silhouettes.
FIG. 3 shows the approximate correspondence between each point of the ground and a silhouette modeling the presence of an object, here a person, in a camera view.
FIG. 4 shows three atoms modeling the foreground silhouettes of FIG. 2.
FIG. 5 is an overview of the adaptive sampling process.
FIG. 6a shows sample points given a regularly spaced grid (prior art).
FIG. 6b shows sample points given a non-regular grid according to the invention.
FIG. 7a shows an example of five views of cameras with overlapping fields of view.
FIG. 7b shows the corresponding degraded silhouettes. The grid is only for visual purposes.
FIG. 7c shows atoms modeling the given foreground silhouettes. The grid is only for visual purposes.
FIG. 1 illustrates one embodiment of a system for detecting/localizing objects according to the invention. It comprises one or several cameras 2. The cameras may be either analog or digital. If an analog device is used, the image is digitized with means known by the man skilled in the art. The cameras 2 can be planar and/or omnidirectional. In one embodiment IR cameras can be used. The cameras 2 can have overlapping fields of view. Moving cameras can also be used instead of or in addition to static cameras.
 In the example of FIG. 1, there are three static cameras 2. When at least two cameras are used, they are pseudo-synchronized in order to acquire images quasi-simultaneously. In this application, the expression “pseudo-synchronized cameras” means cameras that acquire images that are not precisely simultaneous but that can be considered as simultaneous, i.e. differing by a delay of a few frames or a few milliseconds. Cameras 2 are connected to an image processing system 8 via a link 6. Link 6 may be any link capable of transferring images known in the art, such as a video cable or a wireless transmission link. One end of the link 6 is connected to each camera 2, while the other end is connected to the image processing system 8. The image processing system 8 may be any type of system or device capable of executing an algorithm for interpreting the images taken by cameras 2. For example, the image processing system 8 can be a PC, a laptop or a processing chip integrated into an instrument panel or the like, with suitable image processing software. The image processing system 8 can be connected to peripheral equipment such as recording devices, external communication links and power sources.
 First the system is calibrated offline by standard techniques. As an example, the camera model proposed by J. Kannala and S. S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1335, 2006, can be used to map 3D points in the scene to the image plane 20 of all cameras 2.
 On the ground plane 10 there are one or more objects 1. In one embodiment these objects are moving people. In another embodiment objects constitute a dense set of people, for example basketball players. Players can strongly occlude each other or have abrupt changes of behavior.
 The number of objects, such as people, present in the scene is sparse because for example it is limited by the capacity of the room or of the play area.
 The camera or cameras 2 acquire still or video images, and foreground silhouettes 22 are extracted from those images, as shown in FIG. 2. The extraction process thus produces masks, in which extracted foreground features are represented by a nonzero value while background pixels are represented by zeros, or by any other binary values. In one embodiment multi-valued masks can be used, where the value corresponds for example to a probability of having a foreground feature, or to a color or brightness of the feature. In another embodiment the extraction process produces binary masks. Up to the selection of an appropriate background subtraction method, at a given time, each camera 2 in the set of C calibrated cameras (where C is a positive integer that can be equal to one) is the source of one mask, for example a binary mask $y_c \in \{0,1\}^{M_c}$, where $M_c$ is the number of pixels, i.e. the resolution, of the camera indexed by $1 \le c \le C$. The 2D silhouette image of each camera can be represented as a 1D vector $y_c$ by the concatenation of its columns. Further, all these vectors $y_c$ are stacked into the Multi-Silhouette Vector (MSV) $y$ given by

$$y = (y_1^T, \ldots, y_C^T)^T \in \{0,1\}^M,$$

 where

$$M = \sum_{i=1}^{C} M_i.$$

 To simplify notations, the invention often refers to 2D objects or images as 1D vectors, i.e. the vectors obtained for instance by the concatenation of the columns of these 2D objects or images.
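 As an illustration of this stacking, the following sketch flattens each per-camera mask column by column and concatenates the results into one vector. The function name `multi_silhouette_vector` and the toy masks are hypothetical and serve only the example.

```python
import numpy as np

def multi_silhouette_vector(masks):
    """Stack per-camera 2D binary masks into a single 1D vector.

    Each mask is flattened column by column (Fortran order), then the
    per-camera vectors are concatenated in camera order.
    """
    columns = [np.asarray(m, dtype=np.uint8).flatten(order="F") for m in masks]
    return np.concatenate(columns)

# Two toy cameras with different resolutions (4 pixels and 3 pixels).
mask_cam1 = np.array([[0, 1],
                      [1, 1]])
mask_cam2 = np.array([[1, 0, 0]])
y = multi_silhouette_vector([mask_cam1, mask_cam2])
```

 The length of `y` is the sum of the per-camera pixel counts, so cameras with different resolutions can be mixed.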
 The background/foreground extraction may be performed by applying any appropriate algorithm known by the man skilled in the art to the video frames.
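 By way of a deliberately simple illustration, a binary mask can be obtained by thresholding the difference between a frame and a static background image. This is only a stand-in for the extraction step and not the mixture-of-Gaussians model cited earlier; the function name `extract_foreground_mask` and the default `threshold` value are assumptions.

```python
import numpy as np

def extract_foreground_mask(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    frame, background: grayscale images of identical shape.
    threshold: minimum absolute intensity difference (assumed value).
    """
    frame = np.asarray(frame, dtype=np.int32)  # signed type avoids wrap-around
    background = np.asarray(background, dtype=np.int32)
    return (np.abs(frame - background) > threshold).astype(np.uint8)
```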
 In practice, the extracted foreground silhouettes 22 suffer from two flaws. First, a single silhouette in the binary mask can correspond to several people or objects close to each other. Second, silhouettes usually contain many false-positive pixels (e.g. shadows, reflections) and false-negative ones (i.e. missing foreground pixels). For example, the shape of the silhouettes 22 shown in FIG. 2 can correspond to one, two or three people.
 The continuous ground plane is discretized into a 2D grid comprising sample points, for example and preferably into a non-regular or uneven grid of N subregions (cells). It is assumed that each cell can be occupied by only one object at each time instance. Therefore, the whole image can be represented by a binary vector $x$ of N elements. The grid is two-dimensional but is represented by a vector to simplify the notations. The elements of the vector $x \in \{0,1\}^N$ indicate the presence of objects in the corresponding cells: an object is present in the cell identified by the index i if and only if $x_i = 1$. This vector is called the “occupancy vector”.
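 The occupancy vector can be sketched as follows. The function name `occupancy_vector` is hypothetical, and the cells are assumed to be numbered from 0 to N-1.

```python
import numpy as np

def occupancy_vector(n_cells, occupied_cells):
    """Build x in {0,1}^N with x[i] = 1 iff an object occupies cell i."""
    x = np.zeros(n_cells, dtype=np.uint8)
    for i in occupied_cells:
        x[i] = 1  # at most one object per cell, so the entry stays binary
    return x

# Two objects on a grid of six cells: cells 1 and 4 are occupied.
x = occupancy_vector(6, [1, 4])
```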
 In one embodiment, an adaptive discretization of the ground is used in order to reduce the search space. As will be described, instead of sampling the ground by a uniform, regularly spaced grid, a non-regular sampling process is used. It is adaptively built as a function of the cameras' topology and of the scene activity.
 An average object or person 1 with a given volume, located in a part of the scene, corresponds to an occupancy vector $x$ with only one nonzero element, depending on its location. If $x$ contains one nonzero component, each of the cameras 2 will capture only one silhouette 22, i.e. a connected area of nonzero pixels in the image plane, with size and location related to the particular projective geometry combining the scene and the cameras. The occupancy vector $x$ can also contain more than one nonzero element.
 The following “forward model” describes the underlying correspondence between each occupancy vector and its resulting multi-silhouette vector (MSV). It maps the occupancy vector $x \in \{0,1\}^N$ to a certain configuration of occluding silhouettes in the camera images, by using a matrix $D \in \{0,1\}^{M \times N}$, called the “dictionary”. The dictionary is composed of N columns (atoms) 30, each of which represents an approximation of the silhouette of an average-size object or person at a possible location on the image or ground surface. By such a construction, the dictionary relates the non-empty locations of the occupancy vector $x$ (positions occupied by objects) to an approximation of the multiple silhouettes $y$ viewed by the cameras 2, through a nonlinear mapping.
 In other words, each atom 30 approximates the silhouette 22 generated by a single object or person 1, at a given location 12, in all the camera views 20. For example, if a camera's view comprises w×h pixels, a single object or person 1 corresponds to a vector with w×h binary elements, i.e. each element takes zero or a nonzero value, depending on the shape of the object/person and its position in the scene. In one embodiment this vector comprises zeros and ones. This vector represents a column of the dictionary D. For each ground position, i.e. for each sample point, given a camera 2, a column of the dictionary D is formed. In the case of more than one camera 2, the columns corresponding to the other cameras' views are concatenated vertically under the previous columns. The columns of D, i.e. the atoms, thus live in the same space as the observed Multi-Silhouette Vector (MSV), i.e. in a space of M dimensions.
 Each column (atom) of D, say d_{i}, indicates the corresponding approximated ideal MSV of an average object or person at position i in the scene. Practically, each of the atoms d_{i} is generated by a computing process, thanks to the homographies mapping points in the 3D scene to their 2D coordinates in the camera views during the calibration step. The generation of each atom depends on the position, the zoom, the focus and the resolution of at least one camera 2.
 As discussed, the dictionary D∈{0,1}^{M×N }is a matrix, wherein the number of rows M corresponds to the resolution of the cameras 2. In one embodiment each atom represents only one possible shape per object; in this case the number of columns N of the dictionary D corresponds to the number of sampling points of the scene. In another embodiment, more than one atom represents an object/person at each location, depending on its possible different shapes: for example, one atom for a standing person and another for a sitting person, at a certain location. In this case, the number of columns N of the dictionary D corresponds to the number of the ground sampling points multiplied by the number of possible objects'/persons' shapes.
 The foreground silhouette 22 of an object or a person 1 is approximated by an atom with simple shapes. For example, in the case of a standing person, a rectangular or an elliptical shape can be used. To cope with the various poses and shapes that a person can generate in a camera view, a half-cylinder, half-spherical shape can be used to approximate the silhouette 22 of a standing person 1 in the views 20.
 Mathematically, the “forward model” relates the MultiSilhouette Vector y to the occupancy vector x and to the dictionary D (made of atoms representing single objects' approximated ideal multiview silhouettes), by the formula

y=D·x⊕z (1)

 where z∈{0,1}^{M} is the noise vector that corrupts the MSV by both missing and extra foreground pixels. This may occur for several reasons, e.g. non-ideal silhouette extraction, non-ideal modeling of the dictionary atoms, shadows, reflections, etc.
 The operations in equation (1) are Boolean, i.e. the matrix multiplication in D·x differs from the conventions in linear algebra by substituting the sum and the product of the matrices elements by nonlinear Boolean operators OR and AND, respectively. The symbol ⊕ denotes bitwise XOR operation between two Boolean vectors. The formula (1) demonstrates a nonlinear mapping between x and y.
 Instead of using Boolean operators, the forward model can alternatively be formulated by applying a quantization operator Q:R^{N}→{0,1}^{N} to the conventional linear algebra matrix multiplication D×x, where (Q[v])_{i}=1 if v_{i}≠0 and 0 otherwise.
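 The equivalence between the two formulations of the forward model can be illustrated with a minimal sketch. The function names and the toy dictionary below are illustrative assumptions, not part of the described system; vectors are plain Python lists of 0/1 values.

```python
def boolean_product(D, x):
    """OR-AND product: OR together the atoms (columns of D) selected by x."""
    return [int(any(D[r][c] and x[c] for c in range(len(x))))
            for r in range(len(D))]


def quantized_product(D, x):
    """Ordinary matrix-vector product followed by Q: nonzero entries become 1."""
    return [int(sum(D[r][c] * x[c] for c in range(len(x))) != 0)
            for r in range(len(D))]


def forward_model(D, x, z):
    """y = D.x XOR z, with z a Boolean noise vector of length M."""
    return [a ^ b for a, b in zip(boolean_product(D, x), z)]


# Toy dictionary (assumed values): M = 4 pixels, N = 3 candidate locations.
D = [[1, 0, 0],
     [1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
x = [1, 1, 0]        # locations 1 and 2 are occupied
z = [0, 0, 0, 1]     # noise adds one spurious foreground pixel
y = forward_model(D, x, z)
```

 In this toy case the second pixel is covered by two atoms; the Boolean product and the quantized product both map it to 1, which is the occlusion behavior the nonlinear model is meant to capture.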
 The dictionary D can also be considered as the concatenation of all the subdictionaries D_{c}∈{0,1}^{M_{c}×N}, where D_{c} is the index restriction of the atoms of D to the pixel range of each camera c, for 1≤c≤C, and M_{c} denotes the number of pixels of camera c. Therefore

D=(D_{1}^{T}, . . . ,D_{C}^{T})^{T}

 meaning implicitly that there is no theoretical constraint on the number or on the type of cameras used, e.g. planar or omnidirectional.
 Practically, the atoms of each D_{c }are generated thanks to
 (i) the homographies mapping points in the 3D scene to their 2D coordinates in the planar view during the calibration step
 (ii) the approximation of the silhouettes by simple shapes (e.g. rectangular or elliptical shapes).
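 The two ingredients above can be sketched as follows. This is a simplified illustration under stated assumptions: each camera is described by a hypothetical dictionary with keys `H`, `width`, `height`, `box_w` and `box_h`, the homography maps ground coordinates directly to pixel coordinates, and the rectangle has a fixed pixel size per camera, whereas a real system would scale it with the distance to the camera.

```python
def project(H, X, Y):
    """Apply a 3x3 homography H (nested lists) to the ground point (X, Y)."""
    u = H[0][0] * X + H[0][1] * Y + H[0][2]
    v = H[1][0] * X + H[1][1] * Y + H[1][2]
    s = H[2][0] * X + H[2][1] * Y + H[2][2]
    return u / s, v / s


def rectangle_mask(width, height, foot_u, foot_v, box_w, box_h):
    """Flattened row-major binary image with a box standing on the foot pixel."""
    mask = [0] * (width * height)
    bottom = int(round(foot_v)) + 1        # exclusive bound, feet row included
    top = bottom - box_h
    left = int(round(foot_u)) - box_w // 2
    for row in range(max(0, top), min(height, bottom)):
        for col in range(max(0, left), min(width, left + box_w)):
            mask[row * width + col] = 1
    return mask


def build_atom(cameras, X, Y):
    """Concatenate the per-camera silhouettes of one ground location."""
    atom = []
    for cam in cameras:
        u, v = project(cam["H"], X, Y)
        atom += rectangle_mask(cam["width"], cam["height"], u, v,
                               cam["box_w"], cam["box_h"])
    return atom


# Two toy cameras with assumed calibration values.
cameras = [
    {"H": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
     "width": 6, "height": 5, "box_w": 2, "box_h": 3},
    {"H": [[1, 0, -1], [0, 1, -2], [0, 0, 1]],
     "width": 4, "height": 3, "box_w": 2, "box_h": 2},
]
atom = build_atom(cameras, 3, 4)
```

 The resulting atom has one entry per pixel of every camera, stacked vertically, which matches the concatenation of the subdictionaries D_{c}.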
 Advantageously a dictionary D can be modified depending on the shape of the objects or people to be detected. In one embodiment it is contained in a file, such as an XML file, a txt file, a binary file or a file in another appropriate format, stored in the image processing system 8, and it can for example be changed/modified in order to better fit the silhouettes of Japanese or American people. Atoms of a dictionary that model the shape of a cell, an industrial object or a car can advantageously be used for automatically localizing cells, industrial objects or cars, respectively.
 Given the constructed dictionary D and the foreground silhouettes y, the problem of detecting/localizing people in a scene can be formulated by one of the three following minimizations, depending on their respective prior side information:
 (i) When the number of objects is a priori known or bounded, i.e. W_{H}(x)≤k, where k is a positive integer:

$\hat{x}=\underset{x}{\mathrm{argmin}}\; W_{H}(y\oplus D\cdot x)\quad \text{s.t.}\quad W_{H}(x)\le k \qquad (2)$

 (ii) When the maximum power of the noise is bounded, i.e. W_{H}(z)≤ε, which means the noise has flipped at most ε bits of the original MSV, where ε is a positive integer:

$\hat{x}=\underset{x}{\mathrm{argmin}}\; W_{H}(x)\quad \text{s.t.}\quad W_{H}(y\oplus D\cdot x)\le \varepsilon \qquad (3)$

 (iii) When there is prior knowledge neither about the noise level nor about the number of objects:

$\hat{x}=\underset{x}{\mathrm{argmin}}\left(\alpha\, W_{H}(x)+W_{H}(y\oplus D\cdot x)\right) \qquad (4)$

 In optimization (4), α is a regularization factor. In formulas (2), (3) and (4), W_{H}(a) denotes the Hamming weight of a Boolean vector a, i.e. the number of nonzero elements of a. All three minimizations above aim at recovering a sparse occupancy vector x that results in a close approximation of a given MSV y.
 Minimizations (2), (3) and (4) are formulations for nonlinear inverse problems with sparsity constraint, i.e. the constraint of minimizing W_{H}(x), which is the number of nonzero elements of the occupancy vector x. In fact, the number of the objects 1 present in the scene is sparse, since, for example, it is limited to the capacity of the room or of the play area.
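 To make the three formulations concrete, the following sketch solves them by exhaustive search on a toy problem. It is only an illustration: Boolean vectors are encoded as Python integers (one bit per pixel), the function names and toy values are assumptions, and the 2^N enumeration is exactly what becomes infeasible for realistic N.

```python
from itertools import product


def hw(v):
    """Hamming weight of a non-negative integer used as a bit vector."""
    return bin(v).count("1")


def synthesize(atoms, x):
    """Boolean product D.x: OR of the atoms whose entry in x is 1."""
    out = 0
    for d, xi in zip(atoms, x):
        if xi:
            out |= d
    return out


def solve_bounded_count(atoms, y, k):
    """Formulation (2): minimize the mismatch subject to W_H(x) <= k."""
    cands = [x for x in product((0, 1), repeat=len(atoms)) if sum(x) <= k]
    return min(cands, key=lambda x: hw(y ^ synthesize(atoms, x)))


def solve_bounded_noise(atoms, y, eps):
    """Formulation (3): minimize W_H(x) subject to mismatch <= eps."""
    cands = [x for x in product((0, 1), repeat=len(atoms))
             if hw(y ^ synthesize(atoms, x)) <= eps]
    return min(cands, key=sum) if cands else None


def solve_regularized(atoms, y, alpha):
    """Formulation (4): minimize alpha * W_H(x) + mismatch."""
    return min(product((0, 1), repeat=len(atoms)),
               key=lambda x: alpha * sum(x) + hw(y ^ synthesize(atoms, x)))


# Toy data (assumed): three atoms over six pixels; y is atom 0 OR atom 2
# with one foreground pixel missing.
atoms = [0b000111, 0b011100, 0b110000]
y = 0b110110
```

 On this toy input all three searches agree on the occupancy vector (1, 0, 1) for k=2, ε=1 and α=0.5, leaving one mismatching pixel attributed to noise.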
 To solve minimizations (2), (3) and (4), a greedy approach called “Regularized Matching Pursuit (RMP)” is preferably used. As discussed, a greedy algorithm is an approximation algorithm for optimization problems, which works iteratively. At each iteration, it locally optimizes the objective function of the problem, with the hope of finding the global optimum. Regularized Matching Pursuit (RMP) has three slightly different versions, which approximate the solutions of the three optimizations (2), (3) and (4), respectively. All three versions of RMP run in polynomial time, and in a localization application they approximate the solution in real time.
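 In all the following RMP versions:
 the symbol ∧ denotes the bitwise AND operator,
 the symbol ∨ denotes the bitwise OR operator,
 the symbol ¬ denotes the bitwise complement (NOT) operator,
 the symbol ⊕ denotes the bitwise XOR operator.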
 The following version of RMP approximates the optimization (2), i.e. the case wherein there is a priori knowledge of, or an upper bound on, the number of objects/people in the scene.
 Inputs:
 The MSV vector y
 An upper bound on the number of the objects/people, i.e. W_{H}(x)≤k
 The regularization factor λ
 The dictionary D
 Output:
 The support set Λ̂, hence the occupancy vector x̂
 Preprocessing:
 U ← ∅
 for (i=1:N) do
 if (W_{H}(d_{i}∧y)>0)
 U ← U∪{i}
 end if
 end for
 Main Greedy Process:
 Initialization: Λ̂ ← ∅, ŷ ← 0, R ← y, e ← W_{H}(y), t ← 1
 while (t≤k) do
 $j \Leftarrow \underset{j' \in U}{\mathrm{argmax}} \left\{ \lambda \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(d_{j'})} + (1-\lambda) \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(R)} \right\}$
 Updates:
 Recovered support: Λ̂ ← Λ̂∪{j}
 Recovered MSV: ŷ ← ŷ∨d_{j}
 Remainder: R ← R∧¬ŷ
 Error: e ← W_{H}(ŷ⊕y)
 Counter: t ← t+1
 end while
 The following version of RMP approximates the optimization (3), i.e. the case wherein there is a priori knowledge of the noise level, i.e. W_{H}(z)≤ε. In the case of probabilistic noise, ε is set to the maximum noise level that holds with very high probability.
 Inputs:
 The MSV vector y
 An upper bound on the noise level, i.e. W_{H}(z)≤ε
 The regularization factor λ
 The dictionary D
 Output:
 The support set Λ̂, hence the occupancy vector x̂
 Preprocessing:
 U ← ∅
 for (i=1:N) do
 if (W_{H}(d_{i}∧y)>0) AND (W_{H}(d_{i}∧¬y)≤ε)
 U ← U∪{i}
 end if
 end for
 Main Greedy Process:
 Initialization: Λ̂ ← ∅, ŷ ← 0, R ← y, e ← W_{H}(y), t ← 1
 while (e>ε) do
 $j \Leftarrow \underset{j' \in U}{\mathrm{argmax}} \left\{ \lambda \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(d_{j'})} + (1-\lambda) \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(R)} \right\}$
 Updates:
 Recovered support: Λ̂ ← Λ̂∪{j}
 Recovered MSV: ŷ ← ŷ∨d_{j}
 Remainder: R ← R∧¬ŷ
 Error: e ← W_{H}(ŷ⊕y)
 Counter: t ← t+1
 end while
 The following version of RMP approximates the optimization (4) with a regularization parameter α, in the case wherein there is prior information neither on the noise level nor on the number of objects/people in the scene.
 Inputs:
 The MSV vector y
 The regularization parameter α
 The regularization factor λ
 The dictionary D
 Output:
 The support set Λ̂, hence the occupancy vector x̂
 Preprocessing:
 U ← ∅
 for (i=1:N) do
 if (W_{H}(d_{i}∧y)>0)
 U ← U∪{i}
 end if
 end for
 Main Greedy Process:
 Initialization: Λ̂ ← ∅, ŷ ← 0, R ← y, e ← W_{H}(y), t ← 1
 while (e_{p}−e≥α) do
 $j \Leftarrow \underset{j' \in U}{\mathrm{argmax}} \left\{ \lambda \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(d_{j'})} + (1-\lambda) \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(R)} \right\}$
 Updates:
 Recovered support: Λ̂ ← Λ̂∪{j}
 Recovered MSV: ŷ ← ŷ∨d_{j}
 Remainder: R ← R∧¬ŷ
 Errors: e_{p} ← e
 e ← W_{H}(ŷ⊕y)
 Counter: t ← t+1
 end while
 All three versions of RMP take the MSV y, the dictionary D and a regularization parameter λ as inputs. The shape used to generate the atoms does not affect the computational complexity, since the dictionary D is computed “offline”, i.e. once before the detection process. Moreover, each version of RMP, depending on the optimization problem it aims to solve, takes specific a priori side information. For example, the version related to formula (2) takes an a priori known upper bound k on the number of objects/people in the scene, whereas the version related to formula (3) requires an upper bound ε on the noise level, and the version related to formula (4) takes an extra regularization parameter α to weight the two terms of optimization (4) appropriately.
 All three versions of RMP (based on their respective inputs) estimate the support set Λ̂⊂{1, 2, . . . , N}, which determines the positions of the nonzero elements (positions of ones in the Boolean case) of the recovered occupancy vector x̂. Since the occupancy vector is Boolean, it is fully characterized by its support set.
 All three versions of RMP contain a “Preprocessing” step that reduces the search space of the “Main Greedy Process” to a set U⊂{1, 2, . . . , N}. Hence, the major complexity of the algorithm, which belongs to the iterative part (Main Greedy Process), scales with the cardinality of U rather than with N. The Preprocessing step is the same for the first (formula (2)) and the third (formula (4)) versions of RMP: it reduces the search space U to the atoms of the dictionary whose support set (positions of the nonzero bits) has at least one element in common with the support of the MSV y. For the second version (formula (3)), since RMP knows the upper bound on the noise corruption, the search space U contains the atoms of D whose support set has at least one element in common with the support of the MSV y, but no more than ε elements outside the support of the MSV.
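 The Preprocessing step can be sketched as below. Boolean vectors are encoded as Python integers (one bit per pixel), atom indices are 0-based, and the function name and toy values are illustrative assumptions.

```python
def hw(v):
    """Hamming weight of a non-negative integer used as a bit vector."""
    return bin(v).count("1")


def preprocess(atoms, y, eps=None):
    """Return the reduced search space U as a list of atom indices.

    Without eps: keep atoms sharing at least one foreground pixel with y
    (versions related to formulas (2) and (4)).
    With eps: additionally require at most eps atom pixels outside the
    support of y (version related to formula (3)).
    """
    U = []
    for i, d in enumerate(atoms):
        overlap = hw(d & y)        # atom pixels inside the support of y
        outside = hw(d & ~y)       # atom pixels outside the support of y
        if overlap > 0 and (eps is None or outside <= eps):
            U.append(i)
    return U


# Toy data (assumed): four atoms over six pixels.
atoms = [0b000111, 0b011100, 0b110000, 0b111000]
y = 0b001110
U_any = preprocess(atoms, y)          # atom 2 has no overlap and is dropped
U_eps = preprocess(atoms, y, eps=1)   # atom 3 has two pixels outside y
```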
 All three versions of RMP contain an iterative step, the “Main Greedy Process”. At each iteration, RMP selects an atom d_{i} of the dictionary based on a selection criterion and adds its corresponding index i to the recovered support set Λ̂. The algorithm repeats the iterations until the stopping or ending criteria are met. At each iteration, the algorithm updates several parameters, including the recovered support set Λ̂.
 The selection criterion in all three cases is the same: the selected atom is the atom d_{j′} that maximizes the following statistic:

$f_{j'} = \lambda \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(d_{j'})} + (1-\lambda) \frac{W_{H}(d_{j'} \wedge R)}{W_{H}(R)} \quad \forall j' \in U \qquad (5)$

 The complexity of the maximization at each iteration reduces from N to the cardinality of U, thanks to the Preprocessing step.
 Regarding people detection on the ground, the selection criterion in all three cases is set to select an atom that maximizes formula (5) and that keeps a minimum distance from the previously selected atoms, typically 60 to 70 cm.
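 One possible way to combine the statistic with the minimum-distance rule is sketched below. The data layout is an assumption made for illustration: `scores` maps a candidate atom index to its statistic, `positions` maps an index to ground coordinates in meters, and the 0.6 m default is taken from the typical range quoted above.

```python
from math import hypot


def far_enough(candidate, selected, positions, min_dist):
    """True if the candidate is at least min_dist from every selected atom."""
    cx, cy = positions[candidate]
    return all(hypot(cx - positions[s][0], cy - positions[s][1]) >= min_dist
               for s in selected)


def select_with_spacing(scores, selected, positions, min_dist=0.6):
    """Best-scoring atom that respects the spacing constraint, or None."""
    allowed = [j for j in scores
               if j not in selected
               and far_enough(j, selected, positions, min_dist)]
    return max(allowed, key=lambda j: scores[j]) if allowed else None


# Toy data (assumed): atom 1 scores best but stands 0.3 m from atom 0.
positions = {0: (0.0, 0.0), 1: (0.3, 0.0), 2: (1.0, 0.0)}
scores = {1: 0.9, 2: 0.7}
choice = select_with_spacing(scores, [0], positions)
```

 With atom 0 already selected, the sketch skips atom 1 despite its higher score and returns atom 2.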
 Each statistic f_{j′} of formula (5) contains two normalized terms:

 a. Cover: $\frac{W_{H}(d_{j'} \wedge R)}{W_{H}(R)}$
 b. Fitness: $\frac{W_{H}(d_{j'} \wedge R)}{W_{H}(d_{j'})}$

 The cover measures how much an atom intersects with the remainder, and the fitness measures how well an atom fits the remainder. The remainder is a Boolean vector initially equal to the MSV y. The regularization parameter λ is used to weight these two terms appropriately.
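 The effect of λ on the selection can be seen in a small sketch of formula (5). Boolean vectors are encoded as Python integers; the function names and toy values are assumptions, and the code presumes nonempty atoms and a nonempty remainder.

```python
def hw(v):
    """Hamming weight of a non-negative integer used as a bit vector."""
    return bin(v).count("1")


def cover(d, R):
    """Fraction of the remainder that the atom explains."""
    return hw(d & R) / hw(R)


def fitness(d, R):
    """Fraction of the atom that lies inside the remainder."""
    return hw(d & R) / hw(d)


def statistic(d, R, lam):
    """Formula (5): lam weights the fitness, (1 - lam) weights the cover."""
    return lam * fitness(d, R) + (1 - lam) * cover(d, R)


def select(atoms, U, R, lam):
    """Index in U of the atom maximizing the statistic."""
    return max(U, key=lambda j: statistic(atoms[j], R, lam))


# Toy data (assumed): a small atom fully inside R, and a large atom
# covering all of R but spilling outside it.
R = 0b001111
atoms = [0b000011, 0b111111]
```

 With λ=1 only the fitness counts and the small, fully contained atom wins; with λ=0 only the cover counts and the large atom wins.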
 In one embodiment of RMP, the selection criterion in all three realizations becomes the maximizer of a statistic wherein the cover is:

$\frac{W_{H}(d_{j'} \wedge y)}{W_{H}(d_{j'})} \quad \forall j' \in U \qquad (6)$

 The use of the MSV y instead of the remainder R in the numerator of the cover relaxes the formulation such that it allows detecting highly overlapping objects.
 In another embodiment of RMP, the selection criterion in all three realizations becomes the maximizer of a statistic wherein the cover is:

$\frac{W_{H}(d_{j'} \wedge (R \vee \hat{y}))}{W_{H}(d_{j'})} \quad \forall j' \in U \qquad (7)$

 In all three versions of RMP, at each iteration, after selecting the atom's index, the following parameters are updated:

 a. The recovered support set Λ̂, which is initially set to be empty.
 b. The recovered MSV ŷ, which is a Boolean vector constructed by bitwise OR among the atoms of D recovered so far. This vector is initially set to zero.
 c. The remainder R, which is a Boolean vector initially equal to the MSV y. At each iteration, this vector is updated by taking out the contribution of the MSV ŷ recovered so far from the original one y. After each iteration the energy of the remainder, i.e. the Hamming weight of the vector R, reduces.
 d. The error level e, which is a scalar initially set to the Hamming weight (energy) of the MSV y. This value is updated at each step by counting the number of bits in the recovered MSV ŷ that mismatch the original MSV y. At the beginning of the iterations the error decreases; however, once too many atoms have been recovered, this value starts to increase.
 e. A scalar value t, which counts the number of objects/people recovered so far.
 The stopping or ending criteria for the iterative process are closely related to the extra side information and hence differ for each of the three versions of RMP. They appear in the condition of the while loop of the Main Greedy Process and are, for each version, as follows:

 a. For the first RMP version that optimizes (2), the Main Greedy Process runs for only k iterations, which is equivalent to selecting k atoms of D, i.e. estimating only k objects/persons.
 b. For the second RMP version that optimizes (3), the iterations continue to add atom indices to Λ̂ as long as the error level e is higher than the noise level ε. As soon as e falls to or below the noise level, the iterations stop. This avoids recovering too many atoms, which would start representing the noisy parts rather than the true objects' silhouettes.
 c. For the third RMP version that optimizes (4), the iterations continue to add atom indices to Λ̂ as long as the error level e is decreasing fast enough. More precisely, the error must decrease by at least α between two consecutive steps. The rationale, as mentioned before, is that the error value decreases very fast at the beginning of the iterations. However, after recovering the true objects, at a certain point, the algorithm starts to recover the noisy part of the MSV, which results in an increasing error level. It is assumed that the noise is not adversarial and that it cannot be modeled (fitted) by atoms that are designed to estimate the true objects'/people's silhouettes. That explains why an attempt to model/recover the noisy parts leads to an increase in the error level, which forces the iterations to stop. The parameter α that weights the two terms of (4) thus acts as the required minimum rate of error decrease for the iterations to continue.
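 The three stopping rules can be combined in one compact sketch of the greedy process. This is an illustrative reading of the description, not the reference implementation: Boolean vectors are Python integers, indices are 0-based, the function name `rmp` and its keyword arguments are assumptions, and, as a deliberate simplification for the third rule, the error decrease is tested before an atom is committed, so an atom that fails the test is not kept.

```python
def hw(v):
    """Hamming weight of a non-negative integer used as a bit vector."""
    return bin(v).count("1")


def rmp(atoms, y, lam, k=None, eps=None, alpha=None):
    """Greedy recovery of the support set; pass one of k, eps or alpha."""
    # Preprocessing: atoms overlapping y (and, if eps is given, with at
    # most eps pixels outside the support of y).
    U = [i for i, d in enumerate(atoms)
         if hw(d & y) > 0 and (eps is None or hw(d & ~y) <= eps)]
    support, y_hat, R, e = [], 0, y, hw(y)
    while R:                                   # nothing left to explain
        if k is not None and len(support) >= k:
            break                              # rule a: k atoms selected
        if eps is not None and e <= eps:
            break                              # rule b: error within noise
        candidates = [j for j in U if j not in support]
        if not candidates:
            break
        j = max(candidates,
                key=lambda c: lam * hw(atoms[c] & R) / hw(atoms[c])
                + (1 - lam) * hw(atoms[c] & R) / hw(R))
        new_y_hat = y_hat | atoms[j]           # recovered MSV
        new_e = hw(new_y_hat ^ y)              # mismatching bits
        if alpha is not None and e - new_e < alpha:
            break                              # rule c: decrease too slow
        support.append(j)
        y_hat, e = new_y_hat, new_e
        R = R & ~y_hat                         # remainder
    return sorted(support)


# Toy data (assumed): y is atom 0 OR atom 2 with one pixel missing.
atoms = [0b000111, 0b011100, 0b110000]
y = 0b110110
```

 On this toy input the calls `rmp(atoms, y, 0.5, k=2)`, `rmp(atoms, y, 0.5, eps=1)` and `rmp(atoms, y, 0.5, alpha=1)` all return the indices of atoms 0 and 2, while a stricter `alpha=2` stops after the first atom.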
 The algorithm allows the occupancy vector to be retrieved in real time. Moreover, RMP is robust to occlusions, since it is able to detect/localize objects or people even if they occlude each other, as FIGS. 1 and 2 demonstrate.
 In one embodiment the algorithm can be used for tracking objects. In this case, the occupancy vector x retrieved for a frame can be used in the next frame as another observation vector in conjunction with the current frame's MSV y. The inter-frame correlations are thus advantageously exploited to improve tracking.
 In another embodiment the algorithm can take into account radio signals emitted from the objects or people in order to retrieve efficiently their positions.
 In another embodiment the output of the algorithm according to the invention is not a vector comprising points on the 2D ground, but 3D volumes. The algorithm can then determine 3-dimensional positions. In such a case, the system is able to determine, for example, the position of a basketball player while he is jumping. In this embodiment many cameras 2 are needed. Moreover, the calibration step has to be 3-dimensional.
 If the objects to be detected are people, the atom shape that gives a high precision for retrieving the positions is half rectangular and half ellipsoid. Other shapes, e.g. a rectangular shape or an ellipsoid shape, can be used, and the optimal shape also depends on the inclination of the cameras. Between two occupied ground points, the minimal spatial distance is representative of the average width of a standing person. A typical value is between 60 cm and 70 cm. The sensitivity of RMP to the model that generates the human silhouettes, in the case of a half rectangular and half ellipsoid shape, is 30%. In other words, RMP can use the same dictionary design to robustly detect/localize people with heights in the range of 1.20 m to 2.20 m if the atoms of the dictionary D are designed to approximate the silhouettes of people with a height of 1.70 m. Hence, the proposed approximate shape for atoms can handle a 30% mismatch, i.e. the tolerance is equal to 30%.
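 As a quick check of the quoted range, assuming the 30% tolerance is applied symmetrically to the 1.70 m design height (an interpretation, not stated explicitly above):

$1.70\,\mathrm{m}\times(1-0.30)=1.19\,\mathrm{m}, \qquad 1.70\,\mathrm{m}\times(1+0.30)=2.21\,\mathrm{m}$

 which agrees with the stated range of roughly 1.20 m to 2.20 m.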
 If many cameras 2 are used, the dimensions of y and x, and consequently of D, increase. In this case some possibilities exist in order to further reduce the complexity cost, i.e. the total computational time and the memory storage.
 In one embodiment the dimensionality of the observations can be reduced. The dimension of the observation vector y is by default equal to the sum of each camera resolution. To reduce the computation cost, all images are down scaled.
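 The text does not specify how the images are downscaled. One simple option for binary foreground masks, shown here purely as an assumption, is block pooling in which a reduced pixel is foreground if any pixel of its block is foreground; the same reduction would then have to be applied to the atoms of the dictionary so that y and D stay in the same space.

```python
def downscale_mask(mask, width, height, factor):
    """Reduce a flattened row-major binary mask by `factor` in each
    direction; an output pixel is 1 if any pixel of its block is 1.
    Assumes width and height are multiples of factor."""
    out_w, out_h = width // factor, height // factor
    out = []
    for by in range(out_h):
        for bx in range(out_w):
            block = (mask[(by * factor + dy) * width + bx * factor + dx]
                     for dy in range(factor) for dx in range(factor))
            out.append(int(any(block)))
    return out


# Toy 4x4 mask (assumed) with a single foreground pixel at row 1, col 1.
mask = [0, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0]
small = downscale_mask(mask, 4, 4, 2)
```

 A factor of 2 divides the contribution of each camera to the dimension of y by 4.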
 The complexity cost depends on N, the number of ground plane points to be classified as occupied or not. In another embodiment the resolution of the cameras and the sparsity of the people present in the scene are considered for discretizing the ground. According to one aspect of the invention a non-regularly spaced sampling process is used for discretizing the ground.
 Known solutions discretize the visible part of the ground into a fixed number of regularly spaced points, as shown in FIG. 6a. FIG. 6b shows the non-regularly spaced sampling process. Points regularly spaced in the image plane of all cameras are mapped to the ground to form points called the “sample points”. The mapped location points are quantized to avoid points spaced less than a few centimeters apart. Although the grid proposed in FIG. 6b has fewer points, regions of interest have a higher density of points, i.e. a higher spatial resolution, compared to the known solution of FIG. 6a.
 In another embodiment a further reduction in the search space can be achieved by measuring the activity of a sample point according to three possible assumptions:
 1. Assumption 1 (Foreground pixels only): Sample points are ground plane points belonging to the foreground pixels of at least one camera.
 2. Assumption 2 (Intersecting foreground pixels): Sample points are ground plane points belonging to the foreground pixels of all the cameras observing the corresponding points.
 3. Assumption 3 (Least significant silhouette): Sample points are ground plane points corresponding to a significant foreground silhouette in all the cameras observing the corresponding points.
 Assumption 3 is contained in assumption 2, which is contained in assumption 1. In other words, assumption 3 is more restrictive than assumption 2, and assumption 2 is more restrictive than assumption 1. When an assumption reduces the search space, it has the drawback of potentially removing correct locations.
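 The quantization of mapped sample points and the first two activity assumptions can be sketched as follows. The data layout is hypothetical: ground coordinates are integers in centimeters, and `observations` maps each sample point to one Boolean per camera observing it, stating whether the point projects onto a foreground pixel. Assumption 3 needs a measure of silhouette significance that the text does not define, so it is left out.

```python
def quantize(points, step):
    """Snap ground points to a grid of spacing `step` and drop duplicates,
    so that no two sample points are closer than one grid cell."""
    seen, out = set(), []
    for X, Y in points:
        key = (round(X / step), round(Y / step))
        if key not in seen:
            seen.add(key)
            out.append((key[0] * step, key[1] * step))
    return out


def active_points(observations, assumption):
    """Filter sample points by activity.

    assumption 1: foreground in at least one observing camera
    assumption 2: foreground in all observing cameras
    """
    test = any if assumption == 1 else all
    return [p for p, fg in observations.items() if fg and test(fg)]


# Toy data (assumed): two mapped points fall into the same 5 cm cell.
samples = quantize([(100, 200), (101, 201), (130, 200)], 5)
observations = {"a": [True, False], "b": [True, True], "c": [False, False]}
```

 Here assumption 1 keeps points a and b, while the more restrictive assumption 2 keeps only b.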

 FIG. 7a shows an example of five camera views with overlapping fields of view. FIG. 7b illustrates an example of foreground silhouettes (made of shadows, people's reflections, missed regions) and FIG. 7c shows the silhouettes used to model their presence in the set of planar and omnidirectional cameras.
Claims (49)
1. A method for automatic localization of objects in a mask, comprising the steps of:
building a dictionary of atoms, wherein each atom models the presence of one object at one location;
iteratively determining the atom of said dictionary which is most correlated with said mask, until ending criteria are met.
2. The method of claim 1 , wherein said mask is computed by a foregroundbackground extraction process from an image acquired by at least one camera,
and wherein each step comprises the determination of the atom which is most correlated with said mask, and the adaptation of a remainder by taking out said atom.
3. The method of claim 2 wherein said mask is a binary mask.
4. The method of claim 1 , wherein said dictionary comprises a list of atoms at each of a plurality of uneven spaced positions.
5. The method of claim 1 , wherein each atom models the different images taken by a plurality of different cameras of one object at one location.
6. The method of claim 1 , wherein said method recovers an occupancy vector according to the formula
with the constraint of minimizing the number of elements of said occupancy vector different to zero and the constraint W_{H}(x)≦k, where x is said occupancy vector, y is a multisilhouette vector, D is the dictionary of atoms, ⊕ is a bitwise XOR operator, W_{H}(·) is the Hamming weight of Boolean vector and k is an integer and positive number.
7. The method of claim 1, wherein said method recovers an occupancy vector according to the formula
with the constraint of minimizing the number of elements of said occupancy vector different to zero and the constraint W_{H}(y⊕D·x)≤ε, where x is said occupancy vector, y is a multisilhouette vector, D is the dictionary of atoms, ⊕ is a bitwise XOR operator, W_{H}(·) is the Hamming weight of a Boolean vector and ε is an integer and positive number.
8. The method of claim 1 , wherein said method recovers an occupancy vector according to the formula
x̂=argmin(αW_{H}(x)+W_{H}(y⊕D·x))
with the constraint of minimizing the number of elements of said occupancy vector different to zero, where x is said occupancy vector, y is a multisilhouette vector, D is the dictionary of atoms, ⊕ is a bitwise XOR operator, W_{H}(·) is the Hamming weight of a Boolean vector and α is a regularization parameter.
9. The method of claim 1 comprising
a) selecting the most correlated atom of said dictionary of atoms with said multisilhouette vector for each possible location
c) updating a remainder of said multisilhouette vector taking out the contribution of said most correlated atom
e) repeating steps a) to c) until meeting ending criteria.
10. The method of claim 9 wherein said ending criteria depend on an a priori knowledge of the number of objects to be detected.
11. The method of claim 9 wherein said ending criteria depend on an upper bound error level or the energy level of said remainder.
12. The method of claim 9 wherein said ending criteria depend on the decreasing of an error level e.
13. The method of claim 9 wherein said most correlated atom corresponds to a maximal statistic.
14. The method of claim 13 wherein said maximal statistic is the sum of two parameters depending on said atoms, said remainder, said multisilhouette vector and a second regularization factor.
15. The method of claim 14 wherein the first parameter of said two parameters is the cover defined by the formula
16. The method of claim 14 wherein the first parameter of said two parameters is the cover defined by the formula
17. The method of claim 14 wherein the first parameter of said two parameters is the cover defined by the formula
18. The method of claim 14 wherein the second parameter of said two parameters is the fitness defined by the formula
19. The method of claim 1 , comprising a preprocessing step and a main greedy process step.
20. The method of claim 19 , wherein said preprocessing step reduces the search space of said main greedy process step.
21. The method of claim 1 wherein said atoms in said dictionary depend on the shape of said objects.
22. The method of claim 1 wherein said dictionary depends on the position, the zoom, the focus and the resolution of said at least one camera.
23. The method of claim 1 wherein said location is defined by an adaptive discretization of a ground.
24. The method of claim 23 wherein said adaptive discretization comprises the mapping of points regularly spaced in an image plane of said at least one camera to samples points of said ground and a quantization of said samples points on said ground.
25. The method of claim 1 wherein said objects are people in a crowded environment.
26. The method of claim 25, wherein the method selects an atom that has a minimum distance with previously selected atoms.
27. The method of claim 26 , wherein said minimum distance is comprised between 60 cm and 70 cm.
28. The method of claim 25 wherein a halfcylinderhalf spherical shape is used to approximate the silhouette of a person in a view of said at least one camera.
29. The method of claim 1 wherein there are at least two cameras.
30. The method of claim 24 wherein said dictionary is a matrix wherein the number of rows corresponds to the resolution of said at least one camera and the number of columns corresponds to the number of said samples points multiplied by the number of possible shapes of said atoms.
31. The method of claim 24 wherein said adaptive discretization is a function of the topology of said cameras and of the activity of the scene.
32. The method of claim 24 comprising the measuring the activity of said samples points according to said assumption:
Sample points are ground plane points belonging to the foreground pixels of at least one camera.
33. The method of claim 24 comprising the measuring the activity of said samples points according to said assumption
Sample points are ground plane points belonging to the foreground pixels of all the cameras observing the corresponding points.
34. The method of claim 24 comprising the measuring the activity of said samples points according to said assumption
Sample points are ground plane points corresponding to a significant foreground silhouette in all the cameras observing the corresponding points.
35. The method of claim 1 comprising
a. acquiring a multisilhouette vector y
b. defining an upper bound on the number of the objects W_{H}(x)≤k
c. defining a regularization factor λ
d. defining a dictionary D
e. creating an output support set Λ̂
f. initializing
g. performing a preprocessing step for reducing the search space to a set U⊂{1, 2, . . . , N}
h. computing the sequence of statistics according to the formula
wherein d_{j′} are said atoms, R is said remainder, ∧ is the bitwise AND operator, W_{H}(·) is the Hamming weight of a Boolean vector and λ is a regularization factor,
i. repeating the step h. for each point between the number of points on the search space
l. finding the argmax of said sequence of statistics
m. updating said output support set according to the formula Λ̂ ← Λ̂∪{j}
n. updating said recovered multisilhouette vector according to the formula ŷ ← ŷ∨d_{j}, where d_{j} is the atom corresponding to said argmax of said sequence of statistics
p. updating an error according to the formula e ← W_{H}(ŷ⊕y), wherein ⊕ is the bitwise XOR operation between vectors
r. repeating steps h. to q. until t≤k.
36. The method of claim 1 comprising
a. acquiring a multisilhouette vector y
b. defining an upper bound on the noise level W_{H}(x)≤ε
c. defining a regularization factor λ
d. defining a dictionary D
e. creating an output support set Λ̂
f. initializing
g. performing a preprocessing step for reducing the search space to a set U⊂{1, 2, . . . , N}
h. computing the sequence of statistics according to the formula
wherein d_{j′} are said atoms, R is said remainder, ∧ is the bitwise AND operator, W_{H}(·) is the Hamming weight of a Boolean vector and λ is a regularization factor,
i. repeating the step h. for each point between the number of points on the search space
l. finding the argmax of said sequence of statistics
m. updating said output support set according to the formula Λ̂ ← Λ̂∪{j}
n. updating said recovered multisilhouette vector according to the formula ŷ ← ŷ∨d_{j}, where d_{j} is the atom corresponding to said argmax of said sequence of statistics
p. updating an error according to the formula e ← W_{H}(ŷ⊕y), wherein ⊕ is the bitwise XOR operator between Boolean vectors
r. repeating steps h. to q. until e>ε.
37. The method of claim 1 comprising
a. acquiring a multisilhouette vector y
b. defining a regularization parameter α
c. defining a regularization factor λ
d. defining a dictionary D
e. creating an output support set Λ̂
f. initializing
g. performing a preprocessing step for reducing the search space to a set U⊂{1, 2, . . . , N}
h. computing the sequence of statistics according to the formula
wherein d_{j′ }are said atoms, R is said remainder, is the bitwise AND operator, W_{H}(·) is the Hamming weight of a Boolean vector and λ is the regularization factor,
i. repeating the step h. for each point between the number of points on the search space
l. Finding the argmax of said sequence of statistics
m. updating said output support set according to the formula {circumflex over (Λ)}{circumflex over (Λ)}∪{j}
n. updating said recovered multisilhouette vector according to the formula ŷŷŷd, where d is the atom corresponding to said argmax of said sequence of statistics
p. updating an error according to the formula e ← W_{H}(ŷ ⊕ y), wherein ⊕ is the bitwise XOR operator between Boolean vectors
r. repeating steps h. to q. until e_{p}−e>α.
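The greedy recovery recited in steps h. to r. of claims 36 and 37 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The statistic formula does not survive in the text above, so the score used here (silhouette pixels newly covered by an atom, minus λ times the atom pixels falling outside y) is an assumption. The function names, the use of NumPy, the bitwise-OR update of ŷ, and the reading of the stopping tests as "stop once the error is at most ε, or once the error decrease falls below α" are assumptions as well.

```python
import numpy as np


def hamming_weight(v):
    """W_H: number of True entries in a Boolean vector."""
    return int(np.count_nonzero(v))


def greedy_pursuit(y, D, lam, eps=0, alpha=None, candidates=None):
    """Greedy recovery of a support set from a Boolean silhouette vector.

    y          : (M,) Boolean multisilhouette vector
    D          : (M, N) Boolean dictionary, one atom per column
    lam        : regularization factor (lambda)
    eps        : stop once the error is <= eps (assumed reading)
    alpha      : optional; stop once the error decrease is < alpha (assumed)
    candidates : optional reduced search space U (iterable of column indices)
    """
    y = np.asarray(y, dtype=bool)
    D = np.asarray(D, dtype=bool)
    U = range(D.shape[1]) if candidates is None else list(candidates)

    support = []                    # output support set
    y_hat = np.zeros_like(y)        # recovered multisilhouette vector
    R = y.copy()                    # remainder: pixels of y not yet explained
    e = hamming_weight(y_hat ^ y)   # error e = W_H(y_hat XOR y)

    while e > eps:
        best_j, best_s = None, 0.0
        for j in U:
            if j in support:
                continue
            d = D[:, j]
            # ASSUMED statistic (the formula is missing from the source text):
            # reward pixels of the remainder covered by the atom, penalize
            # atom pixels that fall outside the observed silhouette.
            s = hamming_weight(d & R) - lam * hamming_weight(d & ~y)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is None:          # no remaining atom has a positive score
            break

        support.append(best_j)
        y_hat |= D[:, best_j]       # assumed update: bitwise OR with the atom
        R = y & ~y_hat
        e_prev, e = e, hamming_weight(y_hat ^ y)
        if alpha is not None and e_prev - e < alpha:
            break

    return support, y_hat, e


# Toy input: three atoms over six pixels; y is the union of atoms 0 and 1.
D = np.array([[1, 0, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=bool)
y = D[:, 0] | D[:, 1]
support, y_hat, e = greedy_pursuit(y, D, lam=1.0)
# For this toy input: support == [0, 1] and e == 0.
```

With λ = 1 the third atom is never selected in this toy case, because its one pixel outside y cancels the single remainder pixel it would still cover after the first pick.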
38. A method for automatic localization of objects in an image taken by at least one fixed camera acquiring a multisilhouette vector wherein said method recovers an occupancy vector by using said multisilhouette vector and a dictionary of atoms, each atom modeling the presence of a single object at a given location of said image.
39. The method of claim 38, comprising a plurality of iterative steps, wherein at each of said iterative steps the atom of said dictionary that best matches said image is determined.
40. The method of claim 38, which takes into account a sparsity constraint, i.e. the constraint of minimizing the number of nonzero elements of said occupancy vector x.
41. A non-transitory computer readable medium storing a program causing a computer to execute instructions executable to compute the method of claim 1.
42. A system for automatically detecting objects in a mask, comprising
at least one fixed camera for acquiring video frames;
computation means for calibrating said at least one fixed camera;
computation means for extracting foreground silhouettes in each acquired video frame;
computation means for discretizing said ground plane into a non-regular grid of potential location points;
computation means for constructing a dictionary of atoms, each atom modeling the presence of a single object at a given location of said ground plane;
computation means for finding object location points with the method of claim 1;
means for propagating the result in said at least one fixed camera view.
43. The system of claim 42, wherein said mask is computed by a foreground-background extraction process from an image acquired by at least one camera.
44. The system of claim 42 comprising at least two cameras.
45. The system of claim 42 wherein said at least one camera is a planar and/or omnidirectional camera.
46. The system of claim 42 wherein said at least one camera is an IR camera.
47. The system of claim 44 wherein said cameras have overlapping fields of view.
48. The system of claim 42 wherein said objects are people in a crowded environment.
49. The system of claim 42 wherein all said computation means belong to an image processing system.
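Claims 38 to 40 rely on a forward model linking the occupancy vector x to the multisilhouette vector through the dictionary of atoms. A minimal Python sketch of one plausible model follows. Combining atoms with a bitwise OR, storing one atom per column, and the names `synthesize_silhouette` and `sparsity` are assumptions made for illustration, not details taken from the claims.

```python
import numpy as np


def synthesize_silhouette(D, x):
    """Assumed forward model: OR together the atoms of the occupied locations.

    D : (M, N) Boolean dictionary, one atom (single-object silhouette) per column
    x : (N,) Boolean occupancy vector, True where a location is occupied
    """
    D = np.asarray(D, dtype=bool)
    x = np.asarray(x, dtype=bool)
    # Select the columns of occupied locations, then OR them pixel by pixel.
    return D[:, x].any(axis=1)


def sparsity(x):
    """Number of nonzero elements of the occupancy vector (claim 40)."""
    return int(np.count_nonzero(x))


# Toy dictionary: three candidate locations observed over four pixels.
D = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=bool)
x = np.array([1, 1, 0], dtype=bool)   # objects at locations 0 and 1
y = synthesize_silhouette(D, x)       # pixels 0, 1 and 2 are foreground
```

Under this model, recovering x from y amounts to finding a small set of columns whose union reproduces y, which is where the sparsity constraint of claim 40 enters.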
Priority Applications (2)
Application Number  Priority Date  Filing Date  Title 

US12/779,547 US8749630B2 (en)  2010-05-13  2010-05-13  Method and system for automatic objects localization
US14/267,598 US20140254875A1 (en)  2010-05-13  2014-05-01  Method and system for automatic objects localization
Applications Claiming Priority (1)
Application Number  Priority Date  Filing Date  Title 

US14/267,598 US20140254875A1 (en)  2010-05-13  2014-05-01  Method and system for automatic objects localization
Related Parent Applications (1)
Application Number  Priority Date  Filing Date  Title

US12/779,547 Continuation US8749630B2 (en)  2010-05-13  2010-05-13  Method and system for automatic objects localization
Publications (1)
Publication Number  Publication Date 

US20140254875A1 (en)  2014-09-11
Family
ID=44508447
Family Applications (2)
Application Number  Priority Date  Filing Date  Title

US12/779,547 Expired - Fee Related US8749630B2 (en)  2010-05-13  2010-05-13  Method and system for automatic objects localization
US14/267,598 Abandoned US20140254875A1 (en)  2010-05-13  2014-05-01  Method and system for automatic objects localization
Family Applications Before (1)
Application Number  Priority Date  Filing Date  Title

US12/779,547 Expired - Fee Related US8749630B2 (en)  2010-05-13  2010-05-13  Method and system for automatic objects localization
Country Status (2)
Country  Link 

US (2)  US8749630B2 (en) 
EP (1)  EP2386981A3 (en) 
Cited By (1)
Publication number  Priority date  Publication date  Assignee  Title 

US9165208B1 (en) *  2013-03-13  2015-10-20  Hrl Laboratories, Llc  Robust ground-plane homography estimation using adaptive feature selection
Families Citing this family (13)
Publication number  Priority date  Publication date  Assignee  Title 

FR2981771B1 (en) *  2011-10-21  2013-11-08  Commissariat Energie Atomique  Method for locating objects by resolution in the three dimensional space of the scene
JP2013191135A (en) *  2012-03-15  2013-09-26  Fujitsu Ltd  Authentication system, processing device, and program
JP6032921B2 (en) *  2012-03-30  2016-11-30  キヤノン株式会社  Object detection apparatus and method, and program
US9260122B2 (en) *  2012-06-06  2016-02-16  International Business Machines Corporation  Multi-sensor evidence integration and optimization in object inspection
JP5919538B2 (en) *  2012-06-15  2016-05-18  パナソニックＩｐマネジメント株式会社  Object detection apparatus and object detection method
JP5692215B2 (en) *  2012-06-15  2015-04-01  カシオ計算機株式会社  Imaging apparatus, imaging method, and program
US9244159B1 (en) *  2013-01-31  2016-01-26  The Boeing Company  Distinguishing between maritime targets and clutter in range-doppler maps
KR20150037091A (en) *  2013-09-30  2015-04-08  삼성전자주식회사  Image processing apparatus and control method thereof
US9367922B2 (en) *  2014-03-06  2016-06-14  Nec Corporation  High accuracy monocular moving object localization
US9794525B2 (en) *  2014-03-25  2017-10-17  Ecole Polytechnique Federale De Lausanne (Epfl)  Systems and methods for tracking interacting objects
CN106462951B (en) *  2014-06-10  2019-07-05  特拉维夫大学拉莫特有限公司  For handling the method and system of image
US9361524B2 (en)  2014-10-20  2016-06-07  King Abdullah University Of Science & Technology  System and method for crowd counting and tracking
US10049462B2 (en)  2016-03-23  2018-08-14  Akcelita, LLC  System and method for tracking and annotating multiple objects in a 3D model
Citations (1)
Publication number  Priority date  Publication date  Assignee  Title 

US20050265582A1 (en) *  2002-11-12  2005-12-01  Buehler Christopher J  Method and system for tracking and behavioral monitoring of multiple objects moving through multiple fields-of-view
Family Cites Families (9)
Publication number  Priority date  Publication date  Assignee  Title 

US7688349B2 (en)  2001-12-07  2010-03-30  International Business Machines Corporation  Method of detecting and tracking groups of people
US8131011B2 (en) *  2006-09-25  2012-03-06  University Of Southern California  Human detection and tracking system
US8116564B2 (en)  2006-11-22  2012-02-14  Regents Of The University Of Minnesota  Crowd counting and monitoring
KR100847143B1 (en) *  2006-12-07  2008-07-18  한국전자통신연구원  System and Method for analyzing of human motion based silhouettes of real-time video stream
US20090002489A1 (en) *  2007-06-29  2009-01-01  Fuji Xerox Co., Ltd.  Efficient tracking multiple objects through occlusion
US8358806B2 (en) *  2007-08-02  2013-01-22  Siemens Corporation  Fast crowd segmentation using shape indexing
US8374388B2 (en) *  2007-12-28  2013-02-12  Rustam Stolkin  Real-time tracking of non-rigid objects in image sequences for which the background may be changing
US8363926B2 (en) *  2008-02-06  2013-01-29  University Of Central Florida Research Foundation, Inc.  Systems and methods for modeling three-dimensional objects from two-dimensional images
US20090296989A1 (en)  2008-06-03  2009-12-03  Siemens Corporate Research, Inc.  Method for Automatic Detection and Tracking of Multiple Objects

2010-05-13  US  US12/779,547  US8749630B2 (en)  Expired - Fee Related
2011-05-13  EP  EP11166041A  EP2386981A3 (en)  Withdrawn
2014-05-01  US  US14/267,598  US20140254875A1 (en)  Abandoned
Non-Patent Citations (1)
Title 

Haritaoglu et al.; "Ghost: A Human Body Part Labeling System Using Silhouettes"; IEEE; 20 August 1998; Fourteenth International Conference on Pattern Recognition, 1998. * 
Also Published As
Publication number  Publication date 

US20110279685A1 (en)  2011-11-17
EP2386981A3 (en)  2012-03-28
US8749630B2 (en)  2014-06-10
EP2386981A2 (en)  2011-11-16
Similar Documents
Publication  Publication Date  Title 

Li et al.  Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans  
Zhou et al.  Object tracking using SIFT features and mean shift  
Tso et al.  Classification of multisource remote sensing imagery using a genetic algorithm and Markov random fields  
Kong et al.  A viewpoint invariant approach for crowd counting  
Zhang et al.  Low-rank sparse learning for robust visual tracking
Yang et al.  Fast multiple object tracking via a hierarchical particle filter
Zhou et al.  Visual tracking and recognition using appearance-adaptive models in particle filters
US7529388B2 (en)  Methods for automatically tracking moving entities entering and exiting a specified region
Wei et al.  Convolutional pose machines
Zhang et al.  Robust visual tracking via consistent low-rank sparse learning
US6771818B1 (en)  System and process for identifying and locating people or objects in a scene by selectively clustering three-dimensional regions
Tordoff et al.  Guided sampling and consensus for motion estimation
US9665777B2 (en)  System and method for object and event identification using multiple cameras
Jeyakar et al.  Robust object tracking with background-weighted local kernels
US7929730B2 (en)  Method and system for object detection and tracking
Kratz et al.  Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes
US20040017930A1 (en)  System and method for detecting and tracking a plurality of faces in real time by integrating visual ques
US7813528B2 (en)  Method for detecting objects left-behind in a scene
US7835542B2 (en)  Object tracking systems and methods utilizing compressed-domain motion-based segmentation
Jourabloo et al.  Pose-invariant 3D face alignment
Lanz  Approximate bayesian multibody tracking
Hoseinnezhad et al.  Visual tracking in background subtracted image sequences via multi-Bernoulli filtering
US6542621B1 (en)  Method of dealing with occlusion when tracking multiple objects and people in video sequences
Lin et al.  Hierarchical part-template matching for human detection and segmentation
US6937744B1 (en)  System and process for bootstrap initialization of nonparametric color models 
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: ECOLE POLYTECHNIQUE FEDERALE DE LAUSANNE (EPFL), S
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ALAHI, ALEXANDRE;GOLBABAEE, MOHAMMAD;VANDERGHEYNST, PIERRE;REEL/FRAME:032809/0736
Effective date: 2010-05-17

STCB  Information on status: application discontinuation 
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION