EP1854083A1

EP1854083A1 - Object tracking camera

Info

Publication number: EP1854083A1
Application number: EP06707263A
Authority: EP
Inventors: Sven Fleck
Original assignee: Eberhard Karls Universitaet Tuebingen
Current assignee: Eberhard Karls Universitaet Tuebingen
Priority date: 2005-02-24
Filing date: 2006-02-24
Publication date: 2007-11-14
Anticipated expiration: 2026-02-24
Also published as: ATE497230T1; DE102005009626A1; EP1854083B1; DE502006008806D1; WO2006089776A1

Abstract

The camera has an image sensor unit for producing image data and a processing unit for processing the image data in the camera. The processing unit has a region-of-interest (ROI) selection unit (20) for selecting image areas for the tracking of objects, and a tracking unit (21) with a particle filter (24). The units (20, 21) are provided to determine tracking data of objects to be tracked on the basis of the image data. Independent claims are also included for the following: (1) a method for processing image data in a camera; (2) a multi-camera system with at least two cameras.

Description

Camera for tracking objects

The invention relates to a camera for tracking objects with an image sensor unit for generating image data and to a processing unit for processing the image data transferred from the image sensor unit to the processing unit. The invention also relates to a multi-camera system having at least two cameras and to a method for processing image data in a camera for tracking objects.

Tracking applications based on a network of distributed cameras are becoming increasingly popular in the field of security technology for monitoring airports, train stations, museums or public places, as well as in the field of industrial image processing in production lines and vision-guided robots. Traditional centralized approaches have many disadvantages here. Today's systems typically transmit the complete raw image stream of the camera sensor via expensive and distance-limited connections to a central computer, where all of it then has to be processed. The cameras are thus typically regarded only as simple sensors, and the processing takes place only after elaborate transmission of the raw video stream. This concept quickly reaches its limits in multi-camera systems and with cameras that have high resolutions and/or frame rates.

The invention is thus based on the problem of providing object tracking by means of cameras that is able to work with multiple cameras and bandwidth-limited networks.

According to the invention, a camera for tracking objects is provided for this purpose, with an image sensor unit for generating image data and a processing unit for processing the image data transferred from the image sensor unit to the processing unit, in which the processing unit has an ROI selection unit for selecting image areas of interest for object tracking and a tracking unit for determining tracking data of objects to be tracked from the image data.

According to the invention, the processing of the image data thus already takes place in the camera, so that it is not necessary to transmit the complete, raw video stream in full resolution to an external processing unit. Instead, only the resulting tracking data is transmitted. In addition, by using the Region of Interest (ROI) selector, the image data to be processed is already severely limited in its amount, so that the processing of the data can be done in real time, which is of great importance in tracking applications. Since only the resulting data has to be transmitted by the camera, the use of standard network connections becomes possible in the first place. In addition, no external computer is required to calculate the tracking data, as this is already done inside the camera. An optionally existing central computer can then be used for higher-level tasks.

In a further development of the invention, the tracking data can be output at a signal output of the camera, the tracking data having a significantly reduced amount of data compared with the quantity of image data generated by the image sensor unit, in particular reduced by a factor of about 1000.

Both the selection of image areas that are of interest for object tracking and the calculation of the tracking data within the camera contribute to this considerable reduction of the amount of data to be transmitted. A camera image stream in VGA resolution requires about a third of the 100 Mbps standard Ethernet bandwidth, and this figure applies only when the data are transmitted as a so-called Bayer mosaic; otherwise three times the bandwidth is needed. By contrast, the invention makes a reduction to a few hundred kilobits per second possible, since only the results are transmitted. Since the raw video stream according to the invention is no longer limited by the bandwidth of the connection to the outside, sensors with very high spatial and temporal resolution can be used in the camera according to the invention. Two reasons are responsible for this: on the one hand, because the processing unit is located directly at the sensor, a higher transmission speed is technically much easier to implement than outside the camera; on the other hand, as already mentioned, the ROI selection unit ensures that the current camera image is evaluated only in selected and, for example, dynamically changing regions. This requires region-of-interest (ROI) capable camera sensors, such as CMOS sensors.
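The bandwidth figures above can be checked with a short calculation. The following is a rough sketch only: the frame rate, the bit depth and the layout of the transmitted tracking result (values per support point, bits per value) are assumptions chosen for illustration and are not stated in the description.

```python
# Back-of-the-envelope bandwidth comparison (all numeric inputs are assumptions).
WIDTH, HEIGHT = 640, 480      # VGA resolution
BITS_PER_PIXEL = 8            # assumed: one 8-bit Bayer sample per pixel
FRAME_RATE = 15               # assumed frames per second
ETHERNET_BPS = 100_000_000    # 100 Mbps standard Ethernet

raw_bps = WIDTH * HEIGHT * BITS_PER_PIXEL * FRAME_RATE
rgb_bps = 3 * raw_bps         # three colour samples per pixel instead of one

NUM_SUPPORT_POINTS = 100      # assumed number of support points
VALUES_PER_POINT = 5          # assumed: x, y, vx, vy and a weight
BITS_PER_VALUE = 32           # assumed 32-bit values
tracking_bps = NUM_SUPPORT_POINTS * VALUES_PER_POINT * BITS_PER_VALUE * FRAME_RATE

print(f"Bayer stream   : {raw_bps / 1e6:.1f} Mbps")
print(f"full RGB stream: {rgb_bps / 1e6:.1f} Mbps")
print(f"tracking result: {tracking_bps / 1e3:.0f} kbps")
```

Under these assumptions the Bayer stream occupies roughly a third of a 100 Mbps link, the full-colour stream would exceed it, and the tracking result stays in the range of a few hundred kilobits per second. The exact ratio depends strongly on the resolution and frame rate assumed.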

In a further development of the invention, the tracking data are provided in the form of a probability density function, in particular an approximated one. Advantageously, the probability density function is approximated by a plurality of support points.

By means of a probability density function, in particular an approximated one, exclusively the target data of interest for a tracking application, such as the position and speed of an object to be tracked, are calculated and then output by the camera. Approximating the probability density function by a plurality of support points, whose position and number may be changed adaptively, significantly reduces the computational effort. Nevertheless, it has been shown that a precision sufficient for tracking applications can be achieved.

In a development of the invention, parallel processing means are provided in the processing unit for the parallel processing of the support points of the probability density function and of data dependent thereon.

In this way, many support points can be processed very quickly. For example, one hundred identical hardware circuits are provided for one hundred support points. As a result, the invention makes it possible to realize object tracking with high precision in real time.

In a development of the invention, the tracking unit implements a so-called particle filter, in which a probability density function p(X_t | Z_t) is approximated on the basis of support points using an approximation step, a prediction step and a measurement step. X_t denotes the state at time t and Z_t all measurements up to and including time t. In the approximation step, the probability density function is sampled and new support points for approximating the state vector X_t are thus determined. In the prediction step, the new state vector X_t of an object to be tracked is determined from the old measurements Z_{t-1} and the old state vector X_{t-1}, taking into account a stored motion model, and in the measurement step the new state vector X_t is weighted taking into account a new measurement. In the approximation step, the probability density function p(X_t | Z_t) resulting from all new state vectors is approximated anew by support points.
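The three steps can be illustrated with a deliberately simple one-dimensional sketch. It is not the patented implementation: the function name `particle_filter_step`, the constant-velocity motion model, the Gaussian scatter and the Gaussian measurement likelihood are all assumptions made for the example.

```python
import math
import random

def particle_filter_step(points, weights, measurement, rng,
                         velocity=1.0, motion_noise=0.5, meas_noise=1.0):
    """One iteration over 1-D support points: approximate, predict, measure."""
    n = len(points)
    # Approximation step: draw new support points according to the old weights.
    resampled = rng.choices(points, weights=weights, k=n)
    # Prediction step: apply the (assumed) motion model, then scatter the position.
    predicted = [x + velocity + rng.gauss(0.0, motion_noise) for x in resampled]
    # Measurement step: weight every new state by the new measurement.
    raw = [math.exp(-0.5 * ((measurement - x) / meas_noise) ** 2) for x in predicted]
    total = sum(raw)
    if total == 0.0:                       # all weights underflowed
        new_weights = [1.0 / n] * n
    else:
        new_weights = [w / total for w in raw]
    return predicted, new_weights

rng = random.Random(0)
points = [rng.uniform(0.0, 10.0) for _ in range(200)]
weights = [1.0 / len(points)] * len(points)
for t in range(1, 11):                     # the object moves one unit per time step
    true_position = 5.0 + t
    points, weights = particle_filter_step(points, weights, true_position, rng)
estimate = sum(x * w for x, w in zip(points, weights))
print(f"estimated position after 10 steps: {estimate:.2f}")
```

Each support point is handled independently except for the final normalization, which is what makes the scheme attractive for parallel hardware.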

The use of a so-called particle filter in the tracking unit allows fast processing of even large amounts of image data while still achieving high-precision object tracking.

In a development of the invention, the tracking unit transmits tracking data of objects to be tracked, in particular a predicted comparison object, to the ROI selection unit in order to select the image areas of interest for the processing as a function of the tracking data.

By selecting the image areas of interest on the basis of tracking data, it can be ensured with high probability that only relevant image areas are evaluated. For example, it is possible to use the tracking data to calculate back to a comparison object of the object to be tracked, and it is then decided on the basis of this comparison object which image areas from the current camera image should be selected. In the case of an object to be tracked that moves at a constant speed, the comparison object would thus correspond to the image of the object in the last camera shot; only its position would be shifted.

In a development of the invention, the predicted comparison object is generated by means of a stored parametric model that can be adapted.

In the case of more complicated objects to be tracked, changes in the object that are not contained in the motion model can thereby be taken into account, for example rotations of a human head which, seen from the same viewing direction, lead to completely different views of the head. It is essential that the adaptation is made only if it is certain that the object to be tracked is actually in view. For example, the stored parametric model must not be adapted if only low probability values are determined over the entire camera image. If, in this case, the location with the highest probability were used for updating the model even though the object to be tracked is no longer in the image area, the model would be adapted in such a way that the object to be tracked could no longer be found again subsequently. In the adaptive adaptation of the model, care must therefore be taken that probability values are evaluated not only relatively but also absolutely, in order ultimately to detect whether the probability density function p(X_t | Z_t) is unimodal.

In a development of the invention, the image data of the image area selected by the ROI selection unit is converted into a color histogram in the processing unit and the tracking unit determines the tracking data on the basis of the color histogram.

The use of a color histogram makes the processing algorithms more robust against rotations, partial occlusion and deformation. For example, the HSV color space (hue, saturation, value) is used, which offers advantages over the RGB color space. Alternatively, the RGB color space (red, green, blue) or the CMY color space (cyan, magenta, yellow) can be used.
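A minimal sketch of such a conversion is given below, using Python's standard `colorsys` module. The function name `hsv_histogram` and the bin counts are assumptions for illustration; the description does not specify how the histogram is quantized.

```python
import colorsys

def hsv_histogram(pixels, h_bins=8, s_bins=4, v_bins=4):
    """Normalized HSV colour histogram of an iterable of 8-bit (r, g, b) tuples."""
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Map each channel in [0, 1] to a bin index; clamp the value 1.0.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist

roi = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (0, 255, 0)]
hist = hsv_histogram(roi)
shuffled = hsv_histogram(list(reversed(roi)))   # same pixels, different arrangement
```

Because the histogram discards the spatial arrangement of the pixels, `hist` and `shuffled` are identical, which illustrates why a histogram representation tolerates rotations and deformations of the object.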

In a development of the invention, the ROI selection unit controls the image sensor unit as a function of the tracking data in such a way that only those image data are transferred from the image sensor unit to the processing unit which correspond to the image areas selected by the ROI selection unit.

Thus, according to the invention, the bandwidth from the sensor to the processing hardware can be significantly reduced by transferring to the processing unit only the combination of image areas that is actually necessary for executing the tracking algorithm. This happens regardless of the physical resolution of the sensor. These regions of interest are generated dynamically from frame to frame and transmitted to the sensor. Of course, the sensor must allow such direct access to image areas, but this is the case with today's CMOS sensors.

In a development of the invention, the image sensor unit and the processing unit are integrated in a common housing.

This makes it possible to accommodate the image sensor unit and the processing unit spatially close to each other and to protect them against environmental influences. Since the bandwidth requirements for an external connection of the cameras are very low, there are only few restrictions on the positioning of the cameras. This is especially true when multiple cameras communicate with each other via a wireless network.

In a development of the invention, the processing unit has a network unit.

The camera according to the invention can thereby be integrated without problems into a network, for example a wireless network. This is possible at all only because the invention requires very little bandwidth for transmitting the results calculated in the camera to the outside.

In a development of the invention, a control unit and setting means are provided in order to change setting parameters of the camera, in particular alignment, image detail and magnification, as a function of the tracking data.

Since the camera calculates the tracking data itself, a control unit in the camera can then also carry out the tracking movement of the camera. It is essential that no signal transmission to the outside is required for this purpose. The failure of a network to which the camera is connected is thus not detectable from the outside. Even if there is no longer any connection from the camera to a central evaluation station, the camera continues to follow the object and thus maintains the impression of continuous monitoring, which can continue seamlessly as soon as the connection is established again.

The problem underlying the invention is also solved by a method for processing image data in a camera for tracking objects, in which the following steps are provided:

Transferring image data from an image sensor unit to a processing unit of the camera,

Generating tracking data on objects to be tracked in the processing unit using probabilistic methods and

Selecting regions of the image data in dependence on the tracking data, so that only image data are selected in which there is an increased probability that they contain information about objects to be tracked.

With the method according to the invention, it is possible to transmit only the result data of an object tracking from the camera to the outside, so that the transmission bandwidth required externally is already substantially reduced. In addition, only those image data are selected for processing which are more likely to contain information about objects to be tracked, for example by means of feedback of the tracking data to a selection unit. This creates the opportunity to realize object tracking by means of cameras in real time even with high spatial and temporal resolution.

In a further development of the invention, the step of selecting regions of the image data includes driving the image sensor unit in such a way that only image data are transferred from the image sensor unit to the processing unit, where there is an increased probability that they contain information about objects to be tracked.

Thereby, the amount of image data to be transmitted by the image sensor unit can be significantly reduced.

In a development of the invention, the step of generating tracking data comprises approximating a probability density function by means of a plurality of interpolation points.

In this way, the computational effort to generate the tracking data can be significantly reduced. In addition, circuits for processing the individual support points in hardware or software can be executed in parallel, so that a very fast generation of the tracking data is possible.

In a development of the invention, the step of generating tracking data includes the generation of image data of a comparison object based on a probability density function of the objects to be tracked and at least one stored parametric model of the objects to be tracked.

In this way, the calculated tracking results can be converted back into image data, and these image data of a comparison object can then be compared with the current camera image in order to assess the quality of the tracking results and adjust them if necessary. In addition, the image data of the comparison object can be used to select, by means of the selection unit, only those image data which essentially correspond to the image detail of the comparison object.

The problem underlying the invention is also solved by a multi-camera system having at least two cameras according to the invention, in which each camera has a network unit and the at least two cameras are connected to one another via a network, in particular Ethernet or WLAN.

Since the cameras according to the invention require only a small bandwidth for transmitting the tracking results to the outside, multi-camera systems with the cameras according to the invention can be realized on the basis of standard network applications. This is also possible with wireless network connections, for example. The communication over the network can of course be bidirectional. The cameras can not only output the result data, but also receive information about objects to be tracked or control signals for setting and aligning the camera optics via the network.

In a development of the invention, the processing unit of at least one of the cameras is designed to process tracking data of another camera.

In this way, an object to be tracked can, for example, be handed over from one camera to the next.

In a development of the invention, a central processing unit is provided in the network for evaluating the tracking data transmitted by the at least two cameras.

With a central processing unit, further evaluations using the tracking data can then be made. For example, typical motion sequences can be used for object recognition or to recognize emergency situations.

Further features and advantages of the invention will become apparent from the claims in conjunction with the following description of preferred embodiments of the invention and the drawings. In the drawings:

FIG. 1 shows a schematic representation of a camera according to the invention for object tracking,

FIG. 2 shows a schematic representation of a multi-camera system according to the invention,

FIG. 3 shows a block diagram of a preferred embodiment of the camera according to the invention,

FIG. 4 shows a schematic representation of a multi-camera system according to the invention in an application for beach monitoring,

FIG. 5 shows a schematic representation of another embodiment of a camera according to the invention,

FIG. 6 shows a schematic representation of a multi-camera system according to the invention,

FIG. 7 shows a schematic representation to illustrate the method according to the invention,

FIG. 8 shows a representation of different time scales for use in the method according to the invention,

FIG. 9 shows several representations for the contour-based determination of a region of interest in the method according to the invention, and

FIG. 10 shows representations of a probability density function of a tracked object according to the method according to the invention.

FIG. 1 shows a camera 10 according to the invention for object tracking, which has an image sensor unit 12 and a processing unit 14 in a common housing. The image sensor unit 12 is designed, for example, as a CMOS sensor and supplies image data to the processing unit 14. In the processing unit 14, tracking data are generated which characterize an object to be tracked, at least in terms of position and speed and also, for example, in terms of shape, color and the like. For this purpose, the processing unit 14 has a so-called tracking unit in which the tracking data are generated. Furthermore, the processing unit 14 has a region-of-interest (ROI) selection unit, with which the image sensor unit 12 can be controlled in such a way that only the image areas that are of interest for the object tracking are transferred to the processing unit 14. These are, for example, dynamically changing image areas, with the ROI selection unit also taking the tracking data into account when selecting the image areas. From the image sensor unit 12 to the processing unit 14, only those image areas are transmitted for which there is a high probability that they can provide information about the object to be tracked.

The combination of an ROI selection method and the generation of the tracking data within the camera 10 itself means that the result output of the camera 10, symbolized by a double arrow 16, requires only a very small bandwidth and that this result transmission can take place over a standard network. In addition, the generation of the tracking data within the camera 10 can be done so fast that real-time applications can be realized. The structure of the camera 10 will be explained in more detail below.

FIG. 2 shows a multi-camera system with several cameras 10a, 10b, 10c according to the invention. Each of the cameras 10a, 10b and 10c is constructed identically to the camera 10 of FIG. 1. The cameras 10a, 10b, 10c are connected to each other via a network 18. External triggering or synchronization via the connection of the cameras can ensure that they work synchronously. Data exchange with the network 18 can be bidirectional, so that tracking data of an object to be tracked can be passed from the camera 10a to the camera 10b, for example, when the object to be tracked leaves the detection area of the camera 10a. In the same way, the tracking data can also be transferred from the camera 10a to the camera 10c, and depending on which detection area an object to be tracked moves into, the camera recognizing the object to be tracked can then output further tracking results.

In the block diagram of FIG. 3, the structure of the camera 10 of FIG. 1 is shown in more detail. The image sensor unit 12 generates image data and supplies it to the processing unit 14, the processing unit 14 being indicated in FIG. 3 merely by a dashed outline. The image data from the image sensor unit 12 are first transferred to an ROI selection unit 20, which initially only passes the image data through or buffers them in a cache so that double or multiple transmission of overlapping image areas is avoided. The task of the ROI selection unit 20 is to control the image sensor unit 12 so that only the image areas of interest for further processing are forwarded. How the ROI unit 20 determines these image areas of interest will be explained below. If the ROI unit 20 does not fulfill a buffering function, the image sensor unit 12 can also pass on the image data while bypassing the ROI unit 20.

At 22, image data are thus available for image areas in which there is a high probability that they contain information about the objects to be tracked.

This image data is passed to an optional filter 24, which then provides the filtered data at 26. The filter 24 can, for example, convert the image data from 22 into a color histogram in the HSV color space (hue, saturation, value). Alternatively, the filter 24 can also produce a color histogram in the RGB color space (red, green, blue). The conversion into color histograms has the advantage that the robustness of the subsequent evaluation is significantly increased, for example against rotations and/or changes in shape of an object to be tracked.

The filtered image data 26 are then fed to a comparison unit 28, in which a comparison measurement is performed and the image data 26 corresponding to the object to be tracked are compared with similarly prepared data of a comparison object. The resulting weights of all support points must then be normalized. The comparison unit 28 then outputs an approximated probability density function 30, which is at the same time the central output of the camera 10. The probability density function 30, which is efficiently approximated by means of several support points, represents the result of the tracking unit and requires only a small bandwidth for transmission over a network. The approximated probability density function 30 may then be output via a network I/O unit 32 and supplied to further units that perform further processing based on this result.
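The comparison measurement and the subsequent normalization might look as follows for histogram data. This is a sketch under stated assumptions: the description does not name a similarity measure, so the Bhattacharyya coefficient, the exponential weighting and the `sharpness` parameter are choices made here for illustration, and the function names are hypothetical.

```python
import math

def bhattacharyya(p, q):
    """Similarity of two normalized histograms; 1.0 means identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def normalized_weights(candidate_hists, model_hist, sharpness=20.0):
    """Weight one candidate histogram per support point, then normalize."""
    raw = [math.exp(-sharpness * (1.0 - bhattacharyya(h, model_hist)))
           for h in candidate_hists]
    total = sum(raw)
    return [w / total for w in raw]

model = [0.7, 0.2, 0.1]                          # histogram of the comparison object
candidates = [[0.7, 0.2, 0.1],                   # support point 0: perfect match
              [0.1, 0.2, 0.7],                   # support point 1: poor match
              [0.4, 0.3, 0.3]]                   # support point 2: partial match
weights = normalized_weights(candidates, model)
```

Only the final division by `total` needs the weights of all support points; the similarity of each candidate can be computed independently.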

For example, in a unit 34, a maximum-likelihood state, i.e. the state in which the probability density function is maximum, is calculated. In the present approximation by support points, this means that the support point with the highest weight is used. Furthermore, an expected value can be calculated in the unit 34. The unit 34 may also output the result of its evaluation via the network I/O unit 32 to a network. A control unit 36 uses the probability density function 30 for control applications. For this purpose, the control unit 36 generates control signals for a so-called pan-tilt unit, on which the camera 10 is mounted. By means of this pan-tilt unit, the camera 10 can be made to follow an object to be tracked. Alternatively, the control signals of the control unit 36 may also be output to a robot controller or CNC machine controller.
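Both evaluations reduce to a few lines when the probability density function is given as weighted support points. The sketch below assumes two-dimensional (x, y) states; the function names are hypothetical.

```python
def maximum_likelihood_state(points, weights):
    """Support point with the highest weight."""
    best = max(range(len(points)), key=lambda i: weights[i])
    return points[best]

def expected_state(points, weights):
    """Weighted mean of the support points, component by component."""
    total = sum(weights)
    dims = len(points[0])
    return tuple(sum(w * p[d] for p, w in zip(points, weights)) / total
                 for d in range(dims))

points = [(10.0, 20.0), (12.0, 22.0), (30.0, 40.0)]   # (x, y) positions
weights = [0.25, 0.5, 0.25]
ml = maximum_likelihood_state(points, weights)
mean = expected_state(points, weights)
```

Here `ml` is the support point (12.0, 22.0), while `mean` is pulled towards the outlying third point, which shows how the two estimates can differ for a multimodal distribution.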

Further units 38, which use the probability density function 30 for further processing, generate, for example, commands for handing over persons or objects in a multi-camera system when a person moves from the field of view of one camera to that of the next. In this regard, it should be noted that a target object is in principle initialized by presenting it in front of the camera and training. However, it is also possible, and useful for surveillance applications, to trigger the initialization of the target object on the first object that moves. A movement is assumed if the difference from the previous camera image or from several preceding camera images is greater than a predefined threshold value. The units 34, 36 and 38 may output their respective results via the network I/O unit to a network or, if there is no network, to a signal line.
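The motion trigger can be sketched as a simple frame-difference test. The description only requires that a difference exceed a threshold; measuring that difference as a sum of absolute grey-value differences, and the function name `motion_detected`, are assumptions of this example.

```python
def motion_detected(previous, current, threshold):
    """True if the summed absolute grey-value difference exceeds the threshold."""
    diff = sum(abs(c - p)
               for prev_row, cur_row in zip(previous, current)
               for p, c in zip(prev_row, cur_row))
    return diff > threshold

frame_a = [[10, 10, 10], [10, 10, 10]]
frame_b = [[10, 10, 10], [10, 90, 95]]   # something entered the lower right corner

still = motion_detected(frame_a, frame_a, threshold=50)
moved = motion_detected(frame_a, frame_b, threshold=50)
```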

The probability density function 30 is also supplied to a so-called update unit 40, in which a time index of the probability density function being calculated is reduced by one in order to classify the probability density function just calculated no longer as the current value but as the most recent old value. The update unit 40 is thus the first station of a feedback loop within the tracking unit 21.

In this feedback loop, on the one hand, a prediction is made as to how the probability density function is likely to appear at the next time step, and based on this prediction, a comparison object is again generated which, as already described, is then compared in the comparison unit 28 with the currently detected object. In addition, in this feedback loop, a weighting of the individual nodes is made and based on this weighting, it is decided whether a redistribution of the support points for the next pass of the loop is required.

Thus, at 42, there is a probability density function that initially differs from the probability density function 30 only by its time index, which has been reduced by one. At 42, however, the sampling of the approximated probability density function already described is also carried out on the basis of the weighting of the individual support points.

This probability density function from 42 is linked for prediction to a motion model 44, which in the illustrated embodiment is also in the form of a probability density function. In the simplest case, i.e. movement at constant velocity in one direction, linking the probability density function from 42 with the motion model from 44 would only cause a coordinate shift. The motion model from 44 is linked with the probability density function from 42 in a prediction unit 46. Within the prediction unit 46, a convolution of the motion model with the probability density function is performed, as set forth in the equation shown below the unit 46.

In the approximation step between 42 and 46, a new distribution of support points is generated on the basis of the weighting of the support points: support points of high weight receive a number of successors corresponding to their weighting in the last iteration, all of which are initially arranged at the same position. In the prediction at 46, the positions of the new support points are scattered after the motion model has been applied. The motion model needs to be applied only once per new support point; only then is the position scattered. Support points with low weighting receive no successor.
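One way to allocate successors in proportion to the weights is sketched below. The largest-remainder rounding is an assumption chosen so that the total number of support points stays constant; the description does not prescribe a particular allocation rule, and the function names are hypothetical.

```python
import random

def allocate_successors(weights):
    """Number of successors per support point, proportional to its weight."""
    n = len(weights)
    total = sum(weights)
    shares = [n * w / total for w in weights]
    counts = [int(s) for s in shares]
    # Hand the remaining slots to the largest fractional parts.
    leftovers = sorted(range(n), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in leftovers[: n - sum(counts)]:
        counts[i] += 1
    return counts

def resample_and_predict(points, weights, velocity, scatter, rng):
    """Successors start at the parent position, then move once and scatter."""
    successors = []
    for x, k in zip(points, allocate_successors(weights)):
        successors.extend([x] * k)             # all at the same position first
    return [x + velocity + rng.gauss(0.0, scatter) for x in successors]

weights = [0.5625, 0.25, 0.125, 0.0625]
counts = allocate_successors(weights)
new_points = resample_and_predict([0.0, 1.0, 2.0, 3.0], weights,
                                  velocity=1.0, scatter=0.1, rng=random.Random(1))
```

With these weights the first support point receives two successors and the last one none, so the point set concentrates where the weight is high while its size stays at four.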

As a result of the prediction in unit 46, a new probability density function is output at 48, which accordingly represents a predicted position based on the knowledge previously available.

In order to be able to compare this prediction at 48 with the image data acquired by the image sensor unit 12, the predicted probability density function from 48 is linked in a rendering unit 50 to a parametric model from 52. The rendering step in the rendering unit 50 generates the image data of a comparison object. In the simplest case of an object moving linearly at a constant speed, the image data of the comparison object would thus correspond to the object displaced by a certain distance.

The parametric model from 52 can be adapted depending on external circumstances. This is of importance, for example, when objects with complex geometry are to be tracked whose shape may even change, or whose projection changes as a function of rotational position or with changing illumination. When adapting the parametric model in 52, however, it must be ensured that an adaptation is carried out only if it is very likely that what is seen really is the object to be tracked, which has now changed its appearance. For example, the surroundings of the support point of the probability density function with the relatively highest weighting must not be used for adaptation at every step. If the object to be tracked is in fact no longer located in the viewed image section, an adaptation carried out then would change the parametric model in such a way that the object to be tracked could no longer be recognized. This can be remedied, for example, by additionally testing the absolute weighting of the support point with the relatively highest weight: only above a defined weighting, i.e. when it can be assumed with great certainty that it is the object to be tracked, are the surroundings of this support point used for adaptation. An image region (ROI) of the target object can serve as the model. Alternatively, the model 52 can also be a so-called AAM (Active Appearance Model) implementation; this non-rigid and optionally textured model is advantageous in particular in the case of changes in shape. A three-dimensional AAM is also possible. As already stated, the filter 24 can be omitted completely. It is also possible to use a contour-based method as a model, in which the state determines the shape of the contour, for example with splines.

As a result of the rendering step in 50, image data of a comparison object are thus available at 54. These image data of the comparison object at 54 are now compared with the currently recorded image data from 22. In order to ensure that the image data of the comparison object are comparable with the currently recorded image data, the image data from 54 are subjected to the same filtering as the image data from 22; a filter unit 56 identical to the filter unit 24 is provided for this purpose, and the filtered image data of the comparison object are then available at 58. As already described, the image data of the object to be tracked currently recorded by the image sensor unit 12 and the image data of the comparison object are then compared with one another in the comparison unit 28. According to the equation shown below the comparison unit 28, the comparison measurement corresponds to a weighting of the new state X_t according to the new measurement Z_t. As already stated, the probability density function 30 is the result of the comparison measurement in the comparison unit 28.
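The equations referred to beneath the prediction unit 46 and the comparison unit 28 are not reproduced in the text. As a hedged reconstruction, the standard recursive Bayesian filter that a particle filter approximates has the following prediction and measurement steps, written with the symbols X_t and Z_t used in this description. The symbol z_t, denoting the single new measurement at time t, is notation introduced here, and the exact form in the figure may differ.

```latex
\begin{align}
p(X_t \mid Z_{t-1}) &= \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{t-1})\, \mathrm{d}X_{t-1} \\
p(X_t \mid Z_t) &\propto p(z_t \mid X_t)\, p(X_t \mid Z_{t-1})
\end{align}
```

The first line is the combination of the motion model p(X_t | X_{t-1}) with the old probability density function; the second line is the weighting of the predicted state by the new measurement, followed by normalization.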

In the special case of working with color histograms, it is sufficient to store the already filtered representation as the model, since here the result of the filtering is always the same and does not depend on the state X_t. The model can thus be used directly at 58 and does not need to be calculated for each sample in each iteration through steps 52-50-54-56-58. The steps 52-50-54 then serve only for ROI determination. In this way, the relatively expensive filtering step 56 can be saved. The model at 58 can be adapted by blending the filtered representation of the current image data of the highest-weighted support point at 26 with the filtered representation of the model at 58.
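Such a blending update, combined with the absolute-weight check that the description requires before any adaptation, might be sketched as follows. The linear blend, the blending factor `alpha`, the threshold `min_weight` and the function name `adapt_model` are illustrative assumptions and are not taken from the description.

```python
def adapt_model(model_hist, best_hist, best_weight, min_weight=0.5, alpha=0.1):
    """Blend the best candidate into the model, but only if it is trustworthy.

    best_weight is the absolute (unnormalized) weight of the best support point;
    min_weight and alpha are illustrative values.
    """
    if best_weight < min_weight:
        return list(model_hist)            # object probably not in view: keep model
    return [(1.0 - alpha) * m + alpha * b for m, b in zip(model_hist, best_hist)]

model = [0.5, 0.5, 0.0]
observed = [0.0, 0.5, 0.5]
unchanged = adapt_model(model, observed, best_weight=0.25)
updated = adapt_model(model, observed, best_weight=0.75, alpha=0.5)
```

In the first call the best support point is only relatively, not absolutely, good, so the model is left untouched; in the second call the model moves halfway towards the observed histogram.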

Moreover, the image data of the comparison object at 54 is also supplied to the ROI selection unit 20. The ROI unit 20 then controls the image sensor unit 12 so that only those regions of interest are requested that correspond to the image regions of the image data of the comparison object at 54. As a result, the amount of data that must be output from the image sensor unit 12 is significantly reduced. In addition, the ROI unit 20 implements a caching method for overlapping ROIs of the same iteration, so that overlapping areas of different regions of interest need only be transferred once.
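
The effect of transferring overlapping areas only once can be illustrated with a small sketch. It assumes, purely for illustration, that each ROI is an axis-aligned rectangle (x, y, width, height) in pixel coordinates; the function name is hypothetical, and a real sensor interface would request rectangles rather than individual pixels.

```python
def union_pixels(rois):
    """Collect every pixel covered by at least one ROI exactly once."""
    pixels = set()
    for x, y, w, h in rois:
        pixels.update(
            (row, col) for row in range(y, y + h) for col in range(x, x + w)
        )
    return pixels


rois = [(0, 0, 4, 4), (2, 2, 4, 4)]              # two overlapping 4x4 regions
naive_pixels = sum(w * h for _, _, w, h in rois)  # every ROI transferred fully
cached_pixels = len(union_pixels(rois))           # overlap transferred once
print(naive_pixels, cached_pixels)                # 32 28
```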

In the ROI unit 20, starting from the comparison object induced by the state X_t, the image region (ROI) is determined that is actually needed to evaluate this state, that is, the hypothesis which manifests itself in the comparison object. Technically, this is done for each sample X_t^(l).

It can be seen from the illustration of FIG. 3 that the camera according to the invention and the method implemented are highly suitable for parallel processing. Only for determining the probability density function 30, or rather its approximation by multiple support points, do all support points have to be merged and normalized. The other calculation steps explained can be carried out separately for each support point and can, for example, also be implemented in parallel hardware. The camera according to the invention and the method according to the invention are therefore particularly suitable for real-time applications.
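
The split between independent per-support-point work and the single joint normalization step can be sketched as follows. The Gaussian weighting and the value of `sigma` are assumptions made for illustration; the distances stand for whatever comparison measure the comparison unit produces.

```python
import math


def weight(distance, sigma=0.2):
    """Gaussian weighting of one support point from its comparison distance."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))


def normalize(weights):
    """The only step that needs all support points at once."""
    total = sum(weights)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


distances = [0.05, 0.30, 0.60]         # one comparison distance per support point
raw = [weight(d) for d in distances]   # independent, could run in parallel
pdf = normalize(raw)                   # merge and normalize
```

Each call to `weight` depends on one support point only, so this part maps directly onto parallel hardware; `normalize` is the single synchronization point.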

The invention can also be applied to cameras with more than one sensor element. For example, a stereo camera is possible or even the combination of a conventional image sensor and a thermal image sensor. Such a combination is of particular interest for surveillance applications. Fusion of the results from the two different sensors would then be performed in unit 38 in, for example, FIG. 3.

FIG. 4 shows a multi-camera system according to the invention in a schematic representation in a possible application scenario. Today, swimming areas by the sea or by a lake are monitored by lifeguards in order to save injured or exhausted people from drowning. Here, a bathing section is monitored by a multi-camera system with cameras 60a, 60b, 60c, 60d and 60e. The cameras 60a, 60b, 60c, 60d and 60e are interconnected by means of a wireless network, not shown. The cameras are mounted on a pier 62 and on rescue towers 64, 66. A suitable monitoring algorithm, for example implemented in the unit 38 of FIG. 3, monitors whether a critical situation exists, for example whether a swimmer 68 is in trouble. This can be done, for example, by recording and checking movement sequences and by keeping a running balance of the total number of persons in the water. Changes in the total number of people in the water that persist for a longer time can then, for example, trigger an alarm. Lifeguards and ambulances should also be equipped with wireless, network-enabled devices, such as PDAs (personal digital assistants) or laptops with a network connection. It is essential for this application of the invention that the cameras according to the invention output only the result data and therefore place only small demands on the computing capacity of display devices that are likewise located in the network. With the multi-camera system according to the invention it is therefore possible to display the results of all cameras 60a, 60b, 60c, 60d and 60e on an external device with low computing power, for example a PDA. Of course, communication between the lifeguards can take place via the same network. In addition to lifeguards, a surfer 70 whose surfboard has a networkable display unit could, for example, also be informed about the danger situation. Of course, the cameras 60a, 60b, 60c, 60d and 60e can also be realigned, programmed, configured and parameterized via the network. In addition, the cameras 60a, 60b, 60c, 60d and 60e may also be connected to a non-local network, such as the Internet.

Another possible application of the cameras according to the invention is so-called indoor navigation with a mobile phone. The camera is part of a modern mobile phone. Optionally, the mobile phone has other sensors, such as inertial and position sensors. The mobile phone also has a computing unit in which a localization algorithm is implemented. On entering an airport, for example, a three-dimensional map of the airport is transmitted to the mobile phone along with additional symbolic information, such as terminal names, restaurants, and the like. In this embodiment, the state X_t of the overall system designates the position within the building. While the user walks around with the appropriately equipped mobile phone, image sequences are continuously recorded. From these measurements, the probabilistic tracking method ultimately allows a current position to crystallize, which can then be output, for example on the 3D map.

The schematic representation of FIG. 5 shows a further embodiment of a camera 71 according to the invention. The camera 71 is in itself identical to the embodiments already described, but a panoramic mirror 74 is arranged in the detection range of an image sensor unit 72. This panoramic mirror 74 is spaced from the image sensor unit 72 and allows an omnidirectional view for the tracking, that is, tracking is possible in all directions simultaneously. The captured image regions are to be warped accordingly using known calibration techniques.

With the camera according to the invention and the method according to the invention, it is thus now possible to track a person automatically within a camera view by means of tracking methods and thus to output only the position of the person instead of the live video stream. When the camera according to the invention is used, only a very low bandwidth is required of a data connection from the camera to the outside, so that monitoring tasks can be performed within a network of cameras without any problems. In fact, because of the low bandwidth that the cameras according to the invention require of the network, any decentralized architecture and a virtually unlimited expansion of the network with further cameras are possible.

In today's surveillance practice, it is often the case that the live video streams of a large number of cameras are displayed on a large number of monitors. If a person is to be tracked, such as a potential thief in a department store or a suspicious person at an airport, the observer must on the one hand perform the tracking manually, i.e., not lose sight of the person on the monitor. On the other hand, once the person has left the viewing angle of one camera, the observer has to switch to the next nearest camera and adjust to the new angle of view. As already mentioned, the invention now makes it possible to follow a person automatically; in the following, the representation or visualization according to the invention of the information obtained will be described.

With the invention it is possible to integrate the information of several so-called smart cameras according to the invention and then to visualize it in a common model, in particular a three-dimensional world model. This makes it possible to visualize the path of a person in a 3D model, decoupled from the respective cameras, i.e., across camera views. In this case, the angle of view on the person can be freely selected, for example "flying along" with the person. The use of three-dimensional models for visualizing monitoring results according to the invention therefore makes it possible to use less abstract representations than known visualizations. In addition, the invention makes it possible to provide the monitoring results, visualized in a common coordinate system, at any location of a network and thus to have them available ubiquitously. The results can be expressed in global coordinate systems and embedded in a three-dimensional world model.

FIG. 6 shows an overview of an installation according to the invention. Reference numeral 80 designates the outline of a building entrance in which a total of six smart cameras 82, 84, 86, 88, 90 and 92 according to the invention are positioned. All the cameras 82 to 92 are connected to a visualization unit 94, which may be designed, for example, as a portable visualization client in the network. In the visualization unit 94, the monitoring results, for example the results of a person tracking, are embedded in a three-dimensional model. The connections of the cameras 82 to 92 with the visualization unit 94 are only indicated schematically; any type of network connection in any configuration and topology can be set up, for example a bus connection or, alternatively, wireless network connections. In addition, illustrations of the viewing angle of the individual cameras 82 to 92, each in the form of a snapshot, are also included in FIG. 6.

The representation of FIG. 7 schematically illustrates the steps that are carried out in the visualization according to the invention. The smart cameras 82 to 92 each output a probability density function approximated by support points. This probability density function can be output in spatial coordinates. In the example shown in FIG. 7, the probability density function is output over two-dimensional coordinates x, y. The output probability density function can then be represented, for example, three-dimensionally, with a ground plane representing the coordinate plane x, y and the value of the probability density function being plotted upward from this ground plane. This three-dimensional representation is designated by reference numeral 96 in FIG. 7. Reference numeral 98 in FIG. 7 denotes a plan view of the representation 96. The values of the probability density function can then be represented, for example, in color-coded form.

These probability density functions can then be visualized in a 3D model 100, such that positions, paths, and textures of individuals appear in the 3D model. As has already been explained, the viewing angle onto this 3D model is arbitrary; as shown in FIG. 7, it is possible, for example, to choose a bird's-eye view, but it is also possible to choose a perspective "flying along" with the tracked person.

The method according to the invention, including the visualization, will be explained again below.

In a first step, a three-dimensional model of the environment or of a building to be monitored is recorded or read in, for example in the form of a CAD (computer-aided design) file. The smart cameras are or will be installed at suitable locations in the building and added to a network. The smart cameras must then be calibrated relative to the three-dimensional model. Preferably, the three-dimensional model is georeferenced, so that after calibration the outputs of the smart cameras are georeferenced as well.

In the actual tracking operation, for example, a person walks into the field of view of a smart camera, is automatically detected by the smart camera, recorded as a new target object and tracked with the particle filter method already described. The same is possible for further people, so that multi-person tracking can be realized. The visualization of the tracking then takes place in the three-dimensional model, for which different display modes can be provided: for example, flying along with a single person, from the perspective of individual cameras, or as a graphical visualization of the path a person has taken so far. A person or an object is represented in the three-dimensional model by means of a generic three-dimensional person model. Optionally, the person's current appearance can be mapped as a texture onto the three-dimensional person model or represented as a sprite, i.e., as a graphic object superimposed on the visualization model.

It is essential that the results of the visualization are available throughout the network and thus ubiquitously: for example in a control room, on a PC, but also on mobile devices such as PDAs (personal digital assistants) or smartphones with a wireless network interface (WLAN), which can even be operated decoupled from the smart cameras. Each user can select their own display mode, independently of the other users and of the smart cameras.

According to a specific embodiment, a user with a network-capable visualization client, for example a PDA or smartphone, moves within the field of view of one or more smart cameras and is thereby simultaneously included in the tracking, in other words is tracked by the smart cameras. With the monitoring results visualized on the PDA or smartphone, the user can thus directly see his own position and thereby perform a self-localization. Building on this effect, a navigation system can be operated for such a user which, in contrast to GPS (Global Positioning System), also operates with high precision within a building. As a result, services such as route guidance to a specific office, even across floors, or within an airport terminal can be offered. Visualization on the mobile device also makes it easier for the user to find his way around.

In a further special embodiment, friends or buddies, for example, can be visualized in the three-dimensional model. This is particularly interesting if the user himself is in the field of view of the smart cameras, because he then sees directly on his mobile device who is in his vicinity or where his friends currently are. This can be used, for example, in contact services for singles: once a match of common preferences or the like has been established, the position of the potential partner can be released in the network for the other party, so that both can see each other on their mobile terminals and optionally also be guided to each other by a routing function. This is possible, for example, in a nightclub or a hotel complex, but the range is not limited. In particular, when georeferenced visualization models are used, two persons can also be located in separate camera networks and still obtain information about one another, provided the camera networks are networked with one another.

In another specific implementation, more advanced queries can be implemented, such as "what happened?". An answer could be that a new person has joined or that a person is entering a safety-critical area of the airport. Another query may be "where?". Such a query can be answered by specifying a three-dimensional position, and systems based on this can be used, for example, to answer the question of where an abandoned suitcase is located in an airport. It is important for the visualization of the tracking results that the respective tracking position is no longer output in coordinates of the image plane of the respective camera but, using the calibration, in a global coordinate system, for example in a georeferenced world coordinate system (WCS). The determined tracking positions can thereby be localized on the earth.

It is not absolutely necessary to use so-called stereo cameras, which capture a certain angle of view spatially and can thereby output the three-dimensional position of a person. Alternatively, an average person height can be assumed; from the height of the person in camera pixels and the camera calibration, an approximate distance to the camera can then be calculated. If the fields of view of several cameras overlap, a distance measurement to the camera or cameras is possible even without the assumption of an average person height. In this way, the two-dimensional image plane of a smart camera can be extended to a world coordinate system.
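
The distance estimate from an assumed person height can be sketched with the pinhole camera model, in which an object of height H at distance Z appears with a pixel height of f · H / Z for a focal length f given in pixels. The function name, the default height of 1.75 m and the example numbers are assumptions for illustration; the sketch also assumes an upright person and a roughly horizontal optical axis.

```python
def distance_from_height(pixel_height, focal_length_px, person_height_m=1.75):
    """Approximate camera-to-person distance in metres (pinhole model).

    focal_length_px comes from the camera calibration; person_height_m is
    the assumed average person height, not a measured value.
    """
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_length_px * person_height_m / pixel_height


# A person imaged 200 pixels tall by a camera with f = 800 pixels
print(distance_from_height(200, 800.0))  # 7.0
```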

As a three-dimensional model for the visualization of the tracking results, an Internet-based world-wide representation in which georeferenced content can be embedded can be used, for example. An example of this is the "Google Earth" visualization accessible via the Internet, in which, for example, three-dimensional models of buildings can be embedded. Such a world-wide representation can also be used to visualize the tracking results of the decentralized smart camera network. The positions of people in this representation are indicated by green dots, where the extent of a dot indicates the confidence that a person is located at the position shown. Textured models of the respective person can, however, also be used for visualization.

One possibility for simplification arises from the fact that, when the camera is permanently mounted, a background model can be acquired in which the recorded scene is represented without moving objects, for example without persons. The smart camera builds a background model from this scene, for example by forming a running average over several temporally successive images in order to eliminate noise. Alternatively, the background model may be calculated using thresholds on the temporal change. In this way, the smart camera has a background model available, so that during operation a segmentation can be realized by difference methods and optionally additionally by known erosion and dilation methods. This segmentation contains exactly all moving objects and can be used as a region of interest (ROI) for the tracking process. Only in these segmented areas can a person to be tracked be located. This segmented area, which is potentially non-contiguous, is a superset of the area of the actual tracking, since several people can be in the picture at the same time. In this way, the computational load required in the smart camera can be reduced, because after the segmentation only those areas are processed further in which a person to be tracked can be located at all.
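
A minimal sketch of such a background model and segmentation, assuming grayscale frames stored as NumPy float arrays. The update rate, the difference threshold and the 3x3 structuring element are hypothetical choices, and the function names are illustrative only.

```python
import numpy as np


def update_background(background, frame, rate=0.05):
    """Running average over successive frames to suppress noise."""
    return (1.0 - rate) * background + rate * frame


def _neighbourhoods(mask):
    """Yield the nine 3x3-shifted views of a boolean mask (zero padded)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    rows, cols = mask.shape
    for dy in range(3):
        for dx in range(3):
            yield padded[dy:dy + rows, dx:dx + cols]


def erode(mask):
    """A pixel survives only if its whole 3x3 neighbourhood is set."""
    out = np.ones_like(mask)
    for view in _neighbourhoods(mask):
        out &= view
    return out


def dilate(mask):
    """A pixel is set if any pixel of its 3x3 neighbourhood is set."""
    out = np.zeros_like(mask)
    for view in _neighbourhoods(mask):
        out |= view
    return out


def segment(background, frame, threshold=25.0):
    """Difference formation followed by erosion and dilation."""
    moving = np.abs(frame - background) > threshold
    return dilate(erode(moving))


background = np.zeros((10, 10))
frame = background.copy()
frame[3:7, 3:7] = 100.0      # a moving object
frame[0, 9] = 100.0          # a single noisy pixel
roi_mask = segment(background, frame)
print(int(roi_mask.sum()))   # 16: the object remains, the noise pixel is gone
```

The resulting mask can then serve as the region of interest within which support points are evaluated.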

The segmentation method described also enables automatic initialization on movement. This can simplify the tracking of multiple objects or multiple people. The initialization responds to motion relative to the background model. In order to be able to track new objects very quickly, additional support points can preferably be placed at positions in the image where people can enter or leave the field of view. Incidentally, this is not necessarily the edge of the picture. For example, if the camera is mounted in a corridor, the entrance area could also be in the center of the image. Such positions, at which additional support points are provided, can be specified in advance or set up adaptively, for example learned through sufficiently long training.

As already stated, the visualization takes place in a three-dimensional and preferably georeferenced visualization model. The smart cameras continue to work in their respective image plane and a conversion into world coordinates is then carried out taking into account a camera calibration. As already stated, several cameras can be used together to determine the position of a person or an object in the room by means of known stereo methods.

In object tracking, two different approaches can be chosen according to the invention.

On the one hand, so-called decentralized tracking can be performed, in which each smart camera runs its own particle filter. If there is a moving object in the field of view of a smart camera, a particle filter runs for this object. If two moving objects move within the field of view of the smart camera, two particle filters are set up accordingly. The integration of the tracking results into a uniform three-dimensional model then takes place only at the level of the tracking results. First, the tracking results of all cameras are entered into the three-dimensional model. Technically, this is done by transmitting the tracking results over the network, in particular to the visualization unit 94, where they are then visualized. In the simplest case, the tracking results can be handed over between the smart cameras in such a way that, if two cameras provide very similar coordinates in the three-dimensional model, these two results are unified into one moving object.
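
The simplest case of unifying results from several cameras can be sketched as a greedy merge of positions in world coordinates. The distance threshold of 0.5 m, the averaging of merged positions and the function name are assumptions of this sketch; the text only requires that very similar coordinates be treated as one moving object.

```python
import math


def fuse_tracks(positions, max_distance=0.5):
    """Greedily merge world-coordinate positions closer than max_distance.

    positions: list of (x, y) tuples in metres reported by different cameras.
    Returns one averaged position per presumed moving object.
    """
    clusters = []  # each cluster collects the reports for one object
    for pos in positions:
        for cluster in clusters:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            if math.hypot(pos[0] - cx, pos[1] - cy) <= max_distance:
                cluster.append(pos)
                break
        else:
            clusters.append([pos])  # no nearby cluster: a new object
    return [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]


# Two cameras see the same person near (1.1, 2.0); a third result is elsewhere
reports = [(1.0, 2.0), (1.2, 2.0), (5.0, 5.0)]
print(len(fuse_tracks(reports)))  # 2
```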

Alternatively, so-called central tracking can be performed. Logically and algorithmically, only a single particle filter per moving person or moving object is then operated across all smart cameras. Here, a state X consists of the position of the person or of the object directly in world coordinates. This state X is held by the visualization unit 94, and each support point over this state X can be understood as a position hypothesis in world coordinates. Each smart camera then receives these coordinates from the visualization unit 94 in order to perform its own measurement. The joint processing of position hypothesis and measurement result is thereby already carried out at the measurement level, that is, in the smart camera itself. In this case, the visualization unit 94 takes on the tasks of a central processing unit.

When decentralized tracking is used, moving objects or persons located in the overlap area of the fields of view of two cameras are handed over from one camera to the next on the basis that both cameras provide a similar position of that person or object in world coordinates. With perfect calibration of the two cameras, the position of one and the same moving object would of course be exactly the same. The two tracking results of the two cameras can then be linked to one person. Additional certainty can be achieved by comparing the respective appearance of the object or person in both cameras, to ensure that the right person is assigned. A handover may also be delayed until a moment at which no other person or other moving object happens to be located next to the person to be handed over.

In the case of central tracking, the tracking results are in any case decoupled from the respective smart cameras. A person thus simply leaves the image plane of a first camera and enters the image plane of a second camera; the handover is implicit, since the calculation is carried out directly in world coordinates.

The calibration of the cameras in global, in particular georeferenced, coordinates can be carried out using standard methods, but a so-called analysis-by-synthesis approach can also be used. For this purpose, the three-dimensional visualization model is used as the calibration object and the camera parameters are iteratively changed until selected points of the image plane of the camera coincide with the corresponding points of the three-dimensional visualization model, ie until the real camera view coincides optimally with the view of the visualization model. Alternatively, a smart camera can also be provided with one or more angle sensors in order to obtain information about the respective viewing direction of the camera. The position of the camera can also be determined by known surveying techniques relative to the environment, since the environment exists as a 3D model, so that the position relative to this model is known.
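
The analysis-by-synthesis idea of changing camera parameters until projected model points coincide with image points can be sketched in a strongly simplified form. The sketch below only estimates three intrinsic parameters (focal length and principal point) of a pinhole camera and assumes the model points are already given in camera coordinates; a real calibration would also estimate the camera pose. The coordinate-wise search with step halving is a technique chosen for this illustration, not one named in the text, and all function names are hypothetical.

```python
def project(point, params):
    """Pinhole projection of a 3D point given in camera coordinates."""
    f, cx, cy = params
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def reprojection_error(params, model_points, image_points):
    """Sum of squared pixel distances between synthesis and observation."""
    err = 0.0
    for p3, (u, v) in zip(model_points, image_points):
        pu, pv = project(p3, params)
        err += (pu - u) ** 2 + (pv - v) ** 2
    return err


def calibrate(model_points, image_points, start, step=64.0, tol=1e-4):
    """Change one parameter at a time; halve the step when nothing improves."""
    params = list(start)
    best = reprojection_error(params, model_points, image_points)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                err = reprojection_error(trial, model_points, image_points)
                if err < best:
                    params, best, improved = trial, err, True
        if not improved:
            step /= 2.0
    return params


# Synthetic example: image points generated with known parameters
true_params = (800.0, 320.0, 240.0)
model_points = [(-1.0, -1.0, 4.0), (1.0, -1.0, 4.0), (1.0, 1.0, 5.0),
                (-1.0, 1.0, 5.0), (0.0, 0.0, 6.0)]
image_points = [project(p, true_params) for p in model_points]
estimate = calibrate(model_points, image_points, start=(500.0, 300.0, 200.0))
```

In this synthetic setting the estimate approaches the parameters used to generate the image points; with real images, the selected points would come from correspondences between the camera view and the visualization model.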

In the following, alternative embodiments of the invention are described, which relate to the way in which the monitoring results are determined.

With respect to the time scale used, the tracking of a moving object or a person has so far been carried out on only one time scale, namely the scale given by the frame rate of the image sensor of the smart camera. In order to increase the robustness of the tracking, it is now optionally provided to carry out the tracking simultaneously on different time scales λ. Here, the time scale λ indicates the duration until the next evaluation of a current sensor image, specified in units of sensor frames. According to the method described so far, the particle filter for tracking a moving object or a person always runs completely for each sensor image, so that λ = 1. This means that changes in the current sensor image always have an immediate effect on the particle filter and thus on the tracking result.

A new sensor image fundamentally affects the weight of a support point relative to other support points and, where adaptive methods are provided, the adaptation as well. Thus, if an object does not behave as assumed in the motion model, even if only temporarily, this has an immediate effect at a time scale of λ = 1, even if the object still behaves approximately according to the motion model on average over time.

For example, if a person walks behind an object for a short time and is thus occluded from the camera's point of view, support points that have actually tracked the person so far are immediately penalized, i.e., weighted less heavily, because they score badly in the measurement step and have not directly proven to be successful. If the person then reappears from behind the object and is thus visible again, these support points must first be confirmed again. This does not always work in the desired robust manner because, due to the reduced weighting of the support points in the previous measurement step, there are no longer as many support points in the immediate vicinity of the person behind the object. If, in addition, the appearance is adapted, there is also the danger that the object occluding the person is adopted as the appearance of the person. Although this can be prevented by means of a confidence-dependent adaptation, the quality of the monitoring result nevertheless suffers from these effects.

On a higher time scale with λ > 1, however, such an occlusion disappears, since a higher time scale behaves like a temporal low-pass filter.

According to the invention, it is therefore provided to track each object or person to be tracked on different time scales at the same time. The object to be tracked can thus be considered via the full probability density function over time. Just as the state of the object to be tracked is covered by support points, the time scale can also be covered by support points. Alternatively, as shown in the illustration of FIG. 8, a plurality of time scales run in parallel, namely λ = 1, 2, 4, 8, 16, ..., in order to cover the image space acquired by the camera sensor over all time scales.

If, for example, the appearance were strongly adapted during the measurement step without the use of different time scales, the tracking method would assume that the person had "turned" extremely quickly into the occluding obstacle. On a higher time scale, for example λ = 2, the original appearance of the person is still retained. When the person re-enters the field of view of the camera sensor, this appearance retained on the higher time scale would then be favored, for example by a weighting between the time scales; with simultaneous use of multiple time scales, such a weighting can be obtained, for example, from a simple comparison of the appearance on the different time scales. This leads to very robust results when objects to be tracked are temporarily occluded by obstacles. The basis for the application of different time scales is the assumption that an object to be tracked either behaves approximately according to the motion model and may change its appearance at different speeds or, analogously, behaves according to the appearance model and deviates from the motion model, but that both do not happen at the same time. Both alternative assumptions are monitored and tracked by means of the time scales, and the correct one then crystallizes out. The so-called Markov assumption states that the current state depends only on the preceding state. The use of different time scales likewise requires only the last state for time scales with λ > 1 and therefore satisfies the Markov assumption, even if the last state lies further in the past than on the time scale with λ = 1.

Technically, a time scale with λ > 1 is realized in that the time-consuming measurement step is omitted in an iteration in which no new sensor image is to be processed. Instead, the object is merely predicted according to the motion model and, optionally, the appearance model. Since for a given time scale it is known in advance when a measurement is to take place, and since all iterations that contain no measurement are deterministic, the motion model and the optional appearance model can be applied for these iterations in a single step for reasons of efficiency. In the illustration of FIG. 8, the iterations that contain no measurement can be recognized by the fact that no vertical line is drawn at these iterations in the respective time scale. With the scheme described above, the computational effort for the use of multiple time scales is almost twice as high as without this extension. With regard to the already explained possibility of segmenting the image into static background areas and areas in which potentially moving objects can appear, the use of multiple time scales can also serve as a check for the occlusion of objects to be detected. However, multiple time scales can also be used with moving cameras, where segmentation is not directly applicable. In addition, if a tracked person is occluded not by a static obstacle but by another person, the use of multiple time scales can also help when segmentation methods are in place, since these only separate moving objects from the background but do not distinguish between moving objects or people.
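
The measurement schedule for time scales λ = 1, 2, 4, 8, 16 and the resulting effort can be sketched as follows. The assumption that a filter on time scale λ measures at every frame whose index is divisible by λ, the constant-velocity motion model and all names are illustrative choices of this sketch; a real motion model may also contain a noise term.

```python
def measurement_frames(scale, num_frames):
    """Frames at which a filter on time scale `scale` performs a measurement."""
    return [t for t in range(num_frames) if t % scale == 0]


def relative_cost(scales, num_frames):
    """Measurements over all time scales relative to time scale 1 alone."""
    total = sum(len(measurement_frames(s, num_frames)) for s in scales)
    return total / num_frames


def predict(position, velocity, frames):
    """Constant-velocity prediction applied once for several frames."""
    return position + velocity * frames


scales = [1, 2, 4, 8, 16]
print(measurement_frames(4, 16))   # [0, 4, 8, 12]
print(relative_cost(scales, 16))   # 1.9375
print(predict(10.0, 1.5, 4))       # 16.0
```

Because the number of measurements halves from one time scale to the next, the total approaches but never reaches twice the effort of the single time scale λ = 1.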

In the following, the possibility according to the invention of adapting the appearance of an object or person to be tracked will now be discussed. In tracking a person, not just a single state X is tracked by means of the particle filter already discussed, but an entire probability density function over that state X, approximated by support points. It is possible to proceed analogously for the appearance of an object to be tracked. Normally, the appearance of the target object in the particle filter is common to all support points and also fixed. A limited adaptation can be carried out by means of so-called α-blending, but here too only exactly one appearance of the target object is available at any time. In addition to different hypotheses about the current state X of the target object, several appearances A of the target object are now also to be tracked simultaneously. Moreover, these two aspects are to be tracked over several time scales λ. The goal is therefore to adapt the appearance while pursuing several appearances simultaneously. For this purpose, the appearance is defined as part of the state, according to X_new = (X, A), i.e., the new state is composed of the previous state X and the appearance A. The particle filter method already described does not need to be changed. Analogously to the motion model, there is an appearance model that predicts a new appearance from the old one.

There are several ways to implement this appearance model. The aim is a parameterization of particularly low dimension, since the complexity of a particle filter, which is determined by the number of support points, grows exponentially with the number of degrees of freedom, so that it quickly becomes very inefficient. A low-dimensional parameterization can be achieved, for example, with an analytical appearance model, which uses an analytical model of the whole distribution instead of sampling the appearances directly with their own support points. There are two options for this:

1. Use of a parametric model that is learned from training data by means of statistical methods. In the case of monitoring tasks, however, this is only possible if the objects or persons to be tracked can be trained in advance.

2. Use of an analytical model to avoid a support-point-based approach. For this purpose, for example, a so-called running average of the last appearances or, preferably, so-called α-blending of the last appearance and the current one can be used.

In the context of the invention, the tracking of persons and objects can also be contour-based. The methods described so far are based primarily on the color information of objects to be tracked. Contour-based tracking methods can also be realized with the invention; the basic structure of the method and the structure of the smart cameras already described remain unaffected. To implement a contour-based tracking method, each support point X now uniquely describes a contour, for example by the control points of a spline. For this purpose, a spline is generated in image coordinates and superimposed on the sensor image. The difference between this contour estimate and the current sensor image is then calculated. For example, as shown in FIG. 9, points are considered, in particular at regular intervals along the contour, at which the distance to the nearest edge in the sensor image is calculated perpendicular to the contour. These perpendicular lines drawn along the contour in FIG. 9 have a definable maximum length up to which an edge is sought. If no edge has been found up to this maximum length, the maximum length is taken as the distance, which bounds the difference from above and limits the search range. The sum or the squared sum of these differences is inserted into the Gaussian function used previously and thus leads to a one-dimensional difference value for this support point.
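
The contour measurement for one support point can be sketched as follows. The sketch assumes that edge detection has already produced a set of integer edge pixel coordinates and that the contour is given as sample points with unit normals; searching in both directions along the normal, the maximum length of 10 pixels and the value of `sigma` are assumptions for illustration.

```python
import math


def normal_distance(point, normal, edges, max_len):
    """Distance from a contour point to the nearest edge along its normal.

    edges is a set of integer (x, y) pixel coordinates. The search is capped
    at max_len; if no edge is found, max_len itself is returned.
    """
    for d in range(max_len + 1):
        for sign in (1, -1):
            x = round(point[0] + sign * d * normal[0])
            y = round(point[1] + sign * d * normal[1])
            if (x, y) in edges:
                return float(d)
    return float(max_len)


def contour_weight(points, normals, edges, max_len=10, sigma=5.0):
    """Gaussian weight of one support point from the squared distance sum."""
    sq_sum = sum(
        normal_distance(p, n, edges, max_len) ** 2
        for p, n in zip(points, normals)
    )
    return math.exp(-sq_sum / (2.0 * sigma ** 2))


edges = {(x, 5) for x in range(10)}       # a horizontal edge at y = 5
normals = [(0, 1)] * 3                    # contour normals point along y
aligned = [(2, 5), (4, 5), (6, 5)]        # contour hypothesis on the edge
shifted = [(2, 3), (4, 3), (6, 3)]        # hypothesis two pixels off
print(contour_weight(aligned, normals, edges))   # 1.0
print(contour_weight(shifted, normals, edges) < 1.0)  # True
```

Only the pixels along the searched normals are read, which is why the region of interest can be restricted to these lines.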

In the context of the invention, the region of interest (ROI) can consist solely of the superposition of these perpendicular lines, so that only this superposition has to be transmitted by the smart camera or the sensor. For all support points together, therefore, only the superposition of all these perpendicular lines has to be requested from the sensor of the smart camera. FIG. 9 shows, at the top left, the contour resulting from a support point X and the points spaced along this contour. At the top right, the perpendicular lines are drawn at all of these points. At the bottom left, the contour can be seen together with the perpendicular lines, and at the bottom right only the perpendicular lines are shown, which are ultimately requested from the sensor as the ROI.

Instead of a contour, an Active Appearance Model (AAM) can also be used, as is known in the art.

The contour-based methods can also be combined with the histogram-based methods. A support point X then consists of the concatenation of both state variables. When the weight of each support point is calculated in the measurement step, the results of the contour measurement and of the histogram-based measurement described above are summed in a weighted manner. The weighting can be adjusted.

In addition to the position of the object, the state X can also include its velocity in terms of direction and magnitude, and possibly also the angular orientation of the object. In the case of contour-based tracking, the state additionally contains the encoding of the contour, for example the control points of a spline, as described.

FIG. 10 shows, by way of example, a visualization of the monitoring result in the form of the probability density function of a person over time t. Such a visualization is generated by volume rendering methods and traces the trajectory of a tracked person, with different gray levels or color codes representing the probability of presence along the path.

One application of the invention is, for example, the detection of abandoned suitcases in railway stations or airports. Fixed cameras are assumed here and, as already described, multiple time scales are used. The aim is to recognize objects that have been newly added on a medium time scale. As with a band-pass filter, objects that change too quickly, such as people walking around or image noise, are filtered out. Likewise, frequencies that are too low are filtered out, that is, the background and sufficiently slow changes in the background.
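The band-pass behavior described above can be illustrated with two exponential running averages per pixel, one adapting quickly and one adapting slowly. This is a hedged sketch only: the class name, the two adaptation rates and the threshold are assumptions chosen for the example and are not taken from the invention.

```python
import numpy as np

class TemporalBandPass:
    """Flags pixels that changed and then stayed changed (for example a left suitcase)."""

    def __init__(self, first_frame, alpha_fast=0.1, alpha_slow=0.001, threshold=50.0):
        frame = np.asarray(first_frame, dtype=float)
        self.fast = frame.copy()   # follows changes within a few frames
        self.slow = frame.copy()   # follows only very slow changes (background)
        self.alpha_fast = alpha_fast
        self.alpha_slow = alpha_slow
        self.threshold = threshold

    def update(self, frame):
        frame = np.asarray(frame, dtype=float)
        self.fast += self.alpha_fast * (frame - self.fast)
        self.slow += self.alpha_slow * (frame - self.slow)
        # Short events barely move the fast average, and very old changes are
        # absorbed by the slow average, so only medium-term changes remain.
        return np.abs(self.fast - self.slow) > self.threshold
```

With these illustrative values, a pixel that is covered for only a few frames does not reach the threshold, while a pixel that stays covered is flagged until the slow average has absorbed it into the background.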

The detection of abandoned suitcases in an airport can be combined in a particularly advantageous manner with the tracking of persons, since it is of particular interest to track the person who set the suitcase down, both before and after doing so. For this purpose, the system can track all recognizable persons in the field of view of the cameras. These persons do not necessarily have to be displayed to the user. If one of the tracked persons sets down a suitcase, the system can promptly present this to the user by showing the suitcase and the path of the associated person who potentially left it. Both the path before and the path after the suitcase was set down are then shown, since all persons in the field of view were tracked as a precaution. The user can thus be shown only the important information without being flooded with information of no interest to the application. The user can immediately answer the "what?" question, namely an abandoned suitcase, and clearly follow the "where?" question in the three-dimensional visualization model. The security staff at the airport can view this visualization, embedded in a three-dimensional model, on a mobile visualization client and, since they are also tracked by the system and thus localized, a route to the target person or the suitcase can be calculated for them. This route planning is updated continuously, since the movement of the tracked target person is incorporated in real time.

Further aspects and features of the invention will become apparent from the following scientific paper, which also describes examples that have been implemented.

Intelligent camera for tracking objects in real time

Sven Fleck

WSI / GRIS, University of Tübingen Sand 14, 72076 Tübingen, Germany

Phone: + (49) 7071 2970435, Fax: + (49) 7071 295466, email: fleck@gris.uni-tuebingen.de web: www.gris.uni-tuebingen.de

Overview

Today, object tracking applications using distributed sensor networks are becoming increasingly popular, both in surveillance technology (airports, train stations, museums, public facilities) and in industrial image processing (visual control and factory automation). Traditional, centralized approaches have several disadvantages, such as limited transmission bandwidth, high computation time requirements and thus limited spatial resolutions and frame rates of the cameras used.

This article presents a network-ready intelligent camera ("smart camera") for probabilistic tracking of objects. It is capable of tracking objects in real time and demonstrates an approach that is very sparing with transmission bandwidth, since the camera only has to transmit the results of the tracking, which are at a higher level of abstraction.

1. Introduction

In today's image processing systems, cameras are typically regarded only as simple sensors. Data processing takes place only after the complete raw video stream has been transmitted over an expensive and often distance-limited link to a central processing unit (e.g., a personal computer). From the author's point of view, however, it seems more sensible to carry out the processing physically in the camera itself: what belongs algorithmically to the camera should also be computed physically in the camera. The idea, then, is to process the information where it arises, directly at the sensor, and to transmit only the results, which are at a higher level of abstraction. This is in line with the increasing trend toward self-contained and network-enabled cameras.

In the following, a prototype of a network-capable intelligent camera for probabilistic object tracking in real time is presented for the first time. Object tracking plays a central role in many applications, in particular in robotics (for example RoboCup robot soccer), surveillance technology (person tracking), human-machine interfaces, motion capture, augmented reality and 3D television.

Particle filters have become established as an important approach to object tracking [1, 2, 3]. The visual modalities used include shape [3], color [4, 5, 6, 7], or a combination of modalities [8, 9]. The particle filtering procedure is described in Section 2. Here, an approach based on color histograms is used, which has been specially adapted to the requirements of an implementation embedded in the camera. The architecture of the smart camera is described in Section 3, followed by a discussion of various advantages of the proposed approach. Experimental results of this approach are illustrated in Section 4, followed by a summary.

2. Particle filter

Particle filters can handle multiple simultaneous hypotheses and nonlinear systems. Following the notation of Isard and Blake [3], $Z_t$ denotes all measurements $\{z_1, \ldots, z_t\}$ up to time $t$, and $X_t$ denotes the state vector of dimension $k$ at time $t$ (position, velocity etc. of the target object). Particle filters are based on Bayes' theorem in order to calculate the a posteriori probability density function (pdf) at each time step using all available information:

$$p(X_t \mid Z_t) = \frac{p(z_t \mid X_t)\, p(X_t \mid Z_{t-1})}{p(z_t)} \tag{1}$$

This equation is evaluated recursively as described below. The idea of the particle filter is to approximate the probability density function by a set of weighted support points ("samples"), each consisting of a state $X$ and a weight $\pi$. The filter loop shown in Figure 1 consists of the following steps:

Figure 1. Particle filter loop

• Selection step: First, the cumulative histogram is calculated over the weights of all support points. Then the number of descendants of each support point is determined from its weight $\pi_{t-1}^{(i)}$, that is, from its relative share of the cumulative histogram.

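• Prediction step: In the prediction step, the new state $X_t$ is calculated:

$$p(X_t \mid Z_{t-1}) = \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{t-1})\, dX_{t-1} \tag{2}$$

Different motion models are conceivable for the implementation of $p(X_t \mid X_{t-1})$. Here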

• Measurement step: In the measurement step, the new state $X_t$ is weighted as a function of the new measurement $z_t$ (i.e. depending on the new camera sensor image):

$$p(X_t \mid Z_t) \propto p(z_t \mid X_t)\, p(X_t \mid Z_{t-1}) \tag{3}$$

The measurement step (3) complements the prediction step (2); together they implement Bayes' theorem (1).
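The three steps of the loop can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the camera's implementation: the function names are hypothetical, the motion model is reduced to a pure random-walk diffusion, and the measurement likelihood is passed in as an arbitrary function.

```python
import numpy as np

def select(states, weights, rng):
    """Selection: draw descendants in proportion to the weights via the cumulative sum."""
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0  # guard against rounding so that every draw falls inside the table
    picks = np.searchsorted(cdf, rng.random(len(weights)))
    return states[picks]

def predict(states, diffusion_std, rng):
    """Prediction: assumed random-walk motion model (diffusion only)."""
    return states + rng.normal(0.0, diffusion_std, size=states.shape)

def measure(states, likelihood):
    """Measurement: weight each state by the likelihood of the new measurement."""
    weights = np.array([likelihood(x) for x in states], dtype=float)
    total = weights.sum()
    if total <= 0.0:  # degenerate case: fall back to uniform weights
        return np.full(len(states), 1.0 / len(states))
    return weights / total

def particle_filter_step(states, weights, likelihood, diffusion_std, rng):
    states = select(states, weights, rng)
    states = predict(states, diffusion_std, rng)
    weights = measure(states, likelihood)
    return states, weights
```

Normalizing the weights in `measure` takes the place of the constant denominator in (1), which is why the measurement step only needs the likelihood up to a constant factor.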

2.1 Particle filter based on color histograms

The measurement step in the context of color distributions

As already mentioned, a particle filtering method that operates on color histograms is described here. This enables rotation-invariant object tracking and provides robustness against partial occlusions and deformations of the target object. Instead of working in the standard RGB color space, an HSV color model is used here: a 2D hue-saturation histogram in conjunction with a 1D value histogram serves as the space in which the appearance of the target is represented. This leads to the following specializations of the abstract measurement step described above.

From the image region ("region of interest", ROI) to the histogram

Each support point $s_t^{(i)}$ induces a region of interest (ROI) $P_t^{(i)}$ around its local position in image space. The size of the image region $(H_x, H_y)$ is user-defined. In order to further increase the robustness of the color distributions in the case of occlusions, or if background pixels are included in the image region, the pixels are weighted depending on their distance to the center of the image region. The following weighting function is used here:

$$k(r) = \begin{cases} 1 - r^2 & r < 1 \\ 0 & \text{otherwise} \end{cases}$$

where $r$ denotes the distance to the center of the image region. With this kernel, the following color distribution is obtained for the support point $s_t^{(i)}$:

$$\mathrm{Histo}_{X_t^{(i)}}(b) = f \sum_{w \in P_t^{(i)}} k\!\left(\frac{\lVert \tilde{X}_t^{(i)} - w \rVert}{a}\right) \delta[I(w) - b]$$

with bin number $b$, pixel position $w$ within the image region (ROI), bandwidth $a = \sqrt{H_x^2 + H_y^2}$ and normalization $f$, where $\tilde{X}_t^{(i)}$ denotes the part of the state $X_t^{(i)}$ that describes the position $(x, y)$ in the image. The $\delta$ function ensures that each summand is assigned to the associated bin, which is determined by its image intensity $I$, where $I$ is to be understood once in HS space and once in V space. The representation of the target object is calculated in exactly the same way, so that it can now be compared with the histogram of each support point in histogram space.

From the histogram to the new weight π

Now the histogram of the target object is compared with the histogram of each support point. For this purpose, the Bhattacharyya similarity [4] is used here, separately in the HS histogram and in the V histogram:

$$\rho[p^{(i)}, q] = \sum_b \sqrt{p^{(i)}(b)\, q(b)}$$

where $p^{(i)}$ and $q$ denote the histograms of the support point and of the target object, respectively (each in HS and in V histogram space). Thus, the more the image region associated with a support point resembles the target, the larger $\rho$. The two similarity values $\rho_{HS}$ and $\rho_V$ are then weighted by means of alpha blending and thus combined into a single similarity value. The number of bins is variable, as is the weighting factor of the alpha blending. The experiments were performed with 10 x 10 + 10 = 110 bins and a weighting of 70:30 HS:V (i.e. between $\rho_{HS}$ and $\rho_V$). As a last step, a Gaussian distribution with user-definable variance σ is applied in order to obtain the new weight $\pi_t^{(i)}$ of the support point $s_t^{(i)}$.

A small Bhattacharyya distance thus leads to a high weight, so that the associated support point is more likely to be selected in the next iteration.
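The path from image region to weight can be sketched as follows. This is an illustrative sketch, not the camera's code. Several points are assumptions: HSV values are taken to lie in [0, 1], the ROI half-extents are used for $H_x$ and $H_y$, all function names are hypothetical, and the Gaussian is written in the unnormalized form exp(-(1 - ρ) / (2σ²)) that is common in color-based particle filters such as [6], since the exact expression is not reproduced in this text.

```python
import numpy as np

def kernel_weights(height, width):
    """k(r) = 1 - r^2 for r < 1, else 0, with r the scaled distance to the ROI center."""
    hy, hx = height / 2.0, width / 2.0      # assumption: half-extents of the ROI
    a = np.sqrt(hx ** 2 + hy ** 2)          # bandwidth
    ys, xs = np.mgrid[0:height, 0:width]
    r = np.sqrt((xs + 0.5 - hx) ** 2 + (ys + 0.5 - hy) ** 2) / a
    return np.where(r < 1.0, 1.0 - r ** 2, 0.0)

def roi_histograms(hsv_roi, bins=10):
    """Kernel-weighted 2D hue-saturation histogram and 1D value histogram."""
    k = kernel_weights(*hsv_roi.shape[:2]).ravel()
    h, s, v = (hsv_roi[..., c].ravel() for c in range(3))
    hs, _, _ = np.histogram2d(h, s, bins=bins, range=[[0, 1], [0, 1]], weights=k)
    vv, _ = np.histogram(v, bins=bins, range=(0, 1), weights=k)
    return hs / hs.sum(), vv / vv.sum()     # each histogram sums to 1

def bhattacharyya(p, q):
    """Bhattacharyya similarity of two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

def support_point_weight(roi_hists, target_hists, hs_share=0.7, sigma=0.2):
    rho_hs = bhattacharyya(roi_hists[0], target_hists[0])
    rho_v = bhattacharyya(roi_hists[1], target_hists[1])
    rho = hs_share * rho_hs + (1.0 - hs_share) * rho_v   # alpha blending, 70:30
    return float(np.exp(-(1.0 - rho) / (2.0 * sigma ** 2)))
```

The default of 10 bins per channel gives the 10 x 10 + 10 = 110 bins mentioned above. The value chosen for `sigma` is arbitrary and only serves the example.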

3. Smart Camera System

3.1 Hardware description

To demonstrate the prototype, a mVBlueLYNX 420CX camera from Matrix Vision [10] is used as a basis, as shown in Fig. 2. The camera contains a sensor, an FPGA, a processor and an Ethernet network interface.

Figure 2. The Smart Camera System

More specifically, it incorporates a progressive-scan color CCD sensor with a Bayer color mosaic. A Xilinx Spartan II FPGA is used for low-level processing. The camera also includes a 200 MHz Motorola PowerPC processor with MMU and FPU, running embedded Linux, which is connected to 32 MB of SDRAM and 36 MB of flash memory. In addition, the camera has a 100 Mbps Ethernet interface, which serves on the one hand for updating in the field ("field upgradability") and on the other hand for transmitting the results of the object tracking to the outside. Several inputs and outputs are also available for direct connection to industrial controllers. There is also an analog video output and two serial ports, to which a monitor and a mouse can be connected for debugging and target initialization purposes. The camera is not only intended as a prototype under laboratory conditions; it was also developed to cope with harsh industrial environments.

3.2 Camera Tracking Architecture

Fig. 3 shows the architecture of the smart camera.

Figure 3. Smart Camera Architecture

Output of the smart camera

In each iteration, the following is output:

• The probability density function (pdf), approximated by the set of support points $S_t = \{(X_t^{(i)}, \pi_t^{(i)}),\ i = 1..N\}$. This amounts to $N \cdot (k + 1)$ values.

• The mean estimate state $E[S_t] = \sum_{i=1}^{N} \pi_t^{(i)} X_t^{(i)}$, and thus one value.

• The maximum-likelihood state, i.e. the state of the support point with the largest weight, together with its confidence $\pi$, and thus two values.
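The three outputs listed above can be derived from the support point set as in the following sketch. The function name and the dictionary layout are hypothetical and only serve the illustration. The weights are assumed to be normalized to sum to 1.

```python
import numpy as np

def camera_outputs(states, weights):
    """Collect the per-iteration outputs from N support points of dimension k."""
    states = np.asarray(states, dtype=float)    # shape (N, k)
    weights = np.asarray(weights, dtype=float)  # shape (N,), assumed to sum to 1
    best = int(np.argmax(weights))
    return {
        "pdf": (states, weights),               # approximated pdf
        "mean_state": weights @ states,         # weighted mean of all states
        "ml_state": states[best],               # state with the largest weight
        "confidence": float(weights[best]),     # weight of that state
    }
```

A receiver that only needs a single position estimate, such as a machine controller, can read `mean_state` or `ml_state` and ignore the full support point set.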

Transmission

The output of the smart camera is transmitted via Ethernet using sockets. On the PC side, this data can then be visualized in real time and stored on data carriers for later evaluation.

3.3 Advantages

This smart camera approach offers many advantages:

• Low bandwidth requirements of the camera. The raw image data is processed directly in the camera. Thus, only the approximated probability density function (pdf) of the state of the target object has to be transmitted by the camera, which requires relatively few parameters. This allows the use of standard networks (e.g. Ethernet) with virtually unlimited range. The total data to be transferred adds up to (N * (k + 1) + 3) values per frame. If, for example, N = 100 support points are used and no velocity model is used (k = 2), 303 values per frame have to be transmitted. This is very little compared with transmitting all pixels of the raw image: for raw transfer at VGA resolution, even without Bayer mosaic color conversion, about 307,000 pixel values per frame are needed. Even at a (moderate) 15 fps, this requires a transfer rate of about 37 Mbps, which is about 1/3 of the standard bandwidth of 100 Mbps.

• No calculations outside the camera necessary. Network-capable external devices (PCs or machine control systems in automation technology) no longer have to deal with low-level data processing, which logically belongs to the camera. Instead, they can implement high-level applications based on the results of one or more such smart cameras. Mobile devices (PDAs, cell phones) can also be used; in a monitoring application, for example, they can display the object tracking output of all smart cameras via a wireless network connection.

In addition, it is possible to connect the smart camera directly to a machine controller (even one without dedicated resources for processing external data), such as a robot controller for "visual servoing". For this purpose, it is even sufficient to transmit only the mean estimate state or the maximum-likelihood state together with its confidence to the inputs of the controller in order to control the machine under real-time conditions.

• Higher resolution and frame rate of the camera. Since the raw video stream in the proposed approach is no longer limited by the bandwidth of the connection to the outside, sensors with higher spatial and temporal resolution can be used: because the processing unit is located directly at the sensor, a higher transmission speed is technically much easier to implement inside the camera than outside it. The conventional technique (camera plus external computer (PC)), in contrast, has the following disadvantages:

1. If the entire image is transferred to the PC in full resolution so that all processing can be done there, the bandwidth of today's network connections is quickly exceeded. This is even more the case with multi-camera systems, since the cameras have to share the network bandwidth. If, on the other hand, dedicated camera connections that offer higher bandwidths (such as CameraLink) are used, the distance to the camera is limited to a few meters (quite apart from the fact that the central host rules out a decentralized network).

2. If only the image regions (ROIs) that are of interest to the particle filtering process (i.e. those induced by the support points) are transmitted, the connection between camera and PC becomes part of the feedback loop of the object tracking process. Nondeterministic network effects may then cause the prediction made by the particle filter, which determines the states of the support points and hence the ROIs, to fall out of sync with the "real world", so that measurements are taken in the wrong places.

• Multi-camera systems. As a result of the above advantages, this approach scales optimally with the number of cameras. This is important for multi-camera systems that work together in a decentralized infrastructure, for example in airport surveillance.

• Self-contained system with small form factor. Embedding the process in the camera creates a self-contained system with a very compact form factor. The camera can thus be installed in places with limited space, for example directly on a robot hand.

• Parameterizability. The implementation allows the particle filter to be parameterized over a wide range. This includes the number of support points N, the size of the image region (ROI) $(H_x, H_y)$, the number of bins in the histogram (in H, S, V), the factor for the mixing ratio HS:V (between hue-saturation ($\rho_{HS}$) and value ($\rho_V$)), the variance vector for diffusion in the motion model, the variance for the Bhattacharyya weighting and the combination of the motion models.

• Advantages of the particle filtering process. A method based on a Kalman filter embedded in a smart camera would offer advantages similar to those mentioned above. However, such a method has several disadvantages, since it can handle only unimodal probability density functions (pdfs) and linear models. A particle filter method, on the other hand, efficiently approximates the (potentially arbitrarily shaped) probability density function (pdf) to be output by the camera by means of support points, so that only a moderately higher transmission bandwidth is required compared with a Kalman filter method. In return, the gain in robustness is immense.
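The bandwidth figures given in the first advantage above can be checked with a short calculation. The function names are hypothetical, and the 8 bits per raw pixel value are an assumption that is consistent with the stated rate of about 37 Mbps.

```python
def tracking_values_per_frame(n_points, state_dim):
    """N * (k + 1) values for the pdf plus 3 for mean state, ML state and confidence."""
    return n_points * (state_dim + 1) + 3

def raw_stream_mbps(width, height, fps, bits_per_pixel=8):
    """Data rate of the raw Bayer stream; 8 bits per pixel value is assumed."""
    return width * height * bits_per_pixel * fps / 1e6

values = tracking_values_per_frame(100, 2)   # N = 100, k = 2
raw_pixels = 640 * 480                       # VGA resolution
print(values, raw_pixels, raw_stream_mbps(640, 480, 15))
```

Under these assumptions the tracking output is roughly a factor of 1000 smaller per frame than the raw image (303 values versus 307,200 pixel values).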

4. Results

4.1 Experimental results

Some results are presented here. They are only a part of what is available in higher quality on the project website [11]. In the first experiment, the camera is initialized with a cube object. For this purpose, it is trained by presenting the object in front of the camera; it stores the associated color distribution as the reference of the target object. The tracking performance was very satisfactory: the camera can track the target object robustly over time at a frame rate of 15 fps and a sensor resolution of 640x480 pixels. To achieve greater computational efficiency, the process works directly on the raw pixels, which are still color-filtered by the Bayer mosaic: instead of first performing an expensive Bayer mosaic color conversion and then ultimately using only the histogram over it, which contains no spatial information, each four-pixel Bayer neighborhood is interpreted as one RGB pixel (the two green values are averaged). This results in QVGA resolution as the input to the object tracking method. The total bandwidth requirement of the camera is very moderate: only about 30 kB/s are needed (when using 100 support points).

In the first experiment, a cube is tracked. It is moved first vertically, then horizontally and then on a circular path. The approximated probability density function (pdf) output by the camera over time t is illustrated in Fig. 4, projected onto the x and y directions.
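The interpretation of each four-pixel Bayer neighborhood as one RGB pixel can be sketched as follows. The function name is hypothetical, and the RGGB arrangement of the mosaic (red at the top left, blue at the bottom right of each 2x2 cell) is an assumption; the actual layout of the sensor is not stated in this text.

```python
import numpy as np

def bayer_to_half_res_rgb(raw):
    """Turn a raw Bayer image (H, W) into an RGB image (H/2, W/2, 3).

    Assumes an RGGB mosaic and even image dimensions.
    """
    raw = np.asarray(raw, dtype=float)
    r = raw[0::2, 0::2]
    g = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])  # average the two green values
    b = raw[1::2, 1::2]
    return np.stack([r, g, b], axis=-1)
```

Applied to a 640x480 raw image, this yields a 320x240 (QVGA) RGB image without any interpolation, which is sufficient here because the histogram discards spatial information anyway.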

Figure 4. Probability density function pdf at the iteration time t. Left: x component, Right: y component.

Starting from this figure, Fig. 5 shows the circular motion within the cube sequence in detail. For this purpose, a snapshot of the current positions of the support points together with their weights is given at different times. It should be noted that the fact that the camera is mounted statically has not been exploited here, so the performance presented is achieved without any background segmentation as preprocessing.

In the second experiment, the behavior of the smart camera is examined in the context of surveillance applications: the smart camera is trained with the face of a person as the target object. It turns out that the face can also be tracked successfully in real time. Fig. 6 shows some results during operation.

Figure 5. Circular motion sequence of Experiment #1. Image (upper row) and approximated probability density function (pdf) (lower row) at iterations #100, 109, 113, 125, 129, 136, 141. Support points are shown in green; the expected value is marked as a yellow star.

Figure 6. Experiment #2: face tracking sequence. Image (upper row) and approximated probability density function (pdf) (lower row) at iterations #18, 35, 49, 58, 79.

5. Summary

This article presented a smart camera for real-time object tracking. By using particle filters on HSV color distributions, it provides robust tracking performance because it can handle multiple hypotheses simultaneously. Nevertheless, its bandwidth requirement is very low, since only the approximated probability density function has to be transmitted.

A further plan is to automatically adapt and track the appearance of the target object at runtime in order to further increase the robustness of object tracking under changing lighting, and to build a multi-camera system in order to demonstrate the advantages of communication between cameras at this higher level of abstraction (for example, as the basis for tracking people in a surveillance application).

Acknowledgments

We would like to thank Matrix Vision for their generous support and successful cooperation.

References

[1] A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. Springer Verlag, 2001.

[2] "Special issue on: Sequential state estimation: From Kalman filters to particle filters," Proceedings of the IEEE, vol. 92, no. 3, 2004.

[3] M. Isard and A. Blake, "Condensation - Conditional Density Propagation for Visual Tracking," 1998.

[4] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 05, pp. 564-575, 2003.

[5] K. Okuma, A. Taleghani, N. de Freitas, J.J. Little, and D.G. Lowe, "A boosted particle filter: multitarget detection and tracking," in ECCV 2004: 8th European Conference on Computer Vision, 2004.

[6] K. Nummiaro, E. Koller-Meier, and L. Van Gool, "A color based particle filter," 2002.

[7] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in European Conference on Computer Vision, ECCV 2002, LNCS 2350, Copenhagen, Denmark, June 2002, pp. 661-675.

[8] P. Pérez, J. Vermaak, and A. Blake, "Data fusion for visual tracking with particles," Proceedings of the IEEE, vol. 92, no. 3, pp. 495-513, 2004.

[9] M. Spengler and B. Schiele, "Towards Robust Multi-cue Integration for Visual Tracking," Lecture Notes in Computer Science, vol. 2095, p. 93ff., 2001.

[10] "Matrix vision," http://www.matrix-vision.com.

[11] "Project's Website," www.gris.uni-tuebingen.de/~sfleck/matrixtracking.

Claims


1. A camera for tracking objects, comprising an image sensor unit (12) for generating image data and a processing unit (14) for processing the image data transferred from the image sensor unit (12) to the processing unit (14), characterized in that the processing unit (14) has an ROI selection unit (20) for selecting image areas of interest for object tracking, and a tracking unit (21) for determining tracking data of objects to be tracked from the image data.

2. Camera according to claim 1, characterized in that the tracking data can be output at a signal output of the camera (10), wherein the tracking data have a significantly reduced amount of data compared with the amount of image data generated by the image sensor unit (12), in particular reduced by a factor of about 1000.

3. Camera according to claim 1 or 2, characterized in that the tracking data are provided in the form of an, in particular approximated, probability density function.

4. Camera according to claim 3, characterized in that the probability density function is approximated by a plurality of support points.

5. A camera according to claim 4, characterized in that parallel processing means for parallel processing of the support points of the probability density function and of data dependent thereon are provided in the processing unit (14).

6. A camera according to claim 3, 4 or 5, characterized in that the tracking unit (21) implements a so-called particle filter in which a probability density function is approximated by means of an approximation step, a prediction step and a measurement step.

7. Camera according to claim 6, characterized in that, in the prediction step, a new state vector ($X_t^{i}$) of an object to be tracked is determined for each support point (i) by means of old measurements ($Z_{t-1}$) and an old state vector, taking into account a stored movement model; in the measurement step, the new state vector ($X_t^{i}$) is weighted taking into account a new measurement ($Z_t$); and in the approximation step, the probability density function ($p(X_t \mid Z_t)$) resulting from all new state vectors ($X_t^{i}$) is approximated by support points.

8. Camera according to at least one of the preceding claims, characterized in that the tracking unit (21) passes tracking data of objects to be tracked, in particular a prediction comparison object, to the ROI selection unit (20) in order to select the image areas of interest for processing depending on the tracking data.

9. A camera according to claim 8, characterized in that the prediction comparison object is generated by means of a parametric model that is adaptively adaptable.

10. A camera according to at least one of the preceding claims, characterized in that in the processing unit (14) the image data of the image area selected by the ROI selection unit (20) are converted into a color histogram and the tracking unit (21) determines the tracking data based on the color histogram.

11. The camera according to claim 1, characterized in that the ROI selection unit (20) controls the image sensor unit (12) in dependence on the tracking data such that only image data from the image sensor unit (12) are transmitted to the processing unit (14) which correspond to the image areas selected by the ROI selection unit (20).

12. Camera according to at least one of the preceding claims, characterized in that the image sensor unit (12) and the processing unit (14) are integrated in a common housing.

13. Camera according to at least one of the preceding claims, characterized in that the processing unit (14) comprises a network unit (32).

14. Camera according to at least one of the preceding claims, characterized in that a control unit (36) and adjusting means are provided in order to change setting parameters of the camera (10), in particular alignment, image detail and magnification, in dependence on the tracking data.

15. A method for processing image data in a camera (10) for tracking objects, characterized by the following steps:

Transferring image data from an image sensor unit (12) to a processing unit (14) of the camera (10), Generating tracking data on objects to be tracked in the processing unit (14) using probabilistic methods and

16. The method according to claim 15, characterized in that the step of selecting areas of the image data includes driving the image sensor unit (12) so that only those image data are transmitted from the image sensor unit (12) to the processing unit (14) for which there is an increased likelihood that they contain information about objects to be tracked.

17. The method of claim 15 or 16, characterized in that the step of generating tracking data includes approximating a probability density function by means of a plurality of support points.

18. The method according to claim 15, wherein the step of generating tracking data includes generating image data of a comparison object based on a probability density function of the objects to be tracked and at least one parametric model of the objects to be tracked.

19. The method according to claim 18, characterized in that the step of generating tracking data includes a similarity measurement between the image data of the comparison object and the image data transmitted from the image sensor unit (12).

20. The method according to claim 18 or 19, characterized in that in the step of selecting areas of the image data only those image data are selected by the image sensor unit (12), which substantially correspond to the image detail of the comparison object.

21. The method according to claim 15, wherein the step of generating tracking data includes generating a color histogram based on the image data and evaluating the same.

22. The method according to at least one of the preceding claims 15 to 21, characterized by displaying the tracking data, in particular a probability density function of a tracked object, in a three-dimensional environment model.

23. The method according to claim 22, characterized in that the three-dimensional environment model is constructed in world coordinates, in particular georeferenced.

24. Multi-camera system with at least two cameras according to at least one of the preceding claims 1 to 14, characterized in that each camera (10a, 10b, 10c) has a network unit (32) and the at least two cameras (10a, 10b, 10c) communicate with each other via a network (18), in particular Ethernet or WLAN.

25. Multi-camera system according to claim 24, characterized in that the processing unit (14) of at least one of the cameras (10a, 10b, 10c) is designed to process tracking data of another camera (10a, 10b, 10c).

26. Multi-camera system according to claim 24 or 25, characterized in that a central processing unit for evaluating the tracking data transmitted by the at least two cameras (10a, 10b, 10c) is provided in the network.

27. Multi-camera system according to claim 26, characterized in that in the network, in particular in the central processing unit, a visualization unit is provided for displaying the tracking data in a three-dimensional environment model.