WO2020239210A1 - Method, apparatus and computer program for tracking of moving objects - Google Patents

Method, apparatus and computer program for tracking of moving objects

Info

Publication number
WO2020239210A1
Authority
WO
WIPO (PCT)
Prior art keywords
tracklet
data
inertial
inertial measurement
video
Prior art date
Application number
PCT/EP2019/063881
Other languages
French (fr)
Inventor
Roberto HENSCHEL
Timo VON MARCARD
PROF. DR. Bodo ROSENHAHN
Original Assignee
Gottfried Wilhelm Leibniz Universität Hannover
Priority date
Filing date
Publication date
Application filed by Gottfried Wilhelm Leibniz Universität Hannover
Priority to PCT/EP2019/063881
Priority to DE112019007390.7T
Publication of WO2020239210A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects


Abstract

The invention relates to a method for tracking of moving objects within a defined area, wherein the method comprising the steps of: - providing a recorded video sequence of said defined area, the video sequence has a plurality of video frames and a temporal length; - providing inertial data for at least one object, said inertial data for an object has been recorded by an inertial measurement unit arranged on and assigned to the corresponding object; - detecting at least one object in said video frames of said video sequence using an object detector unit; - generating a plurality of tracklets for said at least one detected object based on object detecting in said video sequence, each tracklet includes trajectory data of a trajectory of the corresponding detected object for a certain tracklet time period within the temporal length of said video sequence using a processing unit; - assigning one of said inertial measurement units to one or more tracklets based on the trajectory data of the corresponding tracklet and the inertial data within the tracklet time period of the corresponding tracklet such that the inertial data are consistent with the trajectory data of the respective tracklet using said processing unit.

Description

Method, apparatus and computer program for tracking of moving objects
The invention relates to a method for tracking of moving objects within a defined area using a camera system. The invention relates further to an apparatus for tracking of moving objects within a defined area as well as a computer program for this purpose.
Multiple people tracking (MPT) in video sequences has been an active field of research for decades. Several applications exist where trajectories are required for further analysis and interpretation. This could be to understand social interactions of humans, support urban planning [6], secure areas against dangerous behavior or to provide an automatic analysis of player’s performance in sports.
Most state-of-the-art MPT approaches tackle this problem in two steps: First, a person detector is applied to each frame of the image sequence. Then, an optimization problem is formulated, which clusters all detections such that ideally each cluster represents the trajectory of a person and false detections remain unconsidered.
A crucial part of this strategy is to derive a measure of whether two detections belong to the same person or not. Typically, this involves a motion model or person appearance. A motion model attempts to assign likelihoods to observed person movements. This is very generic and only depends on corner coordinates of detection boxes. However, as soon as the motion becomes more dynamic, simple motion models are insufficient and tracking degrades. In particular, most motion models assume low and constant velocities, which holds for pedestrians only within a short temporal window.
Another complementary strategy is to model relations between detections based on person appearance. Here, CNN-based feature representations are used to evaluate if two detections show the same person. A major advantage of utilizing appearance information over motion models is that it allows relating detections which are temporally far apart. This makes it possible to re-identify people even after long-term occlusions or if they temporarily fall out of the camera view.
Despite the enormous progress with artificial neural network-based appearance features, it remains challenging to differentiate persons wearing similar or identical clothing. A prototypical example for such a situation is sport player tracking, where team members wear almost identical outfits. Another challenge arises if people change appearance throughout a sequence, e.g. they put on a jacket or open an umbrella. Then the assumption of appearance constancy is violated and consequently tracking accuracy degrades.
In W. Jiang and Z. Yin. Combining passive visual cameras and active IMU sensors to track cooperative people. In International Conference on Information Fusion (Fusion), pages 1338-1345, 2015, a method for people tracking in videos is disclosed. Each person to be tracked is equipped with an inertial measurement unit (IMU). An IMU-equipped person has to be manually localized in the first video frame. Then, IMU information is used to recover the trajectory in situations where the visual tracker fails.
However, incorporating additional sensory input for the task of MPT creates a very different problem setup compared to the vision-only methods, because there exists no linking between the persons detected in the video frames and the inertial data recorded by the body-worn IMUs.
Hence, it is an object of the present invention to provide an improved method and apparatus which use both sources of information for tracking objects in an automatic manner without a manual localization. This object is solved by the inventive method according to claim 1, by the inventive apparatus according to claim 10 and by the inventive computer program according to claim 12.
In accordance with claim 1, a method for tracking of moving objects within a defined area is proposed. The inventive method comprises the steps of:
- providing a recorded video sequence of said defined area, the video sequence has a plurality of video frames and a temporal length;
- providing inertial data for at least one object, said inertial data for an object has been recorded by an inertial measurement unit arranged on and assigned to the corresponding object;
- detecting at least one object in said video frames of said video sequence using an object detector unit;
- generating a plurality of tracklets for said at least one detected object based on object detecting in said video sequence, each tracklet includes trajectory data of a trajectory of the corresponding detected object for a certain tracklet time period within the temporal length of said video sequence using a processing unit;
- assigning one of said inertial measurement units to one or more tracklets based on the trajectory data of the corresponding tracklet and the inertial data within the tracklet time period of the corresponding tracklet such that the inertial data are consistent with the trajectory data of the respective tracklet using said processing unit.
The inventive method uses a video sequence with a plurality of video frames (single images) and inertial measurement units attached to one or more persons to be tracked, especially to each person. Conceptually, the idea is to incorporate local inertial measurement unit motion measurements in order to disambiguate the assignment of detections to person trajectories. Since inertial measurement units are body-worn, the corresponding motion measurements are unique for each person. Similar to appearance, this property facilitates tracking and re-identifying persons even after long-term occlusions. Hence, such a tracking approach is predestined for scenarios where it is possible to equip people with an inertial measurement unit and appearance is less informative or not available. The latter could be the case if night-vision is used or privacy concerns prohibit processing or storing color images of people.
Even though motion information is available through IMU measurements, the task still poses a very challenging problem. From IMU data alone it is not possible to generate stable 3D trajectories due to unknown initial states and accumulating drift caused by double integration of acceleration signals. If this were possible, we could easily associate each detection box to the closest IMU trajectory projected to the image. Hence, instead of working on pre-computed IMU trajectories, we have to associate 3D orientation and acceleration measurements to 2D motion information observed in the video. For example, this requires relating IMU orientations, which are elements of SO(3), to image data being a two-dimensional pixel array. Further, IMU measurements often fit to several people at a time step and the person wearing the IMU might be occluded or out of the camera view.
To address this, a recorded video sequence is provided. The video sequence has a plurality of video frames (single video images) and includes a scene of the defined area with the moving objects to be tracked. The moving objects can be humans (people). The moving objects can also be of a non-human nature, e.g. vehicles. Furthermore, inertial data for at least one object within the defined area are provided. The inertial data have been recorded by an inertial measurement unit (IMU) which is attached to the object to be tracked. Each inertial measurement unit is assigned to the corresponding object to which it is attached. By localizing the inertial measurement unit, the corresponding object can be identified.
The video sequence and the inertial data can be recorded first and saved in a data memory (offline mode). However, it is also possible to use live video sequences and live inertial data just recorded. In the first step, at least one object, advantageously more objects or all objects in the video sequence, is detected in the video frames of the video sequence by an object detector unit. The object detector unit identifies the corresponding object by its coordinates (within the video frame) and/or its height and width. In this case, the detected object will be identified by a detection area (called detection box) within the video frame, whereby the detection area includes the image data of the detected object. Sometimes, a colored box is drawn around the object to visualize the detection. A detection of an object within the video sequence is called detected object.
In the next step, for each detected object within the video sequence a plurality of tracklets are generated by a processing unit. For each object which was detected in one or more coherent video images, one tracklet is generated. Each tracklet includes trajectory data of a trajectory of the detected object for a certain tracklet time period within the temporal length of said video sequence. A tracklet time period can have a length of 0.5 seconds to 1.5 seconds. Trajectory data of a tracklet contains the orientation of the object (in relation to the camera view) and/or the position in the defined area or in the video frame. Thus, a tracklet holds information about a short part of the track (trajectory) of the moving object recorded in the defined area. A tracklet can be generated based on a detection of an object in one video frame so that for each video frame in which the object was detected a tracklet is generated. However, in most cases it is more advantageous if a tracklet is generated based on a detection of one object in more than one video frame (e.g. 15 video frames to generate one tracklet).
Then, one of said inertial measurement units is assigned to one or more tracklets based on the trajectory data of the corresponding tracklet and the inertial data within the tracklet time period of the corresponding tracklet. In the best case, an inertial measurement unit is assigned to each tracklet. The assignment is conducted such that the inertial data are consistent with the trajectory data of the respective tracklet. Further, the assignments to the tracklets are performed simultaneously, e.g. by using a mathematical model.
Based on the assignment of the inertial measurement units to the tracklets, the object corresponding to a tracklet can be identified by its inertial measurement unit, because each measurement unit is assigned to exactly one object. Periods in which the object was hidden or was outside the recording area are no longer a problem, since the inventive method allows a re-assignment and re-identification.
According to an embodiment, the step of providing the video sequence includes recording the defined area using a camera system including at least one camera to generate the video sequence with the temporal length.
According to an embodiment, the inertial data are provided for a plurality of objects, wherein each object is equipped with at least one inertial measurement unit. It is possible to equip all objects to be tracked with at least one inertial measurement unit so that an inertial measurement unit is arranged on and is assigned to each object. However, it is also possible that not every object is equipped with an inertial measurement unit.
According to an embodiment, a plurality of objects in the video frames of the video sequence are detected, wherein a plurality of tracklets for each detected object is generated. The tracklets of a detected object are part of the complete trajectory of the detected object. The trajectory data of the tracklets of a detected object can form the trajectory of the detected object.
According to an embodiment, one of said inertial measurement units is assigned to each generated tracklet. According to an embodiment, with respect to one of said tracklets, an assignment probability is calculated for all inertial measurement units based on the inertial data of the inertial measurement units within the respective tracklet time period, the assignment probability indicating how consistent the inertial data of an inertial measurement unit are with the trajectory data of the respective tracklet, wherein one of said inertial measurement units is assigned to the respective tracklet based on the calculated assignment probabilities.
In other words, for each inertial measurement unit an assignment to a tracklet is evaluated to calculate an assignment probability. Then, an inertial measurement unit is assigned to said tracklet based on the calculated assignment probabilities. Further, this evaluation is conducted for all tracklets. The assignment probability could be, for example, a cost function. In an embodiment, the concrete assignment of an IMU to a tracklet is then determined taking into account all assignment probabilities of the IMUs and the tracklet data in a global context.
According to an embodiment, one of said inertial measurement units is assigned to a first tracklet and the same inertial measurement unit is assigned to a temporally following tracklet based on the calculated assignment probabilities, if the trajectory data of the first tracklet and the trajectory data of the second tracklet are reasonable with respect to spatio-temporal aspects and/or if the inertial data of said assigned inertial measurement unit for the tracklet time periods of the first and the second tracklet are reasonable with respect to movement aspects. Movement aspects can be orientation and/or acceleration aspects.
In other words, the trajectory data of the first tracklet and the corresponding inertial data of the inertial measurement unit within the tracklet time period of the first tracklet and the trajectory data of the second tracklet and the corresponding inertial data of the inertial measurement unit within the tracklet time period of the second tracklet have to be reasonable with respect to spatio-temporal aspects and/or movement aspects. According to an embodiment, at least one assignment of one of said inertial measurement units to one of said tracklets is determined based on all assignment probabilities of all inertial measurement units and tracklets and all trajectory data of all tracklets in a global context. This can be realized by using a mathematical model which takes into account all assignment probabilities of all inertial measurement units related to all tracklets as well as the trajectory data of all tracklets. Based on video object detection (e.g. based on the trajectory data), two tracklets that belong to the same detected object and follow each other in time are connected to each other.
According to an embodiment, an orientation is computed for at least one of the objects by inputting image data of the object into an artificial neural network which has learned an assignment of the image data of the object to an orientation of the object, wherein at least one tracklet for said at least one object is further generated based on said detected orientation of said object.
A person orientation, for example, is defined in terms of the normal vector of the torso’s coronal plane projected to the ground plane. The artificial neural network has learned the mapping of image data within the detection box and the orientation of the object shown in the image data of the detection box. However, this 2D projection for the orientation is replaceable by a 3D definition of the torso’s orientation.
Even though the global orientation of an object may be constant, the perceived orientation as seen from the camera varies. According to an embodiment, the detected orientation of the object is corrected based on the position of the detected object within the video frames. Consider a person walking on a straight line: in a global context this person has a constant orientation. However, due to perspective effects the perceived orientation of that person with respect to the viewpoint of the camera is different at every point in the image. This is compensated by considering a correction angle derived from the detection box within the image.
According to an embodiment, for at least one of the objects (advantageously for a plurality of objects or for all objects) a trajectory is determined based on the trajectory data of those tracklets to which the inertial measurement unit of the object has been assigned and the inertial data of the inertial measurement unit of the object.
In accordance with claim 10, an apparatus for tracking of moving objects within a defined area is proposed. The inventive apparatus comprises at least one inertial measurement unit, an object detector unit and a processing unit, wherein the apparatus is arranged for conducting the method as described above.
Further, in accordance with claim 12, a computer program for tracking of moving objects within a defined area is proposed. The computer program is arranged to execute the method as described above.
The invention is described in more detail in the following figures:
Figure 1 schematic representation of the inventive apparatus;
Figure 2 schematic representation of a camera view after the object detection by the object detection box;
Figure 3 graph representation of the generated tracklets;
Figure 4 representation of person orientation;
Figure 5 representation of the visual heading artificial neural network.
Figure 1 shows a representation of the inventive apparatus 10 for tracking of moving objects 200, 220 within a defined area 100. The defined area 100 corresponds to the recording view of the camera 12 depicted in figure 1. In the example of figure 1 there are two objects in the form of persons 200, 220 within the defined area 100. However, the invention is not limited to a certain number of persons.
The camera 12 is recording a video sequence of the defined area 100 including the persons 200 and 220. The recorded video sequence is transferred to a processing device 14 for further processing.
Furthermore, each person 200, 220 to be tracked is equipped with an inertial measurement unit 20, 22. The person 200 is equipped with the inertial measurement unit 20 and the person 220 is equipped with the inertial measurement unit 22. Each inertial measurement unit 20, 22 is recording inertial data of the movement of the corresponding object 200, 220. The recorded inertial data are transferred to the processing device 14.
The processing device 14 has an object detector unit 16 for detecting the objects 200, 220 to be tracked in the video sequence. The object detector unit 16 uses each video frame to recognize the objects 200, 220. Over the whole video sequence, each object can be tracked visually.
The result of the object detection of the object detector unit is transferred to a processing unit 18. Further, the inertial data from the inertial measurement units 20, 22 are also transferred to the processing unit 18 of the processing device 14.
The processing unit 18 is arranged to generate a plurality of tracklets for the detected objects 200, 220 based on object detection in the video sequence. Each tracklet includes trajectory data of a trajectory of the corresponding detected object 200, 220 for a certain tracklet time period which is derived from the object detection.
Furthermore, the processing unit 18 is arranged to assign the inertial measurement units 20, 22 to each tracklet based on the trajectory data of the corresponding tracklet and the inertial data within the tracklet time period of the corresponding tracklet such that the inertial data are consistent with the trajectory data of the respective tracklet.
If an inertial measurement unit has been assigned to each tracklet, a complete trajectory can be calculated based on the trajectory data of the tracklets with the same inertial measurement unit and the inertial data of the inertial measurement unit. The complete trajectory is saved in a digital memory 30.
Figure 2 shows the result of the object detection from the object detection unit 16 on the example of one video frame 41. If the objects 200, 220 are detected, a detection box 300, 320 is drawn around each detected person 200, 220. The detection box 300, 320 includes the image data of the corresponding detected object 200, 220 of the relevant video frame 41. Figures 3 to 5 show in detail the tracking according to the present invention. The invention follows the tracking-by-detection paradigm and groups detections into short tracklets in a first step. Then the tracking task can be formulated as assigning IDs (inertial measurement unit IDs) to tracklets, such that all tracklets with identical IDs correspond to person trajectories in the video.
In the context of the present invention, the tracking task is solved by incorporating motion information from body-worn inertial measurement units (IMUs). A graph labeling problem is formulated to find an optimal assignment of IMU IDs to tracklets, such that the resultant trajectories are visually smooth in the video and consistent with measured IMU orientations and accelerations.
The IMU signals are integrated at different conceptual levels: For each potential detection-to-IMU assignment, we require that the person orientation as seen by the camera is consistent with the corresponding IMU orientation. Orientation consistency alone is very ambiguous and hence the invention enforces spatio-temporal consistency if two detections are associated to the same IMU ID. Here, the complementary characteristics of short-term detection box motion features and long-term IMU acceleration features are employed. Figure 3 illustrates the graph and shows an exemplary labeling solution.
In order to solve the tracking task, an undirected weighted graph G = (V, E, C, L) is created, where V is the vertex set comprising all tracklets of the entire sequence and E is the edge set containing all edges that connect a pair of tracklets. Vertices and edges may obtain a label l ∈ L, where the label set L = {1, 2, 3, ..., P} contains an IMU ID for all P persons wearing an IMU. At this point, the notion of an assignment hypothesis H = (v, l) is introduced, which associates a label l ∈ L to a tracklet v ∈ V. Associated to each hypothesis are assignment costs c_v^l ∈ C and indicator variables x_v^l which take the value 1 if H is selected, and 0 otherwise. Additionally, for pairs of hypotheses sharing the same label and whose vertices are connected by an edge e ∈ E, compatibility costs c_e^l ∈ C are considered, modeling the likelihood that two tracklets belong to the same person.
The tracking task is then to select hypotheses for the entire sequence that minimize the total costs. This can be cast into a binary optimization problem:
\min_{x \in F} \sum_{l \in L} \Big( \sum_{v \in V} c_v^l \, x_v^l + \sum_{e = (v, v') \in E} c_e^l \, x_v^l \, x_{v'}^l \Big)    (1)
where the feasibility set F is subject to
\sum_{l \in L} x_v^l \le 1 \quad \forall v \in V    (2)
\sum_{v \in V_t} x_v^l \le 1 \quad \forall l \in L, \ \forall t    (3)
The subset V_t \subseteq V comprises all tracklets v that contain a detection in frame t. Eq. (2) ensures that each tracklet v is assigned to at most one label and Eq. (3) guarantees that a label is not assigned to more than one tracklet at a time.
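As an illustration of the optimization problem in Eqs. (1)-(3), the following Python sketch enumerates all feasible labelings of a tiny instance by brute force and picks the cheapest one. All tracklets, labels, frame memberships and costs are hypothetical toy values introduced only for this example; a realistic instance is far too large for enumeration and is instead solved as a binary (linear) program, as described further below.

    import itertools

    # Hypothetical toy instance: 3 tracklets, 2 IMU labels.
    V = ["v1", "v2", "v3"]                                  # tracklets (graph vertices)
    L = [1, 2]                                              # IMU IDs (labels)
    frames = {"v1": {0, 1}, "v2": {1, 2}, "v3": {3, 4}}     # frames covered by each tracklet
    E = {("v1", "v3"), ("v2", "v3")}                        # edges between tracklets

    # Unary assignment costs c_v^l (negative = likely assignment) and
    # pairwise compatibility costs c_e^l for hypotheses sharing the same label.
    c_unary = {("v1", 1): -2.0, ("v1", 2): 0.5,
               ("v2", 1): 1.0,  ("v2", 2): -1.5,
               ("v3", 1): -1.0, ("v3", 2): -0.5}
    c_pair = {("v1", "v3", 1): -0.8, ("v1", "v3", 2): 0.3,
              ("v2", "v3", 1): 0.4,  ("v2", "v3", 2): -0.6}

    def feasible(labeling):
        """Eq. (2) holds by construction (one label or None per tracklet).
        Eq. (3): a label may not be used by two tracklets that share a frame."""
        for (v, w) in itertools.combinations(V, 2):
            lv, lw = labeling[v], labeling[w]
            if lv is not None and lv == lw and frames[v] & frames[w]:
                return False
        return True

    def total_cost(labeling):
        cost = sum(c_unary[v, l] for v, l in labeling.items() if l is not None)
        for (v, w) in E:
            l = labeling[v]
            if l is not None and l == labeling[w]:
                cost += c_pair[v, w, l]
        return cost

    best = min(
        (dict(zip(V, assign))
         for assign in itertools.product(L + [None], repeat=len(V))
         if feasible(dict(zip(V, assign)))),
        key=total_cost)
    print("optimal labeling:", best, "cost:", total_cost(best))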
Next, the unary and pairwise potentials are described in detail. Specifically, consistency features are introduced which are later mapped to costs c_v^l and c_e^l. In order to provide a measure for the likelihood of an assignment hypothesis H = (v, l), the person orientation in each detection box of tracklet v is calculated and compared to the temporally aligned orientation measurements of IMU l.
The person orientation is defined as the normal vector of the torso’s coronal plane projected to the ground plane, as illustrated in Figure 4 (a). The projected normal is used as it comprises fewer degrees of freedom and people usually move in a rather upright pose.
Hence, given the image data I_d of detection d, the invention calculates the heading n_d of the person. However, the observed heading in I_d depends on the person position in the image, see Figure 4 (b). To see this, consider a person walking on a straight line parallel to the image plane of a non-moving camera. In a global context this person has a constant orientation. However, due to perspective effects the perceived orientation of that person with respect to the viewpoint of the camera is different at every point in the image. To compensate this, a correction angle derived from the detection box within the image is considered. Let α_d be the angle between the vector defined by the camera center and the box position p_d, and the depth axis of the camera. In order to compensate the perspective influence, the perceived orientation is rotated by α_d to obtain the prediction n_d, cf. Figure 4 (b).
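A minimal numerical sketch of this perspective correction (Python with NumPy) is given below. The pinhole-camera parameters, the detection box position and the predicted heading are hypothetical values, and the sign convention of the rotation depends on the chosen camera and ground-plane coordinate systems; the sketch only illustrates the operation described above, i.e. rotating the perceived ground-plane heading by the angle α_d.

    import numpy as np

    def correction_angle(box_center_x, fx, cx):
        """Angle alpha_d between the ray through the detection box position and the
        camera depth axis, from the horizontal image coordinate (pinhole model)."""
        return np.arctan2(box_center_x - cx, fx)

    def correct_heading(heading_2d, alpha_d):
        """Rotate the perceived ground-plane heading by alpha_d to compensate
        the perspective effect and renormalize it."""
        c, s = np.cos(alpha_d), np.sin(alpha_d)
        R = np.array([[c, -s], [s, c]])
        n = R @ np.asarray(heading_2d, dtype=float)
        return n / np.linalg.norm(n)

    # Hypothetical camera: focal length 1000 px, principal point at x = 960 px.
    alpha = correction_angle(box_center_x=1500.0, fx=1000.0, cx=960.0)
    print(correct_heading([0.0, 1.0], alpha))   # corrected heading on the unit circle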
In order to obtain the person heading from image data, an artificial neural network is used to learn the mapping from the image data I_d to the heading n_d. More specifically, a VGG16 pretrained on ImageNet is used to regress the heading, which also incorporates the aforementioned perspective correction (PC) in the last layer. This network, referred to as the Visual Heading Network (VHN), is shown in Figure 5 as a graphical illustration of the network architecture.
In an example setting of the present invention, IMUs are consistently placed at the back of each person such that the local sensor z-axis corresponds to the normal vector of the torso’s coronal plane. The measured torso orientation vector n_{l,t} of IMU l at time t is defined as:
n_{l,t} = P \, (R_{l,t} \, e_z),
where e_z = (0, 0, 1)^T is the local z-axis vector, R_{l,t} \in SO(3) is the measured IMU orientation mapping the local sensor coordinate frame to the global coordinate frame and P projects the normal vector to the ground plane.
Finally, we define the unary orientation feature representing the likelihood of hypothesis H as
f_{or}(H) = \frac{1}{N_d} \sum_{d \in v} \Phi\big(n_d, \, n_{l, t_d}\big)    (5)
where \Phi denotes the cosine similarity, N_d corresponds to the number of detections of tracklet v and t_d represents the time stamp of a detection d.
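The orientation consistency check can be sketched as follows (Python/NumPy). The function imu_heading implements n_{l,t} = P(R_{l,t} e_z) with the projection P realized as dropping the vertical component, and orientation_feature averages the cosine similarities over the detections of one tracklet; the rotation matrices and predicted headings are synthetic example values.

    import numpy as np

    def imu_heading(R_lt):
        """Project the rotated local z-axis (torso normal) to the ground plane (x, y)."""
        e_z = np.array([0.0, 0.0, 1.0])
        n3d = R_lt @ e_z
        n2d = n3d[:2]                       # P: drop the vertical component
        return n2d / np.linalg.norm(n2d)

    def orientation_feature(pred_headings, imu_rotations):
        """Mean cosine similarity between headings n_d predicted from the video and
        the temporally aligned IMU headings n_{l, t_d}."""
        sims = []
        for n_d, R_lt in zip(pred_headings, imu_rotations):
            n_lt = imu_heading(R_lt)
            n_d = np.asarray(n_d, dtype=float)
            n_d = n_d / np.linalg.norm(n_d)
            sims.append(float(n_d @ n_lt))
        return float(np.mean(sims))

    def heading_rotation(theta):
        """Hypothetical IMU orientation whose local z-axis points horizontally,
        rotated by theta in the ground plane."""
        Rx = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.0, -1.0],
                       [0.0, 1.0, 0.0]])    # tilt the local z-axis into the ground plane
        c, s = np.cos(theta), np.sin(theta)
        Rz = np.array([[c, -s, 0.0],
                       [s, c, 0.0],
                       [0.0, 0.0, 1.0]])    # heading rotation in the ground plane
        return Rz @ Rx

    # Two detections of one tracklet with slightly rotated IMU readings.
    R1, R2 = heading_rotation(0.0), heading_rotation(0.1)
    pred = [(0.0, -1.0), (0.05, -1.0)]      # headings regressed from the detection crops
    print(orientation_feature(pred, [R1, R2]))   # close to 1.0 for a consistent assignment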
Further, pairwise features are defined which represent the compatibility of two hypotheses H = (v, l) and H' = (v', l). Two hypotheses are said to be compatible if the assignment of a joint label l to v and v' is reasonable with respect to spatio-temporal aspects.
Box Features. Within a short temporal window a person cannot move arbitrarily fast. Hence, the tracklets of a compatible hypothesis pair should be spatially close and corresponding detection boxes should be similar in size. For each detection box d, a rough 3D position estimate is calculated by projecting the detection box foot point to the 3D ground plane of the scene. Hence, for detections d of v and d' of v', let v_{3D}(d, d') denote the velocity in 3D from d to d'. Let N(v, v') be the set of all pairs of detections between H and H' considered for the feature. The velocity feature between H and H' can be defined as
f_{vel}(H, H') = \max_{(d, d') \in N(v, v')} \lVert v_{3D}(d, d') \rVert.
Additionally, we compare the detection box heights of both hypotheses. Let h_d denote the height of detection box d in pixels. The compatibility measure \Delta_h(d, d') is defined based on the heights of detections d and d' according to
\Delta_h(d, d') = \gamma(d, d') \, \frac{\lvert h_d - h_{d'} \rvert}{\max(h_d, h_{d'})},
where the factor \gamma(d, d') in front of the fraction compensates for the temporal distance between d and d'. Finally, a box height feature is defined as
f_{height}(H, H') = \max_{(d, d') \in N(v, v')} \Delta_h(d, d').
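The box features can be sketched as follows in Python. The 3D positions, the frame rate and the temporal damping factor gamma are illustrative assumptions (the exact factor used in the invention is not reproduced here); the sketch only shows how a 3D velocity between two detections and a relative box-height difference could be computed.

    import numpy as np

    def velocity_3d(p_d, p_dp, t_d, t_dp, fps=25.0):
        """3D velocity between the ground-plane positions of detections d and d'
        (positions obtained by projecting the box foot points to the ground plane)."""
        dt = abs(t_dp - t_d) / fps
        return (np.asarray(p_dp) - np.asarray(p_d)) / dt

    def f_vel(pairs):
        """Velocity feature: largest 3D speed over the considered detection pairs N(v, v')."""
        return max(np.linalg.norm(velocity_3d(*pair)) for pair in pairs)

    def height_compatibility(h_d, h_dp, t_d, t_dp, fps=25.0):
        """Relative box-height difference, damped for pairs far apart in time
        (the damping factor used here is a hypothetical choice)."""
        gamma = 1.0 / max(abs(t_dp - t_d) / fps, 1.0 / fps)
        return gamma * abs(h_d - h_dp) / max(h_d, h_dp)

    # Hypothetical detection pair: positions in metres, frame indices, box heights in pixels.
    pairs = [((0.0, 0.0, 0.0), (0.6, 0.1, 0.0), 10, 22)]
    print("f_vel   =", f_vel(pairs))
    print("delta_h =", height_compatibility(180.0, 168.0, 10, 22))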
Both f_{vel} and f_{height} are features which are meaningful within short temporal windows. However, one object of this invention is to focus on sequences where people get occluded or fall out of the camera view quite often and for longer time periods. Hence, in the following we utilize acceleration measurements to link hypotheses which cover larger temporal horizons.
Acceleration Feature. Ideally, the position p_{t_1} \in R^3 at time t_1 of an IMU can be recovered by double integration of the corresponding acceleration signal a according to
p_{t_1} = p_{t_0} + v_{t_0} (t_1 - t_0) + \int_{t_0}^{t_1} \int_{t_0}^{t} a(\tau) \, d\tau \, dt,    (10)
where t_0, p_{t_0} and v_{t_0} denote initial time, initial position and initial velocity, respectively. Please note that a in this case represents the gravity-free acceleration in global coordinates.
Let p_{t_0} be the 3D position of detection d and p_{t_1} the 3D position of d'. After double integration of the acceleration signal, Eq. (10) can be solved for the initial velocity, which is denoted as v_{IMU}(d, d'). Concurrently, a person's velocity v_d at the initial time t_0 is approximated in terms of finite differences of neighboring detections of d. Hence, for a compatible hypothesis pair H and H' the velocity differences
\lVert v_{IMU}(d, d') - v_d \rVert
should be small for all possible detection pairs d \in v and d' \in v'. The acceleration feature is defined as the set of all such differences according to
f_{acc}(H, H') = \big\{ \lVert v_{IMU}(d, d') - v_d \rVert \, : \, d \in v, \ d' \in v' \big\}.
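A small numerical sketch of the acceleration feature is given below. It double-integrates a synthetic, gravity-free acceleration signal in global coordinates, solves Eq. (10) for the initial velocity v_IMU(d, d') and compares it with the finite-difference velocity v_d from the video; the signal, sampling rate and detection positions are hypothetical.

    import numpy as np

    def double_integral(acc, dt):
        """Discrete approximation of the double integral of the acceleration in Eq. (10):
        the displacement contributed by the acceleration alone."""
        vel = np.cumsum(acc, axis=0) * dt              # inner integral
        return np.sum(vel, axis=0) * dt                # outer integral

    def v_imu(p_t0, p_t1, acc, dt):
        """Initial velocity v_{t0} obtained by solving Eq. (10) for v_{t0}."""
        T = acc.shape[0] * dt
        return (np.asarray(p_t1) - np.asarray(p_t0) - double_integral(acc, dt)) / T

    def f_acc(detection_pairs, acc_signal, dt):
        """Velocity differences ||v_IMU(d, d') - v_d|| for the considered detection pairs."""
        return [float(np.linalg.norm(v_imu(p0, p1, acc_signal[i0:i1], dt) - np.asarray(v_d)))
                for (p0, p1, i0, i1, v_d) in detection_pairs]

    # Synthetic example: constant acceleration, so the recovered initial velocity is exact
    # up to discretisation error.
    dt = 0.01
    acc = np.tile(np.array([[0.2, 0.0, 0.0]]), (100, 1))           # 1 s of IMU samples
    p0, v0 = np.zeros(3), np.array([1.0, 0.0, 0.0])
    p1 = p0 + v0 * 1.0 + 0.5 * np.array([0.2, 0.0, 0.0]) * 1.0**2  # end position after 1 s
    print(f_acc([(p0, p1, 0, 100, v0)], acc, dt))                  # difference close to zero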
The graph labeling problem defined in Eq. (1) is a binary quadratic program. This program can be reformulated as an equivalent binary linear program (BLP) by introducing slack variables: each product of variables x_v^l \, x_{v'}^l is replaced by a new variable z_{l,v,v'} and the following constraints are added:
z_{l,v,v'} \le x_v^l, \qquad z_{l,v,v'} \le x_{v'}^l, \qquad z_{l,v,v'} \ge x_v^l + x_{v'}^l - 1.
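For illustration, the sketch below builds exactly this linearized BLP for the same hypothetical toy instance as in the brute-force example above, using the open-source PuLP modeller (an arbitrary choice of solver interface, not prescribed by the invention).

    import pulp

    # Hypothetical toy data (tracklets, labels, frames, costs) as in the brute-force sketch.
    V = ["v1", "v2", "v3"]
    L = [1, 2]
    frames = {"v1": {0, 1}, "v2": {1, 2}, "v3": {3, 4}}
    E = [("v1", "v3"), ("v2", "v3")]
    c_unary = {("v1", 1): -2.0, ("v1", 2): 0.5, ("v2", 1): 1.0,
               ("v2", 2): -1.5, ("v3", 1): -1.0, ("v3", 2): -0.5}
    c_pair = {("v1", "v3", 1): -0.8, ("v1", "v3", 2): 0.3,
              ("v2", "v3", 1): 0.4, ("v2", "v3", 2): -0.6}

    prob = pulp.LpProblem("tracklet_labeling", pulp.LpMinimize)
    x = {(v, l): pulp.LpVariable(f"x_{v}_{l}", cat="Binary") for v in V for l in L}
    z = {(l, v, w): pulp.LpVariable(f"z_{l}_{v}_{w}", cat="Binary") for (v, w) in E for l in L}

    # Objective of Eq. (1) with products x_v^l * x_v'^l replaced by slack variables z.
    prob += (pulp.lpSum(c_unary[v, l] * x[v, l] for v in V for l in L)
             + pulp.lpSum(c_pair[v, w, l] * z[l, v, w] for (v, w) in E for l in L))

    for v in V:                                   # Eq. (2): at most one label per tracklet
        prob += pulp.lpSum(x[v, l] for l in L) <= 1
    for t in set().union(*frames.values()):       # Eq. (3): one tracklet per label and frame
        for l in L:
            prob += pulp.lpSum(x[v, l] for v in V if t in frames[v]) <= 1
    for (v, w) in E:                              # linearization constraints for z
        for l in L:
            prob += z[l, v, w] <= x[v, l]
            prob += z[l, v, w] <= x[w, l]
            prob += z[l, v, w] >= x[v, l] + x[w, l] - 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print({k: int(var.value()) for k, var in x.items() if var.value() > 0.5})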
Tracklet generation. Reliable tracklets can be generated by grouping detections. Temporally subsequent detections can be connected if their intersection over union is above 0.7. For example, the maximal tracklet length can be set to 15 frames.
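A simple Python sketch of such a tracklet generation step is shown below. Detections are assumed to be axis-aligned boxes (x1, y1, x2, y2) stored per frame, the greedy matching strategy is an illustrative assumption, and the IoU threshold of 0.7 and the maximal length of 15 frames are the example values mentioned above.

    def iou(a, b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def build_tracklets(detections, iou_thr=0.7, max_len=15):
        """Greedily connect temporally subsequent detections into tracklets.
        detections: dict frame -> list of boxes. Returns a list of tracklets,
        each being a list of (frame, box) tuples."""
        tracklets, open_tracklets = [], []
        for t in sorted(detections):
            still_open = []
            unmatched = list(detections[t])
            for tr in open_tracklets:
                last_frame, last_box = tr[-1]
                best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
                if (best is not None and last_frame == t - 1
                        and iou(last_box, best) > iou_thr and len(tr) < max_len):
                    tr.append((t, best))
                    unmatched.remove(best)
                    still_open.append(tr)
                else:
                    tracklets.append(tr)            # close the tracklet
            open_tracklets = still_open + [[(t, b)] for b in unmatched]
        return tracklets + open_tracklets

    # Hypothetical detections for three frames of one slowly moving person.
    dets = {0: [(100, 50, 150, 200)], 1: [(102, 50, 152, 200)], 2: [(104, 51, 154, 201)]}
    print([len(tr) for tr in build_tracklets(dets)])   # -> [3]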
Visual Heading Network. The overall network architecture is depicted in Figure 5. It contains the VGG16 architecture, which is truncated after its last pooling layer. The layers FC1, FC2 and FC3 are fully connected layers with 16, 16, and 2 neurons, respectively. To output an orientation vector n that is within the unit sphere S^1, a hyperbolic tangent activation function is used. The VGG16 is normally trained on ImageNet with an invariance for horizontal flipping. To undo this, the layers FC1, FC2 and FC3 can be trained together with the last convolutional layer of VGG16, while keeping the weights of all other layers fixed. During training, dropout layers with p = 0.3 between the fully connected layers are added to avoid overfitting. Finally, the network parameters are learned by minimizing the cost function (5), for given ground-truth detections and corresponding IMU heading vectors of the VIMPT training sequence.
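A minimal PyTorch sketch of such a network is shown below, assuming torchvision is available. It truncates a pretrained VGG16 after its last pooling layer, adds FC1, FC2 and FC3 with 16, 16 and 2 neurons, a tanh output and dropout with p = 0.3, and freezes all weights except the last convolutional layer and the new head. The ReLU activations between the fully connected layers, the 224 x 224 input size and the omission of the perspective-correction layer are assumptions made for this sketch.

    import torch
    import torch.nn as nn
    from torchvision import models

    class VisualHeadingNetwork(nn.Module):
        """Sketch of the VHN: VGG16 features + FC1/FC2/FC3 (16, 16, 2 neurons) + tanh.
        The perspective correction applied in the last layer of the original network
        is omitted here (it can be applied as a 2D rotation of the output)."""
        def __init__(self):
            super().__init__()
            vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
            self.features = vgg.features              # truncated after the last pooling layer
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(512 * 7 * 7, 16), nn.ReLU(), nn.Dropout(p=0.3),   # FC1
                nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.3),            # FC2
                nn.Linear(16, 2), nn.Tanh(),                                # FC3, 2D heading
            )
            # Freeze everything except the last convolutional layer and the FC head.
            for p in self.features.parameters():
                p.requires_grad = False
            for p in self.features[-3].parameters():  # last conv layer of VGG16
                p.requires_grad = True

        def forward(self, x):                         # x: (N, 3, 224, 224) person crops
            return self.head(self.features(x))

    model = VisualHeadingNetwork()
    print(model(torch.randn(1, 3, 224, 224)).shape)   # -> torch.Size([1, 2])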
Graph edge settings. In the graph G, weighted edges e element of E are created between two nodes v and v’ in the following cases. If the shortest temporal dis- tance between all detections of v and v’ is at most 12 frames, a short-term edge can be established associated to costs derived from box features. Similarly, long- term edges can be established associated to costs derived from acceleration fea- tures between all detections of v and v’ if the temporal distance is between 12 and 150 frames.
Feature to cost mapping. In order to transform unary and pairwise features to costs, different strategies can be used. For orientation and box features a logistic regression model is learned that predicts optimal costs based on ground-truth trajectories in the training sequence of the dataset. This did not work satisfactorily for the acceleration feature. We observed that noise in 3D position estimates destroys much of the expressiveness of this feature. Instead, a threshold d can be used to indicate if two hypotheses are highly incompatible. Hence, a high constant cost can be assigned to an edge if min f_{acc}(H, H') > d.
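Both cost-mapping strategies can be sketched as follows (Python with scikit-learn): a logistic regression trained on hypothetical labelled feature values, mapped to costs via the negative log-odds of the predicted probability, and a simple threshold rule for the acceleration feature. The log-odds mapping, the training data and the threshold value are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data: feature values with binary "same person" labels,
    # e.g. orientation features of correct (1) and incorrect (0) assignments.
    features = np.array([[0.95], [0.90], [0.85], [0.20], [0.10], [-0.30]])
    labels = np.array([1, 1, 1, 0, 0, 0])

    clf = LogisticRegression().fit(features, labels)

    def feature_to_cost(f):
        """Map a feature value to a cost as the negative log-odds of the
        'same person' probability (negative cost = likely match)."""
        p = clf.predict_proba(np.array([[f]]))[0, 1]
        return float(-np.log(p / (1.0 - p)))

    def acceleration_cost(f_acc_values, threshold=2.0, high_cost=1e3):
        """Threshold rule for the acceleration feature: a high constant cost if even the
        smallest velocity difference exceeds the threshold d, otherwise no extra cost."""
        return high_cost if min(f_acc_values) > threshold else 0.0

    print(feature_to_cost(0.92), feature_to_cost(0.05))
    print(acceleration_cost([2.5, 3.1]), acceleration_cost([0.4, 2.5]))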
Reference numbers
10 apparatus/system
12 camera
14 processing device
16 object detection unit
18 processing unit
20 IMU
22 IMU
30 digital memory
41 video frame
100 defined area
200 object/person
220 object/person
300 detection box
320 detection box

Claims

Patent claims
1. Method for tracking of moving objects within a defined area, wherein the method comprising the steps of:
- providing a recorded video sequence of said defined area, the video sequence has a plurality of video frames and a temporal length;
- providing inertial data for at least one object, said inertial data for an object has been recorded by an inertial measurement unit arranged on and assigned to the corresponding object;
- detecting at least one object in said video frames of said video sequence using an object detector unit;
- generating a plurality of tracklets for said at least one detected object based on object detecting in said video sequence, each tracklet includes trajectory data of a trajectory of the corresponding detected object for a certain tracklet time period within the temporal length of said video sequence using a processing unit;
- assigning one of said inertial measurement units to one or more tracklets based on the trajectory data of the corresponding tracklet and the inertial data within the tracklet time period of the corresponding tracklet such that the inertial data are consistent with the trajectory data of the respective tracklet using said processing unit.
2. Method according to claim 1, wherein the step of providing said video sequence includes recording said defined area using a camera system including at least one camera to generate the video sequence with the temporal length.
3. Method according to claim 1 or 2, wherein inertial data are provided for a plurality of objects, wherein each object is equipped with at least one inertial measurement unit.
4. Method according to one of the preceding claims, wherein a plurality of objects in said video frames of said video sequence are detected, wherein a plurality of tracklets for each detected object is generated.
5. Method according to one of the preceding claims, wherein one of said inertial measurement units is assigned to each generated tracklet.
6. Method according to one of the preceding claims, wherein, with respect to one of said tracklets, an assignment probability is calculated for all inertial measurement units based on the inertial data of the inertial measurement units within the respective tracklet time period, the assignment probability indicating how consistent the inertial data of an inertial measurement unit are with the trajectory data of the respective tracklet, wherein one of said inertial measurement units is assigned to the respective tracklet based on the calculated assignment probabilities.
7. Method according to claim 6, wherein one of said inertial measurement units is assigned to a first tracklet and the same inertial measurement unit is assigned to a temporally following second tracklet based on the calculated assignment probabilities, if the trajectory data of the first tracklet and the trajectory data of the second tracklet are reasonable with respect to spatio-temporal aspects and/or if the inertial data of said assigned inertial measurement unit within the tracklet time periods of the first and the second tracklet are reasonable with respect to movement aspects.
8. Method according to claim 6 or 7, wherein at least one assignment of one of said inertial measurement units to one of said tracklets is determined based on all assignment probabilities of all inertial measurement units and tracklets and all trajectory data of all tracklets in a global context.
9. Method according to one of the preceding claims, wherein an orientation is detected for at least one of the objects by inputting image data of the object into an artificial neural network which has learned an assignment of the image data of the object to an orientation of the object, wherein at least one tracklet for said at least one object is further generated based on said detected orientation of said object.
10. Method according to claim 9, wherein the detected orientation of the object is corrected based on the position of the detected object within the video frames.
11. Method according to one of the preceding claims, wherein for at least one of the objects a trajectory is determined based on the trajectory data of those tracklets to which the inertial measurement unit of the object has been assigned and the inertial data of the inertial measurement unit of the object.
12. Apparatus for tracking of moving objects within a defined area, wherein the apparatus comprises at least one inertial measurement unit, an object detector unit and a processing unit, wherein the apparatus is arranged for conducting the method according to one of the preceding claims.
13. Apparatus according to claim 12, wherein the apparatus comprises a camera system with at least one camera for recording the video sequence.
14. Computer program arranged to execute the method according to one of claims 1 to 11 when the computer program is running on a computer.

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2019/063881 WO2020239210A1 (en) 2019-05-28 2019-05-28 Method, apparatus and computer program for tracking of moving objects
DE112019007390.7T DE112019007390T5 (en) 2019-05-28 2019-05-28 Method, device and computer program for tracking moving objects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/063881 WO2020239210A1 (en) 2019-05-28 2019-05-28 Method, apparatus and computer program for tracking of moving objects

Publications (1)

Publication Number Publication Date
WO2020239210A1 true WO2020239210A1 (en) 2020-12-03

Family

ID=66826944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/063881 WO2020239210A1 (en) 2019-05-28 2019-05-28 Method, apparatus and computer program for tracking of moving objects

Country Status (2)

Country Link
DE (1) DE112019007390T5 (en)
WO (1) WO2020239210A1 (en)


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
FARAZI HAFEZ ET AL: "Real-Time Visual Tracking and Identification for a Team of Homogeneous Humanoid Robots", 1 November 2017, INTERNATIONAL CONFERENCE ON FINANCIAL CRYPTOGRAPHY AND DATA SECURITY; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 230 - 242, ISBN: 978-3-642-17318-9, XP047473331 *
LEE DONGHOON ET AL: "OPTIMUS: online persistent tracking and identification of many users for smart spaces", MACHINE VISION AND APPLICATIONS, SPRINGER VERLAG, DE, vol. 25, no. 4, 1 April 2014 (2014-04-01), pages 901 - 917, XP035368589, ISSN: 0932-8092, [retrieved on 20140401], DOI: 10.1007/S00138-014-0607-4 *
NAZARE ANTONIO CARLOS ET AL: "Content-Based Multi-Camera Video Alignment using Accelerometer Data", 2018 15TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), IEEE, 27 November 2018 (2018-11-27), pages 1 - 6, XP033518265, DOI: 10.1109/AVSS.2018.8639468 *
ROBERTO HENSCHEL ET AL: "Simultaneous Identification and Tracking of Multiple People using Video and IMUs", 20 June 2019 (2019-06-20), XP055664340, Retrieved from the Internet <URL:http://openaccess.thecvf.com/content_CVPRW_2019/papers/BMTT/Henschel_Simultaneous_Identification_and_Tracking_of_Multiple_People_Using_Video_and_CVPRW_2019_paper.pdf> [retrieved on 20200203] *
W. JIANG; Z. YIN: "Combining passive visual cameras and active IMU sensors to track cooperative people", INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION), 2015, pages 1338 - 1345, XP033204841

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581761A (en) * 2020-12-07 2021-03-30 浙江宇视科技有限公司 Collaborative analysis method, device, equipment and medium for 5G mobile Internet of things node
CN112581761B (en) * 2020-12-07 2022-04-19 浙江宇视科技有限公司 Collaborative analysis method, device, equipment and medium for 5G mobile Internet of things node
CN114359976A (en) * 2022-03-18 2022-04-15 武汉北大高科软件股份有限公司 Intelligent security method and device based on person identification
CN114359976B (en) * 2022-03-18 2022-06-14 武汉北大高科软件股份有限公司 Intelligent security method and device based on person identification
CN114973153A (en) * 2022-07-27 2022-08-30 广州宏途数字科技有限公司 Smart campus security detection method, device, equipment and storage medium
CN114973153B (en) * 2022-07-27 2022-11-04 广州宏途数字科技有限公司 Smart campus security detection method, device, equipment and storage medium
CN116052095A (en) * 2023-03-31 2023-05-02 松立控股集团股份有限公司 Vehicle re-identification method for smart city panoramic video monitoring

Also Published As

Publication number Publication date
DE112019007390T5 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
US10853970B1 (en) System for estimating a three dimensional pose of one or more persons in a scene
Toft et al. Long-term visual localization revisited
Dai et al. Rgb-d slam in dynamic environments using point correlations
WO2020239210A1 (en) Method, apparatus and computer program for tracking of moving objects
CN112785702B (en) SLAM method based on tight coupling of 2D laser radar and binocular camera
US8711221B2 (en) Visually tracking an object in real world using 2D appearance and multicue depth estimations
Hu et al. A sliding-window visual-IMU odometer based on tri-focal tensor geometry
Huang et al. Structure from motion technique for scene detection using autonomous drone navigation
Henschel et al. Simultaneous identification and tracking of multiple people using video and imus
US10347001B2 (en) Localizing and mapping platform
Ruotsalainen et al. Visual-aided two-dimensional pedestrian indoor navigation with a smartphone
Ruotsalainen et al. Heading change detection for indoor navigation with a smartphone camera
CN110874910B (en) Road surface alarm method, device, electronic equipment and readable storage medium
Antonucci et al. Performance assessment of a people tracker for social robots
Sun et al. When we first met: Visual-inertial person localization for co-robot rendezvous
Nguyen et al. Confidence-aware pedestrian tracking using a stereo camera
US9990857B2 (en) Method and system for visual pedometry
Manderson et al. Texture-aware SLAM using stereo imagery and inertial information
JP2019121019A (en) Information processing device, three-dimensional position estimation method, computer program, and storage medium
Shimizu et al. LIDAR-based body orientation estimation by integrating shape and motion information
CN110052020A (en) Equipment, the control device and method run in mancarried device or robot system
Ingwersen et al. SportsPose-A Dynamic 3D sports pose dataset
US10977810B2 (en) Camera motion estimation
US20200226787A1 (en) Information processing apparatus, information processing method, and program
Minaeian et al. Crowd detection and localization using a team of cooperative UAV/UGVs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19730108

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19730108

Country of ref document: EP

Kind code of ref document: A1