GB2551239A - A computer implemented method for tracking an object in a 3D scene - Google Patents
A computer implemented method for tracking an object in a 3D scene
- Publication number
- GB2551239A GB2551239A GB1704261.5A GB201704261A GB2551239A GB 2551239 A GB2551239 A GB 2551239A GB 201704261 A GB201704261 A GB 201704261A GB 2551239 A GB2551239 A GB 2551239A
- Authority
- GB
- United Kingdom
- Prior art keywords
- detected
- trajectory
- image sensor
- features
- human
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/292—Multi-camera tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/14—Picture signal circuitry for video frequency region
- H04N5/144—Movement detection
- H04N5/145—Movement estimation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30232—Surveillance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Studio Devices (AREA)
- Closed-Circuit Television Systems (AREA)
Abstract
Method for tracking an object in a 3D scene captured by multiple image sensors and inferring the object's 3D position using captured images, wherein: each image sensor has a floor plane equation; the object is tracked by each image sensor in relation to the floor plane defined by that image sensor's equation. Tracking may be performed in real-time without calibration. Object may be a human person and detected features may include face, head, shoulders, full figure, eyes, lips, ears, hands. Image sensor coordinates and field of view may be detected. Field of view may be rotated. Object trajectories from plural image sensors may be merged and mapped to three dimensional locations. Features may be used to estimate body proportions using anthropometry look-up tables. Feature metrics may include size, angle, type, colour, temperature, may have 2D parameters and may be modified using correction coefficients in response to thresholds. Distances and depth may be measured. Trajectory or route may be projected to ground plane. Trajectories and characteristics may be interpolated to create points spaced equally in time. Future and past trajectories may be estimated. Track record metadata (e.g. speed, acceleration, motion, movement direction) may be generated to analyse and classify object behaviour.
Description
A COMPUTER IMPLEMENTED METHOD FOR TRACKING AN OBJECT IN A 3D SCENE
BACKGROUND OF THE INVENTION
1. Field of the Invention
The field of the invention relates to video and image analysis, and specifically to tracking an object within a scene captured by a plurality of image sensors and inferring a 3-dimensional position of the object within the scene, and to related systems, devices and computer program products. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
2. Technical Background
Video based tracking from multiple cameras enables a broad array of applications, such as video surveillance, human computer interaction, computer animation, gesture analysis, crowd analysis and so on. The multiple cameras are often positioned in order to provide optimum coverage of the scene that is surveyed. However, tracking from multiple cameras with different fields of view may become challenging when an object appears in one camera's field of view and either reappears in another one or completely disappears. Additionally, an object may appear differently from one camera to another due to changes in, for example, camera location or lighting conditions. It is therefore crucial for the information provided by each of the multiple cameras to be presented in a coherent way, such that camera handoff is achieved smoothly.
The present invention addresses the above problems, as well as other problems not described above.
3. Discussion of Related Art
The prior art of video and image analysis and of object detection and tracking is broad and deep. Reference may be made to US8472746B2, US8565512B2, US6252982B1, US8098891B2, US8537200B2, US9225970B2, US8447141B2, US20080137989A1. US20140072170A1 for example discloses a method for performing video content analysis to detect humans or other objects of interest. The detection of humans may be used to count a number of humans, to determine a location of each human and/or perform crowd analyses of monitored areas. The method comprises determining foreground pixels and comparing the foreground pixels to predetermined shapes. US8098891B2 for example discloses a method to perform multi-human three-dimensional (3D) tracking with a plurality of cameras. At each view, a module receives each camera output and provides 2D human detection candidates. A plurality of 2D tracking modules is connected to the CNNs (convolutional neural networks), each 2D tracking module managing 2D tracking independently. A 3D tracking module is connected to the 2D tracking modules to receive promising 2D tracking hypotheses. The 3D tracking module selects trajectories from the 2D tracking modules to generate 3D tracking hypotheses.
Reference may also be made to WO 2016/120634, WO 2017/009649 and WO 2016/193716, the contents of which are hereby incorporated by reference.
SUMMARY OF THE INVENTION
The invention is a computer implemented method for tracking an object in a 3D scene captured by a plurality of image sensors and inferring the object’s 3D position in relation to the 3D scene by using information acquired from image or images captured by the plurality of image sensors, where each image sensor is associated with a floor plane equation, and the object is tracked in relation to the floor plane defined by this equation.
Optional features include any one or more of the following:
- the method is performed in real-time without requiring any calibration.
- the method further comprises the following steps: detecting the object and/or features of the object from one or more image sensors, calculating a floor plane equation for each image sensor, estimating in relation to the floor plane the coordinates and field of view of each image sensor, estimating a trajectory of the detected object in relation to the floor plane for the one or more image sensors that have detected the object, merging the trajectories and fields of view of each image sensor into one combined trajectory and one combined field of view projected on the floor plane that corresponds to the plurality of image sensors, and mapping the combined trajectory into 3D coordinates.
- the object and/or features of the object are detected using an object feature detection algorithm.
- the detected object is a human and the features include one or more of the following: face, head, head and shoulders, full figure, eyes, lips, ears and hands.
- different detected features are used to estimate body proportions, or to estimate the size of a feature of the object that has not been detected.
- the method further comprises estimating the sizes of one or more features of a human by using anthropometry tables.
- features have one or more of the following metrics: size, angle, type, colour information, temperature and position, and correction coefficients are used to correct any of the metrics if the metrics are below or above pre-defined thresholds.
- the metrics have two dimensional space (2D) parameters and/or three dimensional space (3D) parameters.
- real distances and/or depth information are estimated in relation to each image sensor module.
- the method tracks one or more objects as they move through the different fields of view of each image sensor.
- the method is able to infer whether an object seen by multiple image sensors is the same object.
- the floor plane equation is calculated by taking into account a detected human feature trajectory, such as the head trajectory, and the size of a detected feature of the human, such as the size of the full figure.
- the trajectory of the detected object or of features of the detected object is projected on the floor plane.
- the trajectories are interpolated to have regularly spaced detected points as a function of time.
- a ‘future trajectory’ and a ‘past trajectory’ are estimated for each trajectory detected by each image sensor, wherein the ‘future trajectory’ predicts where the object is going to move next and the ‘past trajectory’ estimates where the object has come from.
- the method further comprises comparing a trajectory from one image sensor to a second trajectory from a second image sensor and deciding to merge both trajectories if they are associated with the same object.
- the field of view of one image sensor is rotated before merging it with the field of view of a second image sensor.
- each image sensor is able to report the 3D position of a detected object on the combined field of view of the plurality of image sensors.
- a track record is generated to achieve a complex analysis of an object's behaviour, wherein the track record holds metadata about the detected object such as speed, acceleration, angle, motion and movement direction.
- the method comprises the step of analysing the trajectories as a function of time in order to describe, classify or infer a detected human's behaviour, intent or needs.
Any one or more of the methods defined above may be implemented using one or more GPUs which provide the computational resources to execute the appropriate steps or algorithms.
Another aspect is a computer vision system that implements any of the methods defined above.
Another aspect is a sensor module that includes an embedded computer-vision engine that implements any of the methods defined above. Such a sensor module can be part of an IoT (Internet of Things) system.
Another aspect is a smart home or office system or other physical or logical environment including one or more computer vision systems that implements the process of any of the methods defined above.
BRIEF DESCRIPTION OF THE FIGURES
Aspects of the invention will now be described, by way of example(s), with reference to the following Figures, which each show features of the invention:
Figure 1 is an image illustrating human elements (an ‘element’ is any sort of detected feature), and elements created by either data analytics or ‘element parameters’ processing.
Figure 2 is a set of images illustrating human, non-human objects and elements created by either data analytics or ‘element parameters’ processing.
Figure 3 is a diagram illustrating the major steps for generating a track record for a detected object.
Figure 4 is a diagram, illustrating a process of predicting human elements or features.
Figure 5 is a diagram schematically illustrating the optical flow (ray diagram) when imaging a human using a pin-hole camera model; the image is projected to a sensor area.
Figure 6 is a diagram illustrating a 2D to 3D conversion procedure.
Figure 7 is a graph showing trajectory detection within a plane.
Figure 8 is a diagram illustrating a camera field of view in the ZY projection and the detection of a trajectory within a plane and estimation of a floor plane.
Figure 9 is a diagram illustrating a process for estimating a floor plane.
Figure 10 is a diagram illustrating a process for estimating a camera position.
Figure 11 is a set of images illustrating an example of a field of view of a camera, the intersection of the field of view with the floor plane, and the area representing the field of view of the camera on the floor plane.
Figure 12 is a set of images illustrating the trajectory of a human on a floor plane and a convex polygon bounding the trajectory.
Figure 13 is a graph illustrating detected points of a human trajectory as a function of time.
Figure 14 is a set of graphs illustrating a trajectory smoothing procedure.
Figure 15 is a set of images illustrating detected points of a human trajectory as a function of time, as well as a prediction of a past trajectory and a future trajectory.
Figure 16 is a diagram illustrating an example of trajectory processing.
Figure 17 is a diagram illustrating a track record (TR) concept that comprises basic and complex events.
Figure 18 is a set of images illustrating a track record processing from multiple sensors.
Figure 19 is a set of images illustrating the estimation of a field of view from multiple sensors.
Figure 20 is a diagram illustrating an example of the process of estimating a field of view from two sensors.
DETAILED DESCRIPTION
A computer-implemented method is provided for tracking an object within a scene captured by a plurality of image sensors and inferring the 3D position of the object within the scene using information acquired from the image or images captured by the plurality of image sensors. The object or features of the object are detected and tracked by one or more of the plurality of image sensors. The trajectories of the object (or features of the object) for each image sensor are estimated and merged together to form one combined trajectory as seen from the plurality of image sensors.
The method may have one or more of the following features:
- The images captured by the plurality of image sensors are 2D images.
- The detected object is represented in relation to its position within the 3D scene.
- A floor plane equation is calculated for each image sensor.
- The coordinates of each image sensor are estimated in relation to the floor plane.
- The field of view of each image sensor projected on the floor plane is estimated.
- The fields of view are merged together to form one combined field of view for the plurality of image sensors.
- Each image sensor may have a different location, lighting condition and field of view.
- Sensors may include one or more of the following: sensors operating in the visible spectrum, sensors operating in the infrared spectrum, thermal sensors, ultrasonic sensors, sensors operating in the non-visible spectrum, and sensors for acceleration or movement detection.
- One or more objects may be tracked as they move through the different fields of view.
- The method is able to infer whether an object seen by multiple image sensors is the same object.
- The method is able to infer the past trajectory of an object.
- The method is able to predict the future trajectory of an object that has disappeared from one sensor.
- The method performs automatic “hand-off” from one image sensor to another.
- The method is performed in real-time without requiring any calibration.
- Real distance and/or depth information is estimated in relation to each image sensor module.
- The floor plane equation is calculated by taking into account a detected human head trajectory and an estimation of the full body size of the detected human.
- Each image sensor module is able to report the position of a detected object on the combined field of view.
- The trajectories are interpolated to have regularly spaced detected points as a function of time.
- A track record is generated to achieve a complex analysis of an object's behaviour (angle, speed, acceleration, movement).
Figure 1 shows an image used to assist in a detailed explanation of the key segments of a human body. The data analytics used to analyse the image is an extended block. One purpose of the data analytics is image analysis and the detection of particular features. These detected features may include the face 102, the head and shoulders 101, the full figure 103, and the hands 104. The face may be referred to as Face, the head and shoulders as HS, the group formed by the face with the head and shoulders as HS-Face, and the full figure as FF. Non-human features may be referred to as NH. Detected features may have one or more of the following metrics: size, angle, type, colour information, temperature, and position. These metrics may have two dimensional space (2D) parameters and/or three dimensional space (3D) parameters. A detected feature may also be referred to as an “element”. “Element parameters” are therefore 2D or 3D parameters that define or relate to the face 102, the head and shoulders 101, or any other detected features.
The human body has well determined relationships between single and multiple elements. The “Ratio Between Elements” RBE may be determined in accordance with Equation (1) as follows:
Vmin ≤ Ek / En ≤ Vmax (1)

where Ek is the value associated with detected element k, En is the value associated with detected element n, Vmin is the minimum ratio value, and Vmax is the maximum ratio value.
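For illustration only, a minimal sketch of how the check of Equation (1) might be implemented is shown below; the element values and the ratio bounds Vmin and Vmax are assumed inputs (for example taken from anthropometry tables) and are not prescribed by this disclosure.

```python
def ratio_between_elements_ok(e_k: float, e_n: float,
                              v_min: float, v_max: float) -> bool:
    """Check whether the ratio of two detected element values lies
    within the expected bounds, as in Equation (1)."""
    if e_n == 0:
        return False  # ratio undefined
    return v_min <= e_k / e_n <= v_max

# Example: head-and-shoulders height vs. face height (assumed values).
print(ratio_between_elements_ok(e_k=120.0, e_n=60.0, v_min=1.6, v_max=2.4))
```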
Figure 2 shows an image including a non-human object and a human ‘object’. The data analytics may produce detections such as a wheel 203, an object length 201 and an object height 204. Particular objects may have well determined relationships between elements, such as the distance between the front and back wheels 202. Additionally, some elements may have well known parameters, such as a wheel diameter 203 or a car width. The ratio between elements may also be determined in accordance with Equation (1). The body proportions of a human 205 may also be estimated by using calculated or known proportions of a non-human object, such as the car width.
Figure 3 shows a flow diagram with an example of the procedure for generating a track record for an object detected by an image sensor. The diagram represents the key steps of the procedure; each block in the diagram is described in detail in the following figures. First, the block 302 detects and analyses human and non-human elements or features. The sizes of the detected objects and of the different features of the detected objects, and their distances from the sensor, are estimated. Some features of the detected object may be missing or may not be detected. Even though parts or features of the detected object are missing from the captured image, their sizes may still be estimated or predicted from the sizes of the features that the system has been able to detect. From the estimated or predicted sizes and distances, the block 303 is able to map 2D coordinates to pseudo-real 3D coordinates. The block 304 then produces an estimation of the floor plane equation in the form a*x + b*y + c*z + d = 0. The position of the image sensor is estimated (305) in relation to the previously estimated floor plane equation. The field of view of the image sensor as projected on the floor plane is also estimated (306). The object trajectory is generated in the block 307, which involves a number of steps such as a smoothing procedure, a procedure for the recovery of lost or missed data and a procedure for the prediction of extra parameters. Finally, a track record is generated in the block 308, enabling the analysis of the detected object and its behaviour.
Figure 4 shows a flow diagram with an example for detecting and predicting an object or feature of an object as shown in the block 302 of Figure 3. In this example, a human is detected and its features are estimated or predicted. First, data or metadata on the detected elements or features are received in the block 402. This may include data or metadata related to HS, Face, FF, HS-Face and/or NH data. Based on the received data or metadata, a number of steps may be performed. If HS data is available 402 and Face data is available 403, then the block 404 is executed in order to predict the size of the Face. If the size of HS is known, then the size of the Face may be estimated by using available anthropometry tables. The size of the Face can be determined, for example, in accordance with Equation (2) as follows:
SFace = SHS / RBEhs_face (2)

where SFace is the Face size, SHS is the HS size and RBEhs_face is the RBE value between HS and Face.
Prediction of the face size may be required in some cases, such as if the HS detection ratio is stronger in comparison with the face detection ratio. A detection ratio in relation to an object or feature of an object may be defined as the strength value of a reported detection, wherein for example, a low value may indicate that the object or feature of an object was not detected. The block 405 calculates the centre of face to avoid strong visual shift of Face with new size. The calculated Face centre may then be used to calculate a new upper left position of Face. The block 418 processes data similarly as the block 406. The block 431 calculates an upper left position of face by using anthropometry tables.
Hence, when an object or feature of an object has been detected, missing features may be estimated or predicted. As a result, a 2D or 3D avatar corresponding to the detected object or detected feature may be automatically generated from the estimated or predicted features.
In addition, the system may also calculate a rotation angle for the head. Face data or metadata may therefore also contain the calculated rotation angle. In some cases, due to the value of the rotation angle, the system may become unstable and this may lead to detection errors, as well as inaccuracies in estimating distances and/or 3D coordinates. Hence, the block 407 checks if the rotation angle value is less than a first threshold (THR1) or higher than a second threshold (THR2). The threshold values may be pre-defined and may depend on a number of parameters, such as the area of the sensor where the human is detected or the position of the image sensor. If the angle is less than THR1 or higher than THR2, then the block 409 is executed. A new size of the Face is then calculated by using an angle correction coefficient, as selected from a pre-defined look-up table or function. The correction coefficient may take into account a number of parameters such as the position of the camera or the location of the face on the sensor area.
In other cases, video analytics may also estimate a Face size slightly bigger than the real size 412. This may also be compensated by another correction coefficient selected from a pre-defined look-up table or function. The block 413 may in this case calculate a new centre for the Face to avoid a strong visual shift of the Face. The new calculated Face centre may then be used to calculate a new upper left position of the Face. The block 414 calculates a new size of the Face by using the correction coefficient, as selected from a look-up table or function.
Following the blocks 409, 431, or 425, if FF is not present 410 and HS is present 411, the block 430 is executed. If the size of HS is known, the size of FF may be estimated or predicted by using available anthropometry tables.
If FF is present 410 and a non-human object is also detected 427, the sizes of features of the detected human, such as the size of FF, may be estimated from the sizes of features of the detected non-human object. As an example, the width of a car may be detected and estimated, and may be correlated to the size of a detected FF 428. If the FF size is known, the sizes of HS and Face may then be estimated by using anthropometry tables 429.
If HS is not detected 402 and Face is also not detected 415, then the size of HS-Face may be used to estimate the Face size 416. The block 417 calculates a Face upper left position by using anthropometry metrics. The blocks 419 to 424 perform similar functions to blocks 412-413, 414 and 407-409. The block 425 calculates the size of HS by using the Face size and anthropometry metrics.
Figure 5 shows a projection of a human body onto a sensor area 504 assuming a pin-hole camera model (D). The Euclidean space is shown comprising a Z-axis 501, X-axis 502, and Y-axis 503. The ZY projection is shown. This projection uses the heights of the Face, HS, FF, and NH.
The relationships between the features of the human and their projected features, represented by line segments, are given in the following equations. The head is represented as a line segment EF, and the projected head on the sensor area corresponds to a line segment BC. Similarly, the full figure is represented as a line segment EG, and the projected full figure on the sensor area is a line segment AC. The relationships between line segments AC, AD, and CD can be determined in accordance with Equation (3) as follows:
(3)
The relationships between line segments DE, EF, BC, and CD can be determined in accordance with Equation (4) as follows:
(4)
The relationships between line segments DG, DE, AD, and CD can be determined in accordance with Equation (5) as follows:
(5)
The relationships between line segments EG, DG, AC, and AD can be determined in accordance with Equation (6) as follows:
(6)
Similar equations are used to determine the relationships between line segments in the XY projection (507 and 508). The following features of the human may be projected: Face, HS, FF, and NH. The 3D coordinates may be given by the different lengths of the line segments. The Z, Y and X coordinates correspond to the line segment lengths EzyGzy, DzyGzy, and ExyGxy respectively.
The sizes of the different features of the body are represented by line segments and may be known or estimated from available anthropometry metrics.
Figure 6 shows a flow diagram of the procedure as depicted in the block 303 in Figure 3. The block 602 converts HS, Face, FF, and NH 2D coordinates into 3D coordinates by using equations (3), (4), (5), and (6). If the equation of the floor plane is ready 603 and the camera position is known 604, then the block 605 is executed. The block 605 is a projection of 3D coordinates to the floor plane or XY plane.
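As a hedged illustration of the kind of 2D-to-3D mapping performed by the block 602 (the exact Equations (3)-(6) are tied to the geometry of Figure 5), the sketch below uses the standard pin-hole similar-triangle relation to estimate depth from a feature of assumed real size, and then back-projects the image point into pseudo-real 3D camera coordinates; the focal length and principal point values are assumptions.

```python
def estimate_depth(real_height_m: float, pixel_height: float,
                   focal_length_px: float) -> float:
    """Pin-hole relation: pixel_height / focal_length = real_height / depth."""
    return focal_length_px * real_height_m / pixel_height

def back_project(u: float, v: float, depth: float,
                 cx: float, cy: float, focal_length_px: float):
    """Map a 2D image point plus estimated depth to 3D camera coordinates."""
    x = (u - cx) * depth / focal_length_px
    y = (v - cy) * depth / focal_length_px
    return x, y, depth

# Example: a full figure assumed 1.7 m tall, spanning 200 px on the sensor,
# with an assumed 1000 px focal length and a 640x480 image.
z = estimate_depth(1.7, 200.0, 1000.0)          # 8.5 m
print(back_project(u=400.0, v=300.0, depth=z, cx=320.0, cy=240.0,
                   focal_length_px=1000.0))
```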
Figure 7 shows the Euclidean space comprising a Z-axis 701, X-axis 702, and Y-axis 703. The set of points represents a trajectory of a detected human as projected on a floor plane 705. The floor may be assumed to be flat.
Figure 8 shows the projection of the trajectory of a detected human as shown in Figure 7 onto the ZY plane. The area between the lines 803 and 804 represents a camera field of view as projected vertically onto the ZY plane. The set of points 805 represents the trajectory of the head of the detected human. The trajectory points are assumed to all lie on the same plane 806, corresponding to the plane 705 shown in Figure 7. By assuming that, for a moving human, the distance from the floor to the head is approximately constant, the real floor plane 808 and its equation may be estimated. The length of the line segment 809 gives the vertical position of the camera. The angle between the camera and the floor plane is shown as 810.
Figure 9 shows the process of estimating the floor plane equation as seen in the block 304 of Figure 3. The block 902 checks if the plane equation is ready. The plane equation may be calculated dynamically. The block 903 checks for the presence of a 3D point. If a 3D point exists, then the point is loaded into an array of 3D points. If the size of the array exceeds a certain threshold 905, then the block 906 is executed. The plane equation may be estimated by using multiple regression models for the array of 3D points. This plane limits the object from the top (at head height), as shown in 806. To estimate the real floor plane equation 808, the distance from the floor to the head, corresponding to the full body size, is estimated 807. The full body size may also be estimated using anthropometry tables.
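A minimal sketch of the plane estimation described above is given below, assuming NumPy: it fits z = a*x + b*y + c to the accumulated head points by least squares (one form of the multiple regression mentioned above) and then shifts the plane by the estimated full body height to obtain the floor plane. The sign of the shift depends on the axis convention and is an assumption.

```python
import numpy as np

def fit_head_plane(points: np.ndarray) -> np.ndarray:
    """Least-squares fit of z = a*x + b*y + c to N x 3 head positions,
    returned as the implicit plane a*x + b*y - z + c = 0."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    (a, b, c), *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return np.array([a, b, -1.0, c])

def floor_plane(head_plane: np.ndarray, body_height: float) -> np.ndarray:
    """Translate the head plane by the full body height along its normal
    to approximate the floor plane a*x + b*y + c*z + d = 0."""
    shifted = head_plane.copy()
    # Sign of the shift depends on the chosen axis convention.
    shifted[3] -= body_height * np.linalg.norm(head_plane[:3])
    return shifted
```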
Figure 10 shows the process of estimating the camera position as shown in block 305 in Figure 3. If the floor plane equation has been estimated 1002 and the camera position has not been estimated 1003, then the block 1004 is executed. The distance between the centre of the camera and the floor plane can be determined in accordance with Equation (7) as follows:
distance = |a*xc + b*yc + c*zc + d| / sqrt(a^2 + b^2 + c^2) (7)
where (xc, yc, zc) are the coordinates of the camera centre and a, b, c and d are the coefficients of the floor plane equation a*x + b*y + c*z + d = 0.
The block 1005 then calculates the angle 810 (Figure 8) between the camera and the floor plane.
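For illustration, Equation (7) and the angle calculation of the block 1005 may be sketched as follows; the camera centre is taken as the origin of the camera frame and the optical axis as +Z, both of which are assumptions about the coordinate convention.

```python
import numpy as np

def camera_height(plane) -> float:
    """Distance from the camera centre (origin) to a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    return abs(d) / float(np.linalg.norm([a, b, c]))

def camera_tilt_deg(plane, optical_axis=(0.0, 0.0, 1.0)) -> float:
    """Angle 810 between the camera optical axis and the floor plane."""
    n = np.asarray(plane[:3], dtype=float)
    axis = np.asarray(optical_axis, dtype=float)
    sin_a = abs(n @ axis) / (np.linalg.norm(n) * np.linalg.norm(axis))
    return float(np.degrees(np.arcsin(sin_a)))
```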
Figure 11 shows an example with a camera Field Of View (FOV). The centre A of the FOV is shown 1101. The horizontal FOV is the area bounded by triangle ABD and triangle ACE. The vertical FOV is the area bounded by triangle ABC and triangle ADE. A detected human 1102 is shown standing on a floor plane 1103. The intersection of the triangle ABD with the floor plane is the line L1. The intersection of the triangle ACE with the floor plane is the line L2. The intersection of the triangle ADE with the floor plane is the line L3. The intersection of the triangle ABC with the floor plane is the line L4. The FOV on the floor plane is a polygon HFGI, corresponding to the intersection of the FOV of the camera with the floor plane.
Figure 12 shows a set of images with a human standing on a floor plane. The trajectories of the human on the floor plane are shown as 1203 and 1205. The areas bounding the different trajectories on the floor plane for both examples are shown as 1202 and 1204. The bounding areas are calculated by searching for the minimum and maximum positions whilst the object is moving during a certain period of time. The size of the bounding area may be small enough to make an assumption about the nature of the object's behaviour. Different examples of behaviour may be one or more of the following: chaotic movements, motion, no motion, small motion, significant motion or acceleration, and periodic movement. The size of the bounding box may be limited by a pre-defined threshold or by a function. The function may have arguments such as the size of the object, the time period and the type of object.
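A minimal sketch of the bounding-area test is shown below; the threshold value is an assumption and would in practice be a function of object size, type and time period as described above.

```python
import numpy as np

def classify_motion(xy: np.ndarray, size_threshold_m: float = 0.5) -> str:
    """Classify an N x 2 floor-plane trajectory from its bounding region."""
    extent = xy.max(axis=0) - xy.min(axis=0)     # bounding-box side lengths
    return "motion" if float(extent.max()) > size_threshold_m else "no motion"

# Example with an assumed 0.5 m threshold.
track = np.array([[0.0, 0.0], [0.1, 0.05], [0.12, 0.08]])
print(classify_motion(track))   # "no motion"
```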
Figure 13 shows a graph with detected positions of a human as a function of time, represented by the axis 1302. The circles 1303 show the times at which the XYZ coordinates of the detected human were detected. The detected coordinates may be irregularly spaced as a function of time. The irregularity may come from, for example, the image sensor, electronic components, software, middleware or firmware. However, the coordinates 1303 may be used to recalculate a new set of coordinates, as shown by the triangles. The new coordinates may be calculated using a number of interpolation techniques, such as linear interpolation or polynomial interpolation.
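The recalculation of regularly spaced points may, for example, be sketched with linear interpolation as below (NumPy); the sampling step is an assumed parameter.

```python
import numpy as np

def resample(t: np.ndarray, xyz: np.ndarray, step_s: float = 0.1):
    """Interpolate irregularly timed detections (t, N x 3 coordinates)
    onto a regular time grid."""
    t_reg = np.arange(t[0], t[-1], step_s)
    xyz_reg = np.column_stack(
        [np.interp(t_reg, t, xyz[:, k]) for k in range(xyz.shape[1])])
    return t_reg, xyz_reg
```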
Figure 14 shows the acceleration (axis 1401) of a detected object as a function of time (axis 1402). A limit for the acceleration 1405 may be set depending, for example, on the detected object, its size and/or type. The limit may also depend on time. For example, the maximum acceleration for a human may be set to 1.9 m/s². If the acceleration is above the set limit, this may correspond to a false detection by the data analytics, or the position or sizes of the object may not have been estimated correctly. Hence the points located above the acceleration limit, such as the point 1404, may be rejected from the overall object trajectory, as shown in the new acceleration trajectory 1408. A new point 1410 has been interpolated from the values of the neighbouring points of the rejected point 1409. The object trajectory may also be smoothed.
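For illustration, the rejection of points whose acceleration exceeds the limit may be sketched as follows; the finite-difference estimation of speed and acceleration and the re-interpolation of rejected points are one possible realisation, not the only one.

```python
import numpy as np

def reject_outliers(t: np.ndarray, xyz: np.ndarray, a_max: float = 1.9):
    """Drop points whose acceleration magnitude exceeds a_max (m/s^2)
    and re-interpolate them from the remaining neighbours."""
    v = np.gradient(xyz, t, axis=0)     # speed by finite differences
    a = np.gradient(v, t, axis=0)       # acceleration by finite differences
    bad = np.linalg.norm(a, axis=1) > a_max
    cleaned = xyz.copy()
    for k in range(xyz.shape[1]):
        cleaned[bad, k] = np.interp(t[bad], t[~bad], xyz[~bad, k])
    return cleaned
```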
Figure 15 shows the trajectory 1503 of a detected human in the time domain 1502. The triangles correspond to the known detected points as a function of time, with the first known detected point being B, and the last detected point being C. From the known detected points, it may be possible to also predict points for the time period corresponding to a time earlier than point B (‘Past’), and for the time period corresponding to a time later than point C (‘Future’). Hence the trajectory of the human in the past (between point A and B), and in the future (between point C and D) may be predicted.
The human trajectory 1503 is projected onto an XY plane. A regression or polynomial approximation may be used to estimate the equation for the line segment 1511. The line segment may then be used to estimate trajectory points depicted as circles 1512. The estimation of the trajectory points may be done, for example, by interpolation in the XY or XYZ domain and/or by using acceleration/speed values. The area shown as 1513 represents a tolerance region within which trajectory points 1512 may exist. Similarly, the trajectory of the human in the past 1516 may also be estimated.
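One way to sketch the prediction of the segments AB (past) and CD (future) is a low-order polynomial fit of the observed floor-plane trajectory followed by extrapolation, as below; the polynomial degree and horizon are assumed parameters, and a tolerance region could be derived from the fit residuals.

```python
import numpy as np

def extrapolate(t: np.ndarray, xy: np.ndarray, horizon_s: float,
                degree: int = 2, n_points: int = 10):
    """Extrapolate an N x 2 trajectory horizon_s seconds beyond its last
    point (use a negative horizon_s to estimate the past trajectory)."""
    if horizon_s >= 0:
        t_new = np.linspace(t[-1], t[-1] + horizon_s, n_points)
    else:
        t_new = np.linspace(t[0] + horizon_s, t[0], n_points)
    pred = np.column_stack(
        [np.polyval(np.polyfit(t, xy[:, k], degree), t_new)
         for k in range(xy.shape[1])])
    return t_new, pred
```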
Figure 16 shows a flow diagram for processing the object trajectory. The block 1602 estimates the bounding region for the trajectory as shown in Figure 12. If the bounding region size is less than a pre-defined threshold 1603, then the block 1614 is executed. The block 1614 may output a flag “trajectory has no motion” which may be used by the track generator procedure. The block 1604 processes the trajectory in the time domain in order to remove any irregular data points as shown in Figure 13. The block 1605 estimates the trajectory equation as a polynomial or set of polynomials. The block 1606 estimates the speed of the trajectory in the time domain by, for example, using the derivative of the trajectory equation. The block 1607 estimates the acceleration of the trajectory in the time domain by, for example, using the derivative of the trajectory speed equation. The block 1608 recalculates the trajectory by using a smoothing procedure and by removing outliers with acceleration values higher than a pre-set limit, as described previously in Figure 14. The block 1609 may predict a past trajectory as shown by the segment AB in Figure 15. The block 1610 may calculate a tolerance region for a certain period of time in the past 1517. The block 1611 may predict a future trajectory as shown by the segment CD in Figure 15. The block 1612 may calculate a tolerance region for a certain period of time in the future 1513. The block 1613 may output a flag “trajectory has motion” which may be used by the track generator procedure.
Figure 17 shows an example of a Track Record (TR) 1701 comprising a number of events. A track record for an object is a per-object record of events relating to a detected object, which describes a track activity. The track activity may include parameters such as motion, rotation, area or volume change. A track record may be created in real-time. For example: if the track position has changed, the event “MOTION” may be created with one of the following parameters: motion value in percentage, direction of motion, angle of motion in either a polar or spherical coordinate system. If the track area has changed, the event “AREA CHANGED” may be created with one of the following parameters: value in percentage, name of the new area. If the track angle has changed, the event “ANGLE CHANGED” may be created and associated with a parameter such as the angle in a polar or spherical coordinate system. These examples of events may be included in the basic event layer 1702. Complex events may be created from an analysis of basic events. Complex events may be formed from a sequence of basic events. As an example, a sequence of “AREA CHANGED” events for a detected object may be interpreted as the complex event of the object “APPROACHING” or “LEAVING”. Complex events may be included in the complex event layer 1703. Complex events thus describe, classify or infer a detected object's behaviour. A TR may also include additional information describing one or more of the following: colour, texture, shape.
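The layering of basic and complex events may be sketched as below; the event names follow the examples above, while the 'distance_m' parameter and the rule for inferring APPROACHING are hypothetical illustrations only.

```python
from dataclasses import dataclass, field

@dataclass
class TrackRecord:
    events: list = field(default_factory=list)   # basic event layer

    def add_basic(self, name: str, **params):
        self.events.append({"event": name, **params})

    def infer_complex(self):
        """Illustrative rule only: a run of AREA_CHANGED events with a
        shrinking (hypothetical) distance parameter reads as APPROACHING."""
        d = [e["distance_m"] for e in self.events
             if e["event"] == "AREA_CHANGED" and "distance_m" in e]
        if len(d) >= 2 and all(b < a for a, b in zip(d, d[1:])):
            return "APPROACHING"
        return None

tr = TrackRecord()
tr.add_basic("MOTION", motion_pct=12, direction_deg=45)
tr.add_basic("AREA_CHANGED", distance_m=5.0)
tr.add_basic("AREA_CHANGED", distance_m=3.2)
print(tr.infer_complex())    # APPROACHING
```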
Figure 18 shows a detected human 1801 captured by a first image sensor and its trajectory projection onto the XY plane 1802 and 1803. The field of view on the floor plane for the first image sensor is shown as 1807. The human trajectory as seen from the first sensor is 1804. The first sensor may have a number of tolerance regions (such as future region 1805 and past region 1804) and TR data corresponding to the detected human. The field of view on the floor plane for a second image sensor is shown as 1811. The human trajectory as seen from the second sensor is 1808. The tolerance regions from the second sensor are shown as 1809 (past region) and 1810 (future region). The trajectories 1804 and 1808 as seen from the first and second image sensors may be either connected, disconnected or partially intersecting each other for one or more time periods. For example, 1812 shows a time period in which both trajectories are similar. The amount of time in which the trajectories are similar may vary. In some cases, the tolerance area of the future region of one image sensor may intersect the tolerance area of the past region of another image sensor. The trajectory 1808 and its supporting tolerance regions may be altered such that the regions for both trajectories match each other. In order to achieve a specific precision, a number of rotations may be performed. In the case that trajectories from the different image sensors intersect each other in the same time period, the TR data of 1804 and 1808 may be compared. If the sets of metrics and events match, the similarity of the trajectories may be analysed, and both trajectories may therefore be combined into a single trajectory. If both trajectories do not intersect or match each other, but during the same time period the trajectory of the past region of one image sensor intersects or matches the trajectory of the future region of another image sensor, then the trajectories may be compared. The speed, acceleration and size of the intersection may be taken into account. The comparison may be achieved by using complex functions of cross-related thresholds or a look-up table. If the merging of the trajectories (1804 and 1808) from the two different image sensors achieves a positive result, then the field of view of one image sensor may be altered. As an example, 1811 may have a new position and orientation and may be rotated along with the corresponding trajectory 1808 of the image sensor. Figure 18 shows an example of the procedure for merging the fields of view of two image sensors such that automatic hand-off is achieved between one image sensor and another image sensor. However, the procedure can be extended to any number of image sensors.
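A simplified sketch of the decision to merge two trajectories that overlap in time is shown below; in the described method the comparison would additionally use TR metadata (speed, acceleration, events) and cross-related thresholds or look-up tables, which are omitted here, and the numeric thresholds are assumptions.

```python
import numpy as np

def trajectories_match(t1, xy1, t2, xy2,
                       dist_threshold_m=0.5, min_overlap_s=1.0) -> bool:
    """Compare two N x 2 floor-plane trajectories over their common time
    window and decide whether they plausibly belong to the same object."""
    start, end = max(t1[0], t2[0]), min(t1[-1], t2[-1])
    if end - start < min_overlap_s:
        return False                         # not enough common time period
    t_common = np.linspace(start, end, 50)
    p1 = np.column_stack([np.interp(t_common, t1, xy1[:, k]) for k in range(2)])
    p2 = np.column_stack([np.interp(t_common, t2, xy2[:, k]) for k in range(2)])
    return float(np.mean(np.linalg.norm(p1 - p2, axis=1))) < dist_threshold_m
```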
Figure 19 shows a human 1901, its trajectory 1902 and the floor field of view 1903 as seen from a first sensor. The trajectory 1904 and floor field of view 1905 as seen from a second sensor are also shown. The trajectory 1906 and floor field of view 1907 as seen from a third sensor are also shown. After TR processing is performed, the trajectories 1902, 1904, and 1906 may be unified into a single trajectory. The fields of view 1905 and 1907 may be rotated. The resulting field of view is a complex polygon 1908. The resulting field of view can be calculated for 3D space as well by using the same procedures. This procedure may also be extended to any number of image sensors.
Figure 20 shows an example of a flow diagram for merging the trajectories and fields of view of two image sensors. The procedure may be extended to any number of image sensors. The track records from both image sensors are processed in the block 2017. This comprises a step for analysing and merging the trajectories. The fields of view may also be rotated. The block 2018 performs the procedure of merging the fields of view.
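As one possible way to obtain the rotation applied to the second sensor's trajectory and field of view, the sketch below estimates a rigid 2D transform (rotation plus translation) that maps matched trajectory samples of the second sensor onto those of the first, using a singular value decomposition; the described method instead performs a number of trial rotations to a given precision, so this is an alternative illustration rather than the claimed procedure.

```python
import numpy as np

def align_sensor(p_ref: np.ndarray, p_other: np.ndarray):
    """Rigid 2D alignment of matched N x 2 trajectory samples.
    Returns (R, t) such that R @ point + t maps the second sensor's
    floor-plane coordinates into the reference sensor's frame."""
    c_ref, c_other = p_ref.mean(axis=0), p_other.mean(axis=0)
    H = (p_other - c_other).T @ (p_ref - c_ref)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # enforce a proper rotation (no reflection)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_ref - R @ c_other
    return R, t
```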
Image sensor module. An image sensor may comprise a module implementing the methods as described above. The image sensor module may receive a video stream and analyse the video on a frame-by-frame basis, and may subsequently report the presence of an object along with additional information on the object such as estimated position and distance of the object from the sensor and/or other attributes of the object. The trajectory of the object may also be analysed and reported. The sensor module may not stream video to another device. The sensor module may be a SoC that includes a GPU; the GPU may itself be programmed to implement some or all of the methods described above.
Examples of applications for video based tracking from multiple cameras are wide and may include surveillance systems, gesture analysis or computer animation systems, as well as scene interpretation, such as detecting, tracking and counting humans or other objects in an office, shop, or on a street.
Note
It is to be understood that the above-referenced arrangements are only illustrative of the application for the principles of the present invention. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the present invention. While the present invention has been shown in the drawings and fully described above with particularity and detail in connection with what is presently deemed to be the most practical and preferred example(s) of the invention, it will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of the invention as set forth herein.
Claims (25)
1. A computer implemented method for tracking an object in a 3D scene captured by a plurality of image sensors and inferring the object's 3D position in relation to the 3D scene by using information acquired from image or images captured by the plurality of image sensors, where each image sensor is associated with a floor plane equation, and the object is tracked in relation to the floor plane defined by this equation.
2. The method of Claim 1, in which the method is performed in real-time without requiring any calibration.
3. The method of any of Claims 1-2, further comprising the following steps: (i) detecting the object and/or features of the object from one or more image sensors, (ii) calculating a floor plane equation for each image sensor, (iii) estimating in relation to the floor plane: coordinates and field of view for each image sensor, (iv) estimating a trajectory of the detected object in relation to the floor plane for the one or more image sensors that have detected the object, (v) merging the trajectories and fields of view of each image sensor into one combined trajectory and one combined field of view projected on the floor plane that corresponds to the plurality of image sensors, and (vi) mapping the combined trajectory into 3D coordinates.
4. The method of any preceding Claim, in which the object and/or features of the object are detected using an object feature detection algorithm.
5. The method of any preceding Claim, in which the detected object is a human and the features include one or more of the following: face, head, head and shoulders, full figure, eyes, lips, ears and hands.
6. The method of any preceding Claim, in which different detected features of an object are used to estimate body proportions or used to estimate the size of a feature of the object that has not been detected.
7. The method of any preceding Claim, in which the method further comprises estimating the sizes of one or more features of a human by using anthropometry tables.
8. The method of any preceding Claim, in which features of an object have one or more of the following metrics: size, angle, type, colour information, temperature and position, and in which correction coefficients are used to correct any of the metrics if the metrics are below or above pre-defined thresholds.
9. The method of Claim 8, in which the metrics have two dimensional space (2D) parameters and/or three dimensional space (3D) parameters.
10. The method of any preceding Claim, in which real distances and/or depth information are estimated in relation to each image sensor module.
11. The method of any preceding Claim, in which the method tracks one or more objects as the one or more objects move through the different field of views of each image sensor.
12. The method of any preceding Claim, in which the method is able to infer if an object seen by multiple image sensors, is the same object.
13. The method of any preceding Claim, in which the floor plane equation is calculated by taking into account a detected human feature trajectory, such as the head trajectory, and the size of a detected feature of the human, such as the size of the full figure.
14. The method of any preceding Claim, in which the trajectory of the detected object or features of the detected object is projected on the floor plane.
15. The method of any preceding Claim, in which trajectories of the detected object or features of the detected object are interpolated to have regular detected points as a function of time.
16. The method of any preceding Claim, in which a ‘future trajectory’ and ‘past trajectory’ are estimated for each trajectory detected by each image sensor, wherein the ‘future trajectory’ predicts where the object is going to move next and the ‘past trajectory’ estimates where the object comes from.
17. The method of any preceding Claim, further comprising comparing a trajectory from an image sensor to a second trajectory from a second image sensor and making a decision of merging both trajectories if they are associated to the same object.
18. The method of any preceding Claim, in which the field of view of one image sensor is rotated before merging it with the field of view of a second image sensor.
19. The method of any preceding Claim, in which each image sensor is able to report the position in 3D of a detected object on the combined field of view of the plurality of image sensors.
20. The method of any preceding Claim, in which a track record is generated to achieve a complex analysis of an object behaviour, wherein the track record holds metadata about the detected object such as speed, acceleration, angle, motion and movement direction.
21. The method of any preceding Claim, in which the method comprises the step of analysing trajectories of the detected object or features of the detected object as a function of time in order to describe, classify or infer a detected human’s behaviour or intent or needs.
22. The method of any preceding Claim in which one or more GPUs provide the computational resources to execute the algorithms.
23. A computer vision system that implements any of the methods defined above.
24. A sensor module that includes an embedded computer-vision engine that implements any of the methods of Claims 1-22.
25. A smart home or office system or other physical or logical environment including one or more computer vision systems that implements the process of any of the methods of Claims 1-22.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GBGB1604535.3A GB201604535D0 (en) | 2016-03-17 | 2016-03-17 | 2D to 3D |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201704261D0 GB201704261D0 (en) | 2017-05-03 |
GB2551239A true GB2551239A (en) | 2017-12-13 |
Family
ID=55968467
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GBGB1604535.3A Ceased GB201604535D0 (en) | 2016-03-17 | 2016-03-17 | 2D to 3D |
GB1704261.5A Withdrawn GB2551239A (en) | 2016-03-17 | 2017-03-17 | A computer implemented method for tracking an object in a 3D scene |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GBGB1604535.3A Ceased GB201604535D0 (en) | 2016-03-17 | 2016-03-17 | 2D to 3D |
Country Status (2)
Country | Link |
---|---|
GB (2) | GB201604535D0 (en) |
WO (1) | WO2017158167A2 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133197B (en) * | 2018-01-05 | 2021-02-05 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating information |
CN110717466B (en) * | 2019-10-15 | 2023-06-20 | 中国电建集团成都勘测设计研究院有限公司 | Method for returning to position of safety helmet based on face detection frame |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007044044A2 (en) * | 2004-12-21 | 2007-04-19 | Sarnoff Corporation | Method and apparatus for tracking objects over a wide area using a network of stereo sensors |
US20120327220A1 (en) * | 2011-05-31 | 2012-12-27 | Canon Kabushiki Kaisha | Multi-view alignment based on fixed-scale ground plane rectification |
GB2545658A (en) * | 2015-12-18 | 2017-06-28 | Canon Kk | Methods, devices and computer programs for tracking targets using independent tracking modules associated with cameras |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1862969A1 (en) * | 2006-06-02 | 2007-12-05 | Eidgenössische Technische Hochschule Zürich | Method and system for generating a representation of a dynamically changing 3D scene |
CA2934102A1 (en) * | 2015-06-25 | 2016-12-25 | Appropolis Inc. | A system and a method for tracking mobile objects using cameras and tag devices |
- 2016-03-17: GB GBGB1604535.3A patent/GB201604535D0/en not_active Ceased
- 2017-03-17: GB GB1704261.5A patent/GB2551239A/en not_active Withdrawn
- 2017-03-17: WO PCT/EP2017/056406 patent/WO2017158167A2/en active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007044044A2 (en) * | 2004-12-21 | 2007-04-19 | Sarnoff Corporation | Method and apparatus for tracking objects over a wide area using a network of stereo sensors |
US20120327220A1 (en) * | 2011-05-31 | 2012-12-27 | Canon Kabushiki Kaisha | Multi-view alignment based on fixed-scale ground plane rectification |
GB2545658A (en) * | 2015-12-18 | 2017-06-28 | Canon Kk | Methods, devices and computer programs for tracking targets using independent tracking modules associated with cameras |
Also Published As
Publication number | Publication date |
---|---|
GB201604535D0 (en) | 2016-05-04 |
WO2017158167A3 (en) | 2017-12-14 |
GB201704261D0 (en) | 2017-05-03 |
WO2017158167A2 (en) | 2017-09-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |