WO2014154839A1

WO2014154839A1 - High-definition 3d camera device

Info

Publication number: WO2014154839A1
Application number: PCT/EP2014/056225
Authority: WO
Inventors: Ravi Tej TADI; Leandre Bolomey; Davide MANETTI; Sébastien Lasserre
Original assignee: Mindmaze S.A.
Priority date: 2013-03-27
Filing date: 2014-03-27
Publication date: 2014-10-02

Abstract

High definition 3D camera device comprising at least one image and depth sensors unit comprising at least one depth sensor and at least two image sensors for stereoscopic imaging, wherein the 3D camera device further comprises a first object detection unit for recognizing and detecting the position and orientation of at least one target object using a low-resolution depth map of a first resolution, the low-resolution depth map being generated from data obtained from the depth sensor; a tracking control unit for driving the image sensors in order to track a region of interest (ROI) of the target object on each image of the image sensors, the ROI being determined from the position and orientation of the target object given by the first object detection unit; a depth map generation unit for generating a high-resolution depth map of the ROI of a second resolution higher than the first resolution, wherein the high-resolution depth map is generated from data obtained from the image sensors; and a second object detection unit for recognizing and detecting the position and orientation of the target object from the high-resolution depth map. Corresponding method for video tracking objects.

Description

High-definition 3D Camera Device Field of the invention

The present invention relates to a high-definition 3D camera device, and to a related method, that allows tracking objects at near and far distances as well as tracking at the same time large and small objects. The present invention relates in particular to a high-definition 3D camera system and to a related method that allows real-time video tracking on an embedded system like a pad, mobile phone, notebook or any other mobile device, and/or in an automotive environment. Description of related art

Stereoscopy, or 3D imaging, encompasses all techniques to create illusion of depth from two-dimensional pictures. The history of stereoscopy started almost at the same time as that of photography. The basic principle of stereoscopy relies on the fact that the human perception of depth is done in the brain from slightly different plane images perceived by each eye. This principle has been transposed to stereoscopic apparatus comprising for example two synchronized image sensors capturing each a plane image, which are then each exposed exclusively to one eye of the observer, so that the observer can perceive an illusion of depth. As illustrated in Figure 14, the objects appearing in front of the display 16 belong to the negative parallax space whereas the objects perceived behind the display 16 belong to the positive parallax space.

The stereoscopy principle is very efficient. However, the discomfort and user fatigue is probably a major issue for its large scale development. The sources of eyestrain are multiple. High visual disparity, a vertical shift (or vertical parallax), a keystone distortion, or a bad synchronization between the two images are fateful for the user comfort. One crucial aspect to consider while doing the stereoscopy is to prevent any deformation of the picture coming from the compression distortion of the minimum and maximum depth into the minimum and maximum parallax space. This condition is called orthoscopic condition. Two stereoscopic models are traditionally used to capture stereoscopic pairs of images: the toed-in and the parallel model.

According to the toed-in model illustrated in Fig. 15, the image sensors are rotated into the yaw axis to converge their respective optical axis on the target object 13. The convergence angle a removes any horizontal parallax on the target object 13. The target object 13 will thus appear in the stereo window at the level of the screen. However, the convergence angle a introduces a horizontal parallax on any peripheral object. The distortion visible on the peripheral image portions increases with the value of the convergence angle a. Therefore, the convergence angle a in a toed-in model should not exceed a maximal angle value traditionally equal to two degrees.

According to the parallel model, both image sensors look straight ahead and aim at infinity objects. To avoid the stereo window to be at the infinity, a convergence correction is done by cropping off the insides of both images, which reduces the horizontal image size. The parallel image sensor configuration respects the orthoscopic condition by maintaining linearity during conversion from the real space to the stereoscopic images, whereas the toed-in image sensor configuration cannot maintain linearity during the conversions from real space to stereoscopic images and introduces additional vertical disparity. The toed-in model is in return more suitable for macro stereoscopy where the target object is well centered and where the peripheral image portions are less important. Both stereoscopic models require post-processing: with the toed-in model, the vertical disparity must be corrected by adjusting horizontal perspective on each image, and the parallel model must crop and operate a HIT (Horizontal Image Translation) to set the desired stereo window on the display surface. In addition to the post-processing operations, the stereoscopic models require massive amount of parameters to adjust the base distance between image sensors, the focal length of the sensors, and/or their orientation a. Some parameters depend from the chosen hardware while some other crucial parameters may vary with the scene to capture and with the stereoscopic effect that the user wants to apply on the target object. From a scene to be captured, the model necessitates the distance to the nearest and to the furthest object and the distance to the target object. The stereoscopic effect settings, for example null, positive, or negative parallax value, chosen by the user is required to define the position of the virtual screen in the scene. The zoom factor set for example by the user additionally modifies the adjustments for the image sensors.

Furthermore, traditional stereoscopic apparatus let a user appreciate and measure the distances in the captured scene, but doesn't provide any intuitive interface to adjust the stereoscopic effect to the target object and/or to modify the zoom factor on for example a region of interest of the captured scene. They often lack electronic support to determine the useful characteristic of the scene (distances to the nearest object, to the furthest object and/or to the target object) as well as meaningful interface to adjust the stereoscopic effect on the target object.

Video tracking is the process of locating one or more moving objects over time using a camera, for example a 3D camera. It may for example, but not exclusively, be used for human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, medical imaging and video editing, etc. Video tracking may require high computing resources due to the amount of data that is contained in video signals, in particular in high definition and/or 3D video signals, that must be processed in order to detect and follow the objects.

The objective of video tracking is to associate one or more target objects in consecutive video frames. 3D video tracking additionally allows determining depth information of the tracked objects, and thus their movements in all three orthogonal directions. Sequential video frames, for example sequential 3D video frames, are analyzed by an algorithm that outputs the movement of the target objects between the frames. Precise 3D object tracking involves the ability to track an object at near and far distances as well as tracking at the same time large and small objects. A precise 3D object tracking device, or apparatus, thus requires image data of the tracked objects in a sufficiently high resolution for all situations, i.e. for when the object is near or far and for big and small objects. Precise multiple object tracking systems or devices are for example used for body motion analysis. Accordingly, for example a skeleton of a body and its movements are reconstructed by tracking the body joint positions and movements. The motion analysis can also be performed on an object that a body is holding. Conventional high precision skeleton tracking systems have several disadvantages. They require a special preparation by placing markers on the body joints, which will then be used for the joint detection. The markers must always be visible during tracking, thus requiring multiple high-definition cameras to be placed at different positions around the area of interest. Such systems or devices further require a very large processing system in order to process the image data. This kind of system is thus very costly. For marker-less tracking systems, the achievable precision of 3D object tracking depends on the resolution of the corresponding depth map.

Conventional marker-less depth sensors made for consumer application have a low resolution, typically 320x240 pixels and lower. This low resolution applied for example to an entire body allows the detection of the main body joints and/or parts, like the head, shoulders, elbows, hands, knees and feet. Low resolution however forbids the skeleton reconstruction of the whole body, in particular of details such as the finger for example. It doesn't either allow the detection and tracking of relatively small objects held by a person, such as for example a golf club, a tennis racket, etc.

A solution for obtaining a high-resolution depth map is to use a high- resolution stereoscopic camera and a processing system for computing the depth map from a stereoscopic pair of images. The main disadvantage of this method is the required processing power necessary for computing the depth map. A high- resolution depth map cannot be computed in real-time with a conventional processing system. For a real-time processing, the resolution of the depth map must be scaled down to that of a conventional depth sensor, for example 320x240 pixels.

Some high-resolution depth camera devices with a combination of depth sensor and stereoscopic camera have been described in US 201 1/0149031 and in Eun-Kyung Lee; Sung-Yeol Kim; Young-Ki Jung; Yo-Sung Ho, "High- Resolution Depth Map Generation by Applying Stereo Matching Based on Initial Depth Information", 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2008 , vol., no., pp.201 -204, 28-30 May 2008. This kind of devices provides a high-resolution depth map. But even though the processing requirements are lowered, the maximum frame rate is not sufficient for high resolution motion tracking. Such devices and methods are furthermore not applicable to embedded systems like mobile phones, pads, notebooks or others. Brief summary of the invention

An aim of the present invention is thus to provide a high definition 3D camera device and a related precise 3D video tracking method that allow efficient real-time tracking with limited computing resources. Another aim of the present invention is to provide an affordable high definition 3D camera device for efficiently tracking moving objects in three dimensions.

These aims and other advantages are achieved with the high definition 3D camera device and with the video tracking method according to the corresponding independent claims.

These aims and other advantages are achieved in particular by a high definition 3D camera device comprising at least one image and depth sensors unit comprising at least one depth sensor and at least two image sensors for stereoscopic imaging, wherein the 3D camera device further comprises a first object detection unit for recognizing and detecting the position and orientation of at least one target object using a low-resolution depth map of a first resolution, the low-resolution depth map being generated from data obtained from the depth sensor; a tracking control unit for driving the image sensors in order to track a region of interest (ROI) of the target object on each image of the image sensors, the ROI being determined from the position and orientation of the target object given by the first object detection unit; a depth map generation unit for generating a high-resolution depth map of the ROI of a second resolution higher than the first resolution, wherein the high-resolution depth map is generated from data obtained from the image sensors; and a second object detection unit for recognizing and detecting the position and orientation of the target object from the high-resolution depth map. In embodiments, the high definition 3D camera device further comprises an object merger unit for merging position and orientation information of the target object from the first object detection unit and from the second object detection unit, and for producing a high-definition 3D image of the target object from the merged information.

In embodiments, the tracking control unit is configured for driving an image cropping unit by determining multiple regions of interest (ROI) in each image from each image sensors.

In embodiments, the first object detection unit and the second object detection unit are configured for each tracking different objects.

In embodiments, the high definition 3D camera device further comprises a user interface unit for allowing a user to select the at least one target object of the first object detection unit and/or of the second object detection unit.

In embodiments, the high definition 3D camera device further comprises means to translate at least one image sensor along an axis and/or means to pivot at least one image sensor, wherein the linear position and/or the pivoting angle of the at least one image sensor are controlled by the tracking control unit. The linear position and/or pivoting angle of are for example determined to obtain a defined stereoscopic effect and/or a defined zoom effect. The defined stereoscopic effect and/or a defined zoom effect for the at least one target object is set through a user interface unit. The disparity range for the stereoscopic effect for the at least one target object is for example set through a user interface unit.

In embodiments, the high definition 3D camera device comprises a first and a second image and depth sensors units having each at least one depth sensor and at least two image sensors for stereoscopic imaging. the high definition 3D camera device for example comprises means to translate at least one image sensor of the first image and depth sensors unit along a first axis; and at least one image sensor of the second image and depth sensors unit along a second axis. The first axis and the second axis are for example parallel to each other. Alternatively, the first axis and the second axis are oblique or perpendicular to each other.

These aims and other advantages are also achieved in particular by a method for video tracking objects, comprising the steps of recognizing and detecting the position and orientation of at least one target object using a low- resolution depth map of a first resolution; determining a Region of Interest (ROI) of the target object; generating a high-resolution depth map of the ROI of a second resolution higher than said first resolution; and recognizing and detecting the position and orientation of the target object from the high-resolution depth map. In embodiments, the method further comprises the step of merging position and orientation information from the low resolution depth map and from the high resolution depth map.

The low-resolution depth map is for example generated from data obtained from a depth sensor. The high-resolution depth map is for example generated from data obtained by two image sensors.

In embodiments, the method further comprises the step of driving each of image sensor in order to track the region of interest (ROI) on each image of the two image sensors. According to the invention, a 3D tracking control unit tracks objects on a low resolution depth map. For one or more selected objects, the tracking control unit grabs details of said tracked objects on a high-resolution depth map. Thus the tracking control unit selects only the data useful for the desired object tracking. The frame rate is maximized to allow high velocity movement analysis with an embedded system.

The first 3D object detection unit detects one or more predefined objects or user selected objects.

The second 3D object detection unit detects one or more predefined objects, user selected objects or objects defined by the tracking control unit.

The tracking control unit controls the region of interests corresponding to the one or more objects being tracked on the image sensors. Thanks to this control, the amount of data generated by the image sensors is reduced to only the data useful for precise object tracking. Thus the depth map generation unit processes only the data of the region of interest.

The tracking control unit determines the region of interest from the position given by the first 3D object detecting unit or from the position given by the second object detecting unit.

The merger unit collects the positions of the one or more tracked objects from both object detecting units and generates a high-definition 3D representation of the one or more tracked objects.

According to embodiments of the invention, a tracking control unit driving a high-definition 3D camera device is disclosed. The tracking control unit for example allows real-time video tracking of one or more objects on a mobile device. According to one aspect of the invention, the tracking control unit uses a plurality of object detecting units to drive a hybrid 3D camera device comprising at least one depth sensor and a multi-view camera, in order to increase the frame rate and depth map definition. The invention allows small to large object detection as well as near to far object detection. One embodiment describes the tracking control unit used for precise skeleton tracking including the main joint like head, shoulder, elbow, hand and details such as finger.

Advantages of the present invention comprise:

• Ability to simultaneously track far and near objects in the visual field. · Real time tracking and recognition of multiple objects in the scene and extraction of the position and orientation.

• Dynamic User Input: users may determine the objects to be tracked in the visual scene.

• Tracking with high precision of intricate movements (fingers) and larger movements (arms & legs)

Applications of the present invention comprise for example, but not exclusively:

• Motor rehabilitation of upper limbs and whole body movements, for example after a brain injury/trauma/stroke/sports injuries and/or amputation: the device of the invention enables real-time capturing and mapping of their movements to the 3D avatars for therapy. Hybrid 3D imaging platform for mobile computing devices: enables precise dynamic user based object tracking, gesture recognition/inputs and 3D rendering to embed into portable devices, for example mobile phones, laptops, touchpads, etc.

• Real time error detection of limb movements, for example in game and/or sports training, for example golf: real time feedback to the player on the swing trajectory and movement metrics.

Tracking users in an automotive environment to provide real time feedback on body posture and movement including head and eye movements and physiological signatures (ECG).

In embodiments, the tracking control unit controls the region of interest (ROI) by means of an image cropping unit instead of controlling image sensors' ROI. This allows for example controlling multiple ROIs for image sensors that do not provide this functionality. The image cropping unit receives the full image from the image sensor and outputs only the image data corresponding to the selected

ROI.

According to embodiments of the invention, the high definition 3D camera device comprises means to individually translate and pivot each image and/or depth sensor. The sensors can for example be translated along an axis. The position and pivoting angle of each image and/or depth sensor are preferably controlled by the tracking control unit.

In embodiments, these positions and pivoting angles are determined in order to obtain a defined stereoscopic effect that is for example set or defined by the user through a user interface. According to other embodiments of the invention, the device comprises at least two image and depth sensors units, each comprising at least one depth sensor and at least two shutter-synchronized image sensors for stereoscopic imaging. The translation axis of one sensors unit could be superposed, oblique or perpendicular with the translation axis of another sensors unit, forming multiple spatial arrangements. These arrangements enlarge the ranges as well as the angles-of-use of the video tracking up to 360°. Advantageous arrangements are

• two sensors units disposed on parallel axis, wherein the field-of-views of the two sensors units is mutually opposed; · three sensors units whose translation axis form the sides of a virtual triangle;

• four sensors units whose translation axis form the sides of a virtual square;

• six sensors units whose translation axis form the sides of a virtual hexagon.

• or any other number of sensors units whose translation axis form a regular or irregular polygon having a corresponding number of sides, wherein each sensors unit is oriented with its field of view outside or inside the polygon. Brief Description of the Drawings

The present invention will be better understood by reading the following description illustrated by the figures, where: Fig. 1 is a schematic view of a 3D camera device according to an embodiment of the invention;

Fig. 2 is a flow chart illustrating the operation of a tracking control unit according to an embodiment of the invention when tracking a single object;

Fig. 3 is a flow chart illustrating the operation of a tracking control unit according to an embodiment of the invention when tracking multiple objects;

Fig. 4 schematically illustrates the implementation of a 3D tracking control unit according to an embodiment of the invention;

Fig. 5 is a flow chart illustrating the operation of a tracking control unit according to another embodiment of the invention when tracking a single object;

Fig. 6 schematically illustrates a 3D camera device according to an embodiment of the invention with moving image sensors;

Fig. 7 is a flow chart illustrating the operation of a tracking control unit according to another embodiment of the invention with near to far objects;

Fig. 8 shows an example of a 360° 3D scanner according to the invention;

Fig. 9 is a flow chart illustrating the operation of a tracking control unit according to another embodiment of the invention with 360° multiple near to far objects; Fig.10 illustrates a high-definition 3D camera device with two image and depth sensors units placed on parallel translation axis with opposite fields of view;

Fig .1 1 is a schematic view of an image sensor;

Fig.12 is a block diagram of a 3D camera device according to the with a user-centric adjustment unit;

Fig.13 is a schematic view of a tracking scene;

Fig.14 is a schematic view of a display system;

Fig.15 is a schematic view of moving image sensors in a 3D camera device according to the invention;

Fig.16 is a flow chart illustrating the operation of a single stereoscopic image sensor control unit;

Fig.17 is a flow chart illustrating the localization of an object of interest by a stereoscopic image sensor using a distance computation unit;

Fig.18 is a flow chart illustrating the operation of a post-processing unit;

Fig.19 is a schematics illustrating user interaction with a converging control unit and an auto-stereoscopic display;

Fig.20 is a flow chart illustrating the zoom operation; Fig.21 is a flow chart illustrating the user activity using a single stereoscopic image sensor control unit;

Fig.22 is a schematics illustrating the operation of multiple stereoscopic image sensor control units; Fig.23 is a flowchart illustrating the user activity using a multiple stereoscopic image sensor control unit.

Detailed Description of Embodiments of the Invention

An embodiment of a high-definition 3D camera system according to the invention is described hereafter with reference to the accompanying drawings. The tracking control unit drives a 3D camera device comprising one depth sensor and two image sensors. The two image sensors are referred to hereafter as a stereoscopic imager. The depth sensor is at a fixed position between the two images sensors, for example at equal distance d from each image sensor, as depicted in Figure 1 . The depth sensor has a low-resolution comparing to the stereoscopic imager. The tracking control unit uses the data coming from the depth sensor for tracking one or more objects, for example a human body where the tracked objects are some joints and/or parts of the body, for example the head, the neck, the shoulders, the elbows, the hand, the torso, the hip, the knees and/or the feet. These joints and/or parts build the skeleton of the tracked body that can be used for gesture recognition or motion analysis of the whole body and/or for detailed gesture recognition and motion analysis of the fingers.

In order to achieve precise motion capture and analysis and/or high resolution of gesture recognition, additional skeletal details are required, which can not be obtained from the low resolution depth map from the depth sensor. Accordingly, details of the hand are for example obtained by tracking the fingers joints, thus leading to: a) marker free motion capture and mapping of movements onto 3D avatars during movements, for example grasping and reaching movements, made by the user; b) applications such as for example post traumatic motor rehabilitation; c) real time game or sports training applications, for example the tracking and analysis of hand, elbow and/or shoulder movements of a user during a golf swing.

Due to the relatively low-resolution of the depth sensor, the fingers joints cannot be tracked directly on the low resolution depth map, i.e. the resolution of the low resolution depth map computed from the data from the depth sensor allows detecting and tracking relatively large elements, of a body for example, but not smaller elements, for example the fingers.

According to the invention, the tracking control unit uses the data from the higher definition stereoscopic imager to generate a high resolution depth map of the smaller parts, for example the fingers, i.e. of parts to be tracked that can not be tracked on the basis of the low resolution depth map because their size is too small relative to the entire image captured.

In order to minimize the computation time required for the generation of the high resolution depth map, the tracking control unit uses the information of the hand position as tracked on the low resolution depth map in order to determine a region of interest on the stereoscopic imager. Thus the high resolution depth map is computed only for the determined region of interest, for example the hand area. This allows generating a high resolution depth map of the region of interest, for example of the hand, as fast as the generation of the low resolution depth map, for example at 90 frames per second for a standard low-resolution depth sensor. The tracking control unit uses the high resolution depth map data to track the smaller elements, for examples fingers joints of the hand of a body. Finally the tracking control unit merges the computed body and hand skeleton in a single map. The overall block diagram of this tracking control unit is depicted in Figure 2. The detailed skeleton allows video tracking smaller parts of a larger tracked ensemble, for example for precise motion tracking of the hand as well as extended gesture recognition of the entire body.

In embodiments, the tracking control unit is able to track multiple object details through the stereoscopic camera. For example the tracking control unit can track the left and right hands and get the finger details of both hands. This is illustrated for example in Figure 3, where the operations of determining a region of interest and computing a high resolution depth map of this region of interest from the image data received from the stereoscopic imager are repeated for each object detail, or smaller part, to track.

In embodiments, the tracking control unit can also track the same object on the low resolution depth map and on the high resolution data from the stereoscopic imager in order to increase the precision of the position for a particular object. For example, if a user wants a precise detection of a joint angle, the tracking control unit sets the stereoscopic imager and its object detection unit to track this particular joint.

In embodiments, the tracking control unit is for example implemented as depicted in Figure 4. A first input frame buffer receives the low resolution depth map from the low-resolution depth sensor. A second input frame buffer receives the data from the high-resolution stereoscopic imager. Data processing is performed by a CPU, DSP, graphic accelerator or a combination of the three. The control unit has a volatile memory to store temporary data and a non-volatile memory to record data. A digital output is used as a global shutter for the depth sensor and stereoscopic imager to ensure that the depth and image data are taken at the same time. A sensor bus controller is used to configure the setting of the stereoscopic imager, especially the region of interest, and the setting of the depth sensor. An output frame buffer may be used for a display device to display the skeleton in real-time. Finally a user input device may be used to control the tracking control unit.

The operation of the tracking control unit as depicted in Figure 2 and in Figure 3, are detailed hereafter.

A first 3D object detection unit gets the low resolution depth map data and information about the one or more objects to track. From these data, it outputs the positions of the objects. The detailed operation of the tracking control unit depends on the object to track. For example in the case of skeleton tracking, the following steps are performed. Firstly a background subtraction, secondly user segmentation and finally a skeletal joint estimation that can use methods such as learning-based, model free or model based.

A second 3D object detection unit is used as a detailed object detection unit and gets information about the one or more objects to track, as well as the one or more high resolution depth maps from the stereoscopic imager of the one or more regions of interest. The operations of the detailed object detection unit are similar to those performed by the first object detection unit, but for tracking other objects, for example for tracking the finger joints.

The first object detection unit determines the object being tracked on the low-resolution depth map and on the high-resolution depth map. In the example of skeleton tracking the first object detection unit sets which body joint needs to be tracked, for example upper limb or all limbs, and the right or left hand fingers. The objects to track can be determined in advance by parameters stored in the non-volatile memory. The object to track can also be selected by the user through an input unit. For example the tracking control unit can display the full skeleton on a touch screen and the user can select the desired joints to be tracked. A position of object subsystem is a buffer storing the actual and last positions of each object detected on the low resolution depth map with its identifier. This unit also computes the velocity (V_x, V_y, V_z) of the object that needs details from the stereoscopic camera.

A region of interest subsystem gets the xyz position and velocity of the object of interest from the position of object subsystem. Then it derives the xyz object position for the next frame. And it transforms this object position into the region of interest on each image sensor. Then it configures the region of interest on the stereoscopic imager.

A fusion subsystem gets the xyz position from the detailed object detection unit and references these positions to the referential of the object tracker subsystem. Then these object positions are added to the position of the object detected by the object tracker subsystem.

A small image area of interest subsystem is implemented as a variable size buffer for the two cropped images received by the stereoscopic imager. A depth map subsystem takes the two stereoscopic images and performs algorithms of feature matching, pixel matching or others for computing depth data for each tracked object and constructing the depth map.

According to another embodiment illustrated in Figure 3, the tracking control unit for high-definition 3D camera device of the invention tracks several objects. The tracking control unit drives a 3D camera device comprising one depth sensor and two image sensors. The two image sensors are referred to hereafter as a stereoscopic imager. The depth sensor is at a fixed position between the two images sensors, for example at equal distance d from each image sensor, as depicted in Figure 1 . The tracking control unit uses the depth sensor data for tracking multiple objects, for example in a human body where the tracked objects are some joints and/or parts of the body, for example the head, the neck, the shoulders, the elbows, the hand, the torso, the hip, the knees and/or the feet. These joints and/or parts build the skeleton of the tracked body that can be used for motion analysis.

In parallel, the tracking control unit uses a 2D object detection unit on each image from the stereoscopic imager to track an object that can be for example a golf club, tennis racket or any other object that the player's body can hold. Once the object is detected in each image, the 3D position of the object is computed. Finally the controller merges the body skeleton and the object's position. To accelerate the 2D object detection, the tracking control unit uses the position of the body part and/or joint that holds the object to define a region of interest on the stereoscopic imager. The overall block diagram is depicted on Figure 5. This skeleton and object detection is useful for example for sport training with a virtual coach.

According to still another embodiment, the position of the image sensors is controlled by the tracking control unit.

According to this embodiment, the tracking control unit drives a 3D camera device comprising one depth sensor and two mobile image sensors. The depth sensor is at a fixed location and the two image sensors are respectively on the left and on the right of the depth sensor, preferably at equal distance from the depth sensor, as depicted on Figure 6. The two image sensor can move in a linear direction and rotate on their axes. This configuration is for example used to track an object that is closer than the minimum range of the depth sensor or very far and not in the stereoscopic plane of the camera.

When an object to track is very close to the depth sensor, the low resolution depth map of the object is corrupted. The object depth data must therefore be retrieved through the stereoscopic imager. The distance between the two images sensor is then set in order to have a stereoscopic effect at least on the object area in order to construct the depth map. For a very close object, the distance between the image sensors is minimal. On the opposite for a very far object, the distance between the two images sensors is maximal. When an object to track is very far from the depth sensor, the object is not seen by the depth sensor. Then the object data has to be retrieved through the stereoscopic imager. If the object is in the stereoscopic area, the object depth map can be reconstructed from the images of the object taken by the stereoscopic imager. Otherwise, the stereoscopic imager needs to be rotated to increase the area of the stereoscopic effect. When the object to track is very far, its position is only detected on the stereoscopic imager and the region of interest is defined by the tracking control unit, based also on the data from the 3D detail object detection unit. The corresponding overall dataflow is depicted on Figure 7. The control of the position of the image sensors thus allows tracking an object from a near to a far distance.

According to still another embodiment, the high-definition 3D camera device comprises a plurality of depth and image sensors units.

According to this embodiment, the tracking control unit drives a 3D high definition camera device comprising a plurality of depth and image sensor units, wherein each depth and image sensors unit comprises one depth sensor and two image sensor units. The depth sensor and the image sensors of each sensors unit are aligned along an axis that forms a side of a polygon. The depth sensors and the image sensors of the 3D high definition camera device thus form a polygon arrangement where each side of the polygon comprises one depth sensor with one image sensor on each side. An example of this configuration is depicted on Figure 8. This configuration, with the sensors' fields of view oriented towards the outside of the polygon, allows tracking objects on 360° around the sensor arrangement. In order to minimize the amount of data in this configuration, the tracking control unit for example defines which depth sensor and/or pair of image sensors is used and the region of interest on each used image sensor. The overall dataflow is depicted on Figure 9. In the above embodiments, the invention has been illustrated with a depth sensor placed at equal distance to each image sensor. Other configurations are however possible within the scope of the invention. The depth sensor can for example be placed at a distance of one image sensor different from the distance to the other image sensor of the stereoscopic imager. Similarly, in the embodiments where the image sensors can translate on an axis for adapting the distance between them to the scene to be captured, either both or only one of the two image sensors may be made mobile along said axis.

An advantage of the 3D high definition camera device of the invention is that it allows the video tracking of one or more objects on a high-resolution depth map without the necessity to compute the full high resolution depth map, but only a small region of interest of it. The computing requirements are therefore minimized, thereby making precise real-time video tracking possible on mobile devices.

Figure 10 illustrates a high-definition camera device according to an embodiment of the invention, with two image and depth sensors units placed on parallel translation axis. This apparatus comprises a 3D tracking control unit as the one illustrated in Figure 9, an I/O interface, a display and a user centric adjustment control. The user centric adjustment control drives the positions d and the angles a of each image sensor in order to track and recognize target objects according to a defined stereoscopic effect. Target objects of the first and/or the second 3D detection units, stereoscopic effects and/or apparatus parameters can for example be set by the user through the I/O interface. Different stereoscopic technologies may be used within the scope of the invention for allowing the synchronous capture, recording and display of stereoscopic data with convenient user control. In particular, sensor orientation and/or translation operations may be used for adapting the stereoscopic effect to a specific object while integrating the zoom function. Moreover, thanks to the tracking control unit of the invention that provides robust location of at least one target object, the user may for example adjust the stereoscopic effect for the target object through a user interface. The high definition 3D camera device of the invention further allows accurate settings to produce a desired near and far disparity on the stereoscopic image. A device with a plurality of stereoscopic imagers as in Figure 22, for example, further provides the user with the possibility to visualize a recorded scene with different parallax effects depending on the tracked target objects.

A preferred embodiment of a user-centric stereoscopic image device is described hereafter with reference to Figures 1 1 to 22. The scene setup is for example an arrangement as illustrated in Figure

13, where the closest object 12, the furthest object 14 and the target object 15 are positioned at respective distances Nc, Fc and Zc from the high definition 3D camera device. In the present illustrative but in no case limiting example, the camera device 7 comprises two moving image sensors 1 with a synchronized shutter 17 (Figure 15) to capture a stereoscopic image of the part of the scene included into the field of view Θ of the image sensors 1 . As described in the block diagram in Figure 19, the user interface unit 1 1 allows inputs from the user to, for example, but not exclusively: • select a target object on the display 32: the user for example draws a wireframe on the image displayed on the display 32;

• assign a stereoscopic effect on the target object 24: the user for example specifies a negative, positive, or null parallax value;

· set a zoom factor 23 by adjusting the focal length of the image sensor units 1 ;

• establish a disparity range 26 or budget which is the tolerated minimum and maximum parallax on respectively the closest object 12 and the farthest object 14 of the captured scene.

The selected image data corresponding to the target object is then sent to the tracking unit 4 which extracts pixel coordinates for the target object in the next data frame. This operation is completed using for example a Tracking-Learning- Detection algorithm that provides a real-time tracking for one or more user- defined objects. The tracking is done for example on the data frame acquired by the left image sensor. A stereo matching operation 27 is then performed to find the pixel coordinates for the target object on the data frame coming from the right image sensor. The pixel disparity between the two set of coordinates are evaluated in the disparity computation unit 28. The localization computation unit uses then the disparity value as well as the camera focal length f and base distance B to establish the localization of the target object and specially its distance Zo from the apparatus. This process is for example performed by a distance computation unit 5 described by a flow chart in Figure 17.

In order to find the optimal settings for the stereoscopic capture, the device of the invention optionally comprises a depth map unit 8. The depth map unit 8 generates a depth map for the visible scene in order to extract the distance to the closest object Nc and to the furthest object Fc. This information is for example useful in the case of the parallel stereoscopic model in order to apply the user- defined disparity range 26 while respecting the orthoscopic condition. This information may also be used to choose between the toed-in and the parallel stereoscopic model since it recognizes if the scene corresponds to settings for macroscopic stereoscopic capture.

The user inputs, distance computation 5, and depth map 8 are then for example conveyed to a converging control unit 6 as illustrated for example in Figure 16. The converging control unit 6 adjusts the image sensor controllers 7 described in Figure 15 to induce the desired stereoscopic effect 24 on the target object 15 in regard with the specified disparity range 26.

In macroscopic stereoscopic capture, the stereoscopic model adopted will preferably be the toed-in model. In this model, with a being the orientation of the image sensors, B being the base distance between the image sensors, and Zc being the distance from the image sensors to the convergence plane with null parallax. Then the toed-in model gives the following relationship:

β

tan a =— Equation 1

Zc

As explained in the background section of the present specification, the a angle is limited in order to avoid extra horizontal parallax on the peripheral side of the image. Therefore, if a exceeds the a_max value, which is typically equal to two degrees, the base distance B of the camera has to be reduced until the condition is satisfied.

if a > cc_max then B = tan a_max Zc Equation 2

Following this condition expressed in Equation 2, a minimal distance Zc for example of 283 millimeters is required for doing macroscopic stereoscopic capture with a moving stereoscopic image sensor characterized by a minimal base distance B of 10 millimeters.

Once the base distance B and the convergence angle a are defined, the information is transferred to a translation controller 19 and to an orientation controller 20.

For non-macroscopic stereoscopic capture, the parallel stereoscopy model is preferred. The converging control unit then uses specified stereoscopic effects 24 and disparity range 26 as well as the computed distance Nc to the nearest object, the distance Fc to the furthest object, and the distance Zo to the target object in order to compute the base distance B between the image sensors. This operation is for example done using the following equations, and with reference to Figure 14:

Z = viewing distance

N = near depth F = far depth

Then the disparity range specified by the user is characterized by the following equations:

Ν · Ε

Equation 3 d^N = z→

F E

dF = —p Equation 4

R = ^ Equation 5

where dN = near disparity as specified by the user dF = far disparity as specified by the user

R = disparity ratio E = user eyes separation (constant value)

The near disparity dN and the far disparity dF are user-specified values while the screen distance is a default value for a camera device display. Therefore, using user-specified disparity values and default screen distance, the converging control unit can compute the near depth value N and the far depth value F for the display system.

The computed parameters for the scene to be captured and the image sensor characteristics provide complementary information to the converging control unit to compute the base distance B between the image sensors.

With reference to Figure 13: Nc = nearest object distance as deduced from the depth map unit 8;

Fc = furthest object distance as deduced from the depth map unit 8;

Zc = virtual screen distance as deduced from the specified stereoscopic effects 24 and object distance Zo;

Wc = virtual screen width.

The image sensor drawn in Figure 10 may be described with: f = lens focal length as deduced from the specified zoom factor 23 w = sensor width (constant) θ' = half field of view

The base distance B between the image sensors 1 can then be computed using the following equations:

Equation 6

R + l

Zc = Equation 7

Wc = 2 - Zc- tan 0 Equation 8 2 - Zc- tan 0 ' - dN- Nc

B = Equation 9

W- (Zc - Nc) + dN- Nc

The base distance B between the image sensors 1 is thus given by Equation 9. The converging control unit is ready to transmit the value to the translation controller 19.

The converging control unit adjusts the convergences of the image sensors 1 by setting the translation and the orientation following the two stereoscopic models. However both stereoscopic models require postprocessing operation 9 as described in Figure 18: the toed-in model must correct the vertical disparity 29 by adjusting horizontal perspective on each image, the parallel model must crop 30 and operate a HIT, Horizontal Image Translation, 32 to set the desired stereo window on the display surface 16.

In the case of the parallel stereoscopic model, the amount of pixel to crop can be computed using the width Wc of the virtual screen 15 computed in Equation 8 and the base distance B computed in Equation 9.

Then the cropping percentage C, and the cropping pixels can be defined as following using X as the initial image resolution:

B

C = Equation 10

Wc + B

Cx = C X Equation 11 Once capture operation 7 and post-processing operation 9 are performed, the generated media are stored in a non-volatile memory 10, such that an auto stereoscopic display can provide the user with synchronous and asynchronous stereoscopic visualization of the captured scene. As described in the block diagram in Figure 19, the user decision is crucial in the current apparatus. Using markers for the minimum and maximum stereoscopic effect, the depth and disparity map visualization, and the real-time stereoscopic visualization as synchronous visual feed-back the user can accurately and safely adjust the zoom factor, the disparity range, and the stereoscopic effect on the target object. Figure 20 is a flow chart of the zoom operation, whereas Figure 21 is a flow diagram of the user activity within the current embodiment.

According to another embodiment, the camera device comprises a plurality of stereoscopic imagers as illustrated for example in Figure 22. The methods for the control of image sensor 1 allow for example the development of multi-rig stereoscopic image sensors where different slices of scene depth are captured using different inter-axial settings. The images of the slices can then be combined together to form the final stereoscopic image pair. This allows important regions of a scene to be given better stereoscopic representation while less important regions are assigned less of the depth budget. It provides user with a way to manage composition within the limited depth budget of each individual display technology.

Figure 23 shows different choices that a user may have when using a camera device according to the present embodiment: either he can choose to record on target object with different stereoscopic effects or he can choose to select multiple target objects and to affect one specific stereoscopic effect on each of the selected objects.

Claims

1 . High-definition 3D camera device comprising at least one image and depth sensors unit, said at least one image and depth sensors unit comprising at least one depth sensor and at least two image sensors for stereoscopic imaging, wherein said 3D camera device further comprises:

- a first object detection unit for recognizing and detecting the position and orientation of at least one target object using a low-resolution depth map of a first resolution, wherein said low-resolution depth map is generated from data obtained from said at least one depth sensor;

- a tracking control unit for driving each of said at least two image sensors in order to track a region of interest (ROI) of the at least one target object on each image of each of the at least two image sensors, said ROI being determined from the position and orientation of said at least one target object given by said first object detection unit;

- a depth map generation unit for generating a high-resolution depth map of the ROI of a second resolution higher than said first resolution, wherein said high-resolution depth map is generated from data obtained from said at least two image sensors; and

- a second object detection unit for recognizing and detecting the position and orientation of said at least one target object from said high- resolution depth map.

2. High-definition 3D camera device according to claim 1 , wherein said device further comprises an object merger unit for merging position and orientation information of said at least one target object from said first object detection unit and from said second object detection unit, and for producing a high-definition 3D image of the target object from said merged information.

3. High-definition 3D camera device according to claim 1 or 2, wherein said tracking control unit is configured for driving an image cropping unit by determining multiple regions of interest (ROI) of each image from each of said least two image sensors.

4. High-definition 3D camera device according to any one of claims 1 to 3, wherein said first object detection unit and said second object detection unit are configured for each tracking different objects.

5. High-definition 3D camera device according to any one of claims 1 to 4, further comprising a user interface unit for allowing a user to select the at least one target object of said first object detection unit and/or of said second object detection unit.

6. High-definition 3D camera device according to any one of claims

1 to 5, further comprising means to translate at least one of the at least two image sensors along an axis and/or means to pivot at least one of said at least two image sensors, wherein the linear position and/or the pivoting angle of said at least one of said at least two image sensors are controlled by the tracking control unit.

7. High-definition 3D camera device according to claim 6, wherein said linear position and/or pivoting angle of said at least one of the at least two image sensors are determined to obtain a defined stereoscopic effect and/or a defined zoom effect.

8. High-definition 3D camera device according to claim 7, wherein said stereoscopic effect and/or said defined zoom effect for the at least one target object is set through a user interface unit.

9. High-definition 3D camera device according to claim 7 or 8, wherein the disparity range for said stereoscopic effect for the at least one target object is set through a user interface unit.

10. High-definition 3D camera device according to any one of claims 1 to 9, wherein said device comprises a first and a second image and depth sensors units having each at least one depth sensor and at least two image sensors for stereoscopic imaging.

1 1 . High-definition 3D camera device according to claim 10, comprising means to translate:

- at least one of said at least two image sensors of the first image and depth sensors unit along a first axis;

- at least one of said at least two image sensors of the second image and depth sensors unit along a second axis.

12. High-definition 3D camera device according to claim 1 1 , wherein said first axis and said second axis are parallel to each other.

13. High-definition 3D camera device according to claim 1 1 , wherein said first axis and said second axis are oblique or perpendicular to each other.

14. Method for video tracking objects, said method comprising the steps of:

- recognizing and detecting the position and orientation of at least one target object using a low-resolution depth map of a first resolution;

- determining a Region of Interest (ROI) of the at least one target object;

- generating a high-resolution depth map of the ROI of a second resolution higher than said first resolution; and

- recognizing and detecting the position and orientation of said at least one target object from said high-resolution depth map.

15. Method according to claim 14 further comprising the step of merging position and orientation data from said low resolution depth map and from said high resolution depth map.

16. Method according to claim 14 or 15, wherein said low-resolution depth map is generated from data obtained from at least one depth sensor.

17. Method according to any one of claims 14 to 16, wherein said high-resolution depth map is generated from data obtained by at least two image sensors.

18. Method according to claim 17, comprising the step of driving each of said at least two image sensors in order to track said region of interest (ROI) on each image of each of the at least two image sensors.