GB2573170A - 3D Skeleton reconstruction from images using matching 2D skeletons - Google Patents


Info

Publication number
GB2573170A
Authority
GB
United Kingdom
Prior art keywords
skeletons
skeleton
image
determining
distance
Legal status
Granted
Application number
GB1806949.2A
Other versions
GB201806949D0
GB2573170B
Inventor
Le Floch Hervé
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Application filed by Canon Inc
Priority to GB1806949.2A
Publication of GB201806949D0
Priority to US16/280,854
Publication of GB2573170A
Application granted
Publication of GB2573170B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/08 Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20036 Morphological image processing
    • G06T2207/20072 Graph-based image processing
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30221 Sports video; Sports image
    • G06T2207/30232 Surveillance

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Generating 3D skeletons from 2D images of real-world objects, comprising: determining (a set of one or more) 2D skeletons of the objects in each image of a set of images captured simultaneously by cameras; for pairs of simultaneously captured images, finding one 2D skeleton from the set of 2D skeletons in one of the images that matches with just one of the 2D skeletons in the other image; generating one 3D skeleton from the matched 2D skeletons. Object pose estimation may allow rendering alternative views, from new viewpoints, of animated, articulated objects such as humans, animals, mammals, robots. Images may be scaled, sampled, cropped, and samples clustered, to identify object parts (e.g. limbs, joints, head, hands, feet, shoulders, knees, elbows, pelvis, arm, thigh, trunk) and associated probabilities. The distance between corresponding parts of matched 2D skeletons may be determined (by projecting 2D skeleton parts as epipolar lines) and thresholded. Weighted links between nodes (corresponding to 2D skeletons) of a graph may be based on skeleton distances. Weak 3D skeletons may be generated by projecting 2D skeleton parts as lines in 3D space to determine 3D positions of those parts. 3D part locations may be finalised using a random sample consensus algorithm.

Description

3D SKELETON RECONSTRUCTION FROM IMAGES USING MATCHING 2D SKELETONS
FIELD OF THE INVENTION
The present invention relates generally to reconstruction of 3D skeletons from views of one or more 3D real world objects. Improved 2D or 3D images of the 3D real world objects can be generated from the reconstructed 3D skeletons.
BACKGROUND OF THE INVENTION
Reconstruction of 3D skeletons, also known as 3D object pose estimation, is widely used in image-based rendering. Various applications for 3D object pose estimation and virtual rendering can be contemplated, including providing alternative views of the same animated 3D object or objects from virtual cameras, for instance a new and more immersive view of a sport event with players.
Various attempts to provide methods and devices for 3D skeleton reconstruction have been made, including US 8,830,236 and the publication “3D Human Pose Estimation via Deep Learning from 2D Annotations” (2016 Fourth International Conference on 3D Vision (3DV), Ernesto Brau, Hao Jiang). However, the efficiency of the techniques described in these documents remains insufficient in terms of performance, including memory use, processing time (for instance near real time, such as less than a few seconds before rendering), and the ability to detect a maximum number of 3D real world objects in the scene.
SUMMARY OF INVENTION
New methods and devices to reconstruct 3D skeletons from source images of the same scene are proposed. A method for generating a 3D skeleton of one or more 3D real world objects observed by cameras according to the invention is defined in Claim 1. It comprises the following steps performed by a computer system: determining a set of one or more 2D skeletons of the 3D real world object or objects in each of (two or more) simultaneous images of the 3D real world objects recorded by the cameras; for one or more pairs (preferably each pair) of the simultaneous images (i.e. of the corresponding sets of 2D skeletons), matching each of one or more 2D skeletons of one of the two corresponding sets with at most one respective skeleton of the other set (thus a 2D skeleton is either matched with another one from the other set, or matched with none); and generating one 3D skeleton from the pairs of matched 2D skeletons.
An idea of the present invention lies in detecting when 2D skeletons of two source images match one another. Non-complex triangulation may then be used to obtain 3D skeletons. The amount of 3D data to be processed is thus drastically reduced.
Overall, the process of the present invention shows reduced processing complexity.
Various applications of the invention may be contemplated, including a method for displaying a 3D skeleton of one or more 3D real world objects observed by cameras as defined in Claim 17. It comprises the following steps performed by a computer system: generating a 3D skeleton of a 3D real world object using the generating method above, selecting a viewpoint in 3D space, and displaying, on a display screen, the generated 3D skeleton, or a 3D object/character obtained from it, from the viewpoint.
In this context, the invention improves the field of rendering a scene from a new viewpoint which may be seen as a new “virtual camera”.
More generally, the 3D skeleton generation may be applied to 2D or 3D image generation, therefore providing improved contribution to the technical field of image processing producing an improved image.
Correspondingly, a system, which may be a single device, for generating a 3D skeleton of one or more 3D real world objects observed by cameras according to the invention is defined in Claim 19. It comprises at least one microprocessor configured for carrying out the steps of: determining a set of one or more 2D skeletons of the 3D real world object or objects in each of simultaneous images of the 3D real world objects recorded by the cameras; for one or more pairs of the simultaneous images, matching each of one or more 2D skeletons of one of the two corresponding sets with at most one respective skeleton of the other set; and generating one 3D skeleton from the pairs of matched 2D skeletons.
Also, a system for displaying a 3D skeleton of one or more 3D real world objects observed by cameras may be as defined in Claim 20. It comprises the above system to generate a 3D skeleton of the 3D real world object connected to a display screen, wherein the microprocessor is further configured for carrying out the steps of: selecting a viewpoint in 3D space, and displaying, on the display screen, the generated 3D skeleton from the viewpoint.
Optional features of the invention are defined in the appended claims. Some of these features are explained here below with reference to a method, while they can be transposed into system features dedicated to any system according to the invention.
In embodiments, matching the 2D skeletons of two images includes: determining a skeleton distance between the 2D skeletons and matching the 2D skeletons together depending on the skeleton distance.
This approach may use a graph to obtain one or more one-to-one associations between a 2D skeleton determined from a first image and a 2D skeleton determined from the second image, wherein nodes of the graph correspond to the 2D skeletons of the two sets and weighted links between nodes are set based on the determined distances between the corresponding 2D skeletons.
In other embodiments, generating one 3D skeleton from the pairs of matched 2D skeletons includes: generating a weak 3D skeleton from each pair of matched 2D skeletons; and determining one or more 3D skeletons from the generated weak 3D skeletons.
The various pairs of matching 2D skeletons are thus used to produce plenty of (intermediate or “weak”) 3D skeletons. Being built from only two 2D skeletons, each intermediate 3D skeleton may appear weak in terms of robustness. However, spatially close instances of the weak 3D skeletons in 3D space make it possible to robustly determine final 3D skeletons representing the 3D real world objects.
As a result, reconstruction of a 3D skeleton is enhanced. Furthermore, the present invention improves detection of multiple 3D skeletons.
In some embodiments, the 2D-to-3D conversions of the pairs of matched 2D skeletons may involve triangulation, meaning generating a weak 3D skeleton from a pair of matched 2D skeletons includes: projecting a part of a first 2D skeleton of the pair as a first line in a 3D space; projecting the same part of the second 2D skeleton of the pair as a second line in the 3D space; and determining a 3D position locating the part for the weak 3D skeleton, based on the first and second lines.
In some embodiments, determining one or more 3D skeletons from the generated weak 3D skeletons includes converting 3D positions of the weak 3D skeletons locating the same part of the weak 3D skeletons into a unique 3D position for the part. Where there are numerous instances of the same part, a final and robust part instance may be obtained. This may be repeated for each part forming a 3D skeleton.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 is a general overview of a system 10 implementing embodiments of the invention;
Figure 2 illustrates an exemplary 3D model of a 3D real world object, based on which a 3D skeleton of the 3D object can be built;
Figure 3 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention;
Figure 4 illustrates, using a flowchart, embodiments of a method for generating a 3D skeleton of a 3D real world object observed by source cameras according to the present invention;
Figure 5 schematically illustrates an exemplary splitting of a cuboid into elementary cubes V(X,Y,Z);
Figure 6 schematically illustrates a way to compute a part distance between the same parts of two 2D skeletons according to embodiments of the present invention;
Figure 7 illustrates, using a flowchart, steps for computing a skeleton distance between two 2D skeletons;
Figure 8 schematically illustrates a triangulation-based way to build a weak 3D skeleton from a matching pair of 2D skeletons according to embodiments of the present invention;
Figure 9 illustrates, using a flowchart, steps for converting weak 3D skeletons into a robust 3D skeleton according to embodiments of the present invention;
Figure 10 illustrates bundles of weak 3D skeletons obtained when applying the process of Figure 9 to generate robust 3D skeletons; and
Figure 11 illustrates, using a flowchart, a process for displaying a 3D skeleton of a 3D real world object observed by source cameras according to embodiments of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Figure 1 is a general overview of a system 10 implementing embodiments of the invention. The system 10 comprises a three-dimensional (3D) real world object 11 of a scene captured by two or more source camera/sensor units 12.
The 3D real world object 11 may be of various types, including beings, animals, mammals, human beings, articulated objects (e.g. robots), still objects, and so on. The scene captured may also include a plurality of 3D objects that may move over time.
Although two main camera units 12a, 12b are shown in the Figure, there may be more of them, for instance about 7-10 camera units, up to about 30-50 camera units in a stadium.
The source camera units 12 generate synchronized videos made of 2D source images 13 (i.e. views from their viewpoints) of the scene at substantially the same time instant, i.e. simultaneous source images. Each source camera/sensor unit 12 (12a, 12b) comprises a passive sensor (e.g. an RGB camera).
The 3D positions and orientations of the source cameras 12 within a reference 3D coordinates system SYS are known. They are named the extrinsic parameters of the source cameras.
Also, the geometric model of the source cameras 12, including the focal length of each source camera and the position at which the center of projection orthogonally projects onto the image 13, is known in the camera coordinate system. These are named the intrinsic parameters of the source cameras. The camera model is described with intrinsic parameters as a pinhole model in this description, but a different model could be used without changing the substance of the invention. Preferably, the source cameras 12 are calibrated so that they output their source images of the scene at the same cadence and simultaneously. The intrinsic and extrinsic parameters of the cameras are assumed to be known or calculated using well-known calibration procedures.
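As a purely illustrative aside (not part of the patent text), the pinhole model just described can be sketched in a few lines of Python; the function name project_point and all numeric values below are hypothetical:

```python
import numpy as np

def project_point(K, R, t, X_world):
    """Pinhole projection of a 3D world point into pixel coordinates.

    K    : 3x3 intrinsic matrix (focal length, principal point).
    R, t : extrinsic rotation (3x3) and translation (3,) mapping world
           coordinates SYS into the camera coordinate system.
    """
    X_cam = R @ X_world + t        # world -> camera coordinates
    x_h = K @ X_cam                # homogeneous image coordinates
    return x_h[:2] / x_h[2]        # perspective division -> (u, v) pixels

# Hypothetical camera: 1000-pixel focal length, principal point (960, 540),
# placed 5 m from the world origin along its optical axis.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])
print(project_point(K, R, t, np.array([0.1, 0.2, 0.0])))  # -> [980. 580.]
```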
In particular, these calibration procedures allow the 3D object to be reconstructed into a 3D skeleton at the real scale.
The source images 13 feed a processing or computer system 14 according to the invention.
The computer system 14 may be embedded in one of the source cameras 12 or be a separate processing unit. Any communication technique (including Wi-Fi, Ethernet, 3G, 4G, 5G mobile phone networks, and so on) can be used to transmit the source images 13 from the source cameras 12 to the computer system 14.
An output of the computer system 14 is a 3D skeleton for at least one 3D object of the scene, from which a 2D or 3D image, preferably of the scene, can be generated. A virtual image 13v built with the generated 3D skeleton and showing the same scene with the 3D object or objects from the viewpoint of a virtual camera 12v may be rendered on a connected display screen 15. Alternatively, data encoding the generated 3D skeleton may be sent to a distant system (not shown) for storage and display, using for instance any communication technique. Stored 3D skeletons may also be used in human motion analysis, for video monitoring purposes for instance.
Figure 2 illustrates an exemplary 3D model 20 of a 3D real world object, based on which a 3D skeleton of the 3D object may be built according to the teachings of the present invention. In the example of the Figure, the 3D object is an articulated 3D real world object of human being type. Variants may regard still objects.
The 3D model comprises N distinct parts 21 and N-1 connecting elements or links 22. The parts 21 represent modeled portions of the 3D real world object, for instance joints (shoulders, knees, elbows, pelvis, ...) or end portions (head, hands, feet) of a human being. Each part 21 is defined as a 3D point (or position) in the 3D coordinates system SYS. The 3D point or position may be approximated by a voxel in case SYS is discretized. The connecting elements 22 are portions connecting the parts 21, for instance limbs such as the forearm, arm, thigh, trunk and so on. Each connecting element 22 can be represented as a straight line through 3D space between the two connected parts, also named “adjacent parts”.
To robustly generate the 3D skeleton or skeletons of the scene volume in 3D space, an idea of the present invention consists in determining correspondences or matching between pairs of 2D skeletons detected in the source images and using these correspondences to generate one or more 3D skeletons. Preferably, the pairs of matching 2D skeletons are projected into corresponding intermediate or “weak” 3D skeletons in 3D space. The multiplicity of spatially-close intermediate 3D skeletons is a robust indication that a final 3D skeleton exists in this sub-volume.
This approach advantageously reduces the complexity of the 3D skeleton reconstruction, as only pairs of 2D skeletons are processed. Thanks to the robustness provided through the multiplicity of intermediate 3D skeletons, it also improves the isolation of 3D objects within a scene volume comprising plenty of them. As a result, real-time reconstructions of 3D skeletons (and thus displays or human motion analysis, for instance) are better achieved. Real-time reconstructions for “live” TV or broadcast purposes may include a few seconds of delay, e.g. less than 10 seconds, preferably at most 4 or 5 seconds.
The inventors have noticed that the proposed approach works efficiently on complex scenes (like sport events with multiple players in a stadium), with an ability to detect a large number of interacting 3D objects (multiple human players).
To that end, two or more simultaneous source images 13 of the 3D objects recorded by the source cameras 12 may be obtained, from memory of the computer system for instance.
In case a volume V of the captured scene is delimited, its position and orientation are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known). A set of one or more 2D skeletons of the 3D real world object or objects in each of the (two or more) simultaneous source images recorded by the source cameras can be determined. Known techniques to detect 2D skeletons corresponding to a known model can be used, as described below. Additional techniques, such as scaling and possibly cropping, may improve detection of 2D skeletons in the images, while allowing them to be clustered for independent processing.
Next, one or more pairs, and preferably each pair, of the simultaneous source images or corresponding sets of 2D skeletons, are successively considered to determine matching between 2D skeletons. Each of one or more 2D skeletons of one of the two corresponding sets (preferably each 2D skeleton of the set) is matched with at most one respective skeleton of the other set if at all possible. It means that either a 2D skeleton is matched with another one from the other set, or it is matched with none of them, depending on criteria applied.
Each pair of matching 2D skeletons from different views (source images) of the same scene volume can then be processed using triangulation in order to build an intermediate 3D skeleton, the robustness of which is quite low or weak. An intermediate or “weak” 3D skeleton can thus be generated, in 3D space, from each pair of matched 2D skeletons.
All the generated intermediate 3D skeletons can then be used to determine one or more final 3D skeletons, for instance based on spatial criteria to convert e.g. plenty of spatially-close intermediate 3D skeletons into one robust final 3D skeleton for display. More generally, one (or more) 3D skeleton is generated from the pairs of matched 2D skeletons.
The generated 3D skeleton may be used to generate a 2D or 3D image. The present invention thus provides improved contribution to the technical field of image processing producing an improved image.
As mentioned above, an exemplary application for the present invention may relate to the display of a virtual image 13v showing the same scene from a new viewpoint, namely a virtual camera 12v. To that end, the invention also provides a method for displaying a 3D skeleton of one or more 3D real world objects observed by source cameras. This method includes generating at least one 3D skeleton of a 3D real world object using the generating method described above.
Next, this application consists in selecting a virtual camera and displaying the generated 3D skeleton from the virtual camera on a display screen. In practice, several generated 3D skeletons are displayed simultaneously on the display, for instance when displaying a sport event. A simple 3D object as shown in Figure 2 can be used to display the generated 3D skeleton. This is useful for displaying animations that require low rendering costs. More advanced applications can also provide an envelope to the 3D skeleton with a texture, either predefined or determined from pixel values acquired by the source cameras (for better rendering). This makes it possible, for example, to accurately render shot or filmed sportsmen as they actually look in the scene volume.
Selecting a virtual camera may merely consist in defining the extrinsic and intrinsic parameters of a camera, thereby defining the viewpoint (i.e. distance and direction from the scene volume) and the zoom (i.e. focal length) provided by the virtual image.
Generating the 3D skeletons and displaying/rendering them on the display screen 15 may be performed for successive source images 13 acquired by the source cameras 12. Of course, the displaying operation follows the timing of acquisition of the source images. It follows that 3D-skeleton-based animations of the captured scene can be efficiently produced and displayed.
Other applications based on the generated 3D skeleton or skeletons may be contemplated. For instance, video monitoring for surveillance of areas, such as a street or a storehouse, may perform detection of 3D skeletons in captured surveillance images and then analyse the movement of these 3D skeletons to decide whether to trigger an alarm.
Figure 3 schematically illustrates a device 300 used for the present invention, for instance the above-mentioned computer system 14. It is preferably a device such as a microcomputer, a workstation or a light portable device. The device 300 comprises a communication bus 313 to which there are preferably connected: - a central processing unit 311, such as a microprocessor, denoted CPU; - a read only memory 307, denoted ROM, for storing computer programs for implementing the invention; - a random access memory 312, denoted RAM, for storing the executable code of methods according to the invention as well as the registers adapted to record variables and parameters necessary for implementing methods according to the invention; and - at least one communication interface 302 connected to a communication network 301 over which data may be transmitted.
Optionally, the device 300 may also include the following components: - a data storage means 304 such as a hard disk, for storing computer programs for implementing methods according to one or more embodiments of the invention; - a disk drive 305 for a disk 306, the disk drive being adapted to read data from the disk 306 or to write data onto said disk; - a screen 309 for displaying data and/or serving as a graphical interface with the user, by means of a keyboard 310 or any other pointing means.
The device 300 may be connected to various peripherals, such as for example source cameras 12, each being connected to an input/output card (not shown) so as to supply data to the device 300.
Preferably the communication bus provides communication and interoperability between the various elements included in the device 300 or connected to it. The representation of the bus is not limiting and in particular the central processing unit is operable to communicate instructions to any element of the device 300 directly or by means of another element of the device 300.
The disk 306 may optionally be replaced by any information medium such as for example a compact disk (CD-ROM), rewritable or not, a ZIP disk, a USB key or a memory card and, in general terms, by an information storage means that can be read by a microcomputer or by a microprocessor, integrated or not into the apparatus, possibly removable and adapted to store one or more programs whose execution enables a method according to the invention to be implemented.
The executable code may optionally be stored either in read only memory 307, on the hard disk 304 or on a removable digital medium such as for example a disk 306 as described previously. According to an optional variant, the executable code of the programs can be received by means of the communication network 301, via the interface 302, in order to be stored in one of the storage means of the device 300, such as the hard disk 304, before being executed.
The central processing unit 311 is preferably adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, which instructions are stored in one of the aforementioned storage means. On powering up, the program or programs that are stored in a non-volatile memory, for example on the hard disk 304 or in the read only memory 307, are transferred into the random access memory 312, which then contains the executable code of the program or programs, as well as registers for storing the variables and parameters necessary for implementing the invention.
In a preferred embodiment, the device is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Various embodiments of the present invention are now described with reference to Figures 4 to 11.
Figure 4 illustrates, using a flowchart, embodiments of a method according to the present invention. The method takes place in the computer system 14 which has previously received M source images 13 acquired simultaneously by M calibrated source cameras 12, for instance through a wireless or a wired network. These source images 13 are for instance stored in a reception buffer (memory) of the communication interface 302. The M source images may be a subset of source images available.
The method 400 may be repeated for each set of simultaneous source images 13 received from the source cameras 12 at each successive time instant. For instance, 25 Hz to 100 Hz source cameras may be used, thereby requiring a set of source images 13 to be processed every 1/100 to 1/25 second.
The scene volume V viewed by the source cameras 12 may be predefined as shown by the volume parameters 401. These parameters locate the scene volume in the coordinates system SYS. The scene volume V may be split into elementary voxels V(X,Y,Z), preferably of equal sizes, typically elementary cubes. A size of the elementary voxels may be chosen depending on the 3D object to be captured. This is the resolution of the 3D space: each voxel corresponds to a point in the 3D space.
For instance, the edge length of each elementary voxel may be set to 1 cm for a human being. Figure 5 schematically illustrates the splitting of a cuboid into elementary cubes V(X,Y,Z), only one of which is shown for the sake of clarity.
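For illustration only, the mapping from a 3D position in SYS to its elementary voxel V(X,Y,Z) could look as follows in Python; the cuboid origin and the 1 cm edge are assumptions taken from the example above:

```python
import numpy as np

origin = np.array([0.0, 0.0, 0.0])   # hypothetical cuboid corner, in metres
voxel_edge = 0.01                    # 1 cm elementary cube, as suggested above

def voxel_index(p):
    """Return the (X, Y, Z) index of the elementary voxel V(X,Y,Z)
    containing the 3D point p (expressed in SYS, in metres)."""
    return tuple(np.floor((p - origin) / voxel_edge).astype(int))

print(voxel_index(np.array([1.234, 0.567, 0.895])))  # -> (123, 56, 89)
```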
The invention also applies to a 3D coordinates system SYS without specific scene volume and corresponding splitting into voxels.
The source cameras 12 have been calibrated, meaning their extrinsic and intrinsic parameters 402 are known.
The nature, and thus the 3D model 20, of each 3D real world object 11 in SYS is known. For ease of explanation, the description below concentrates on a single type of 3D object, for instance a human being as modelled in Figure 2. Where the captured scene contains various types of 3D objects, various corresponding 3D models 20 can be used following the teachings below.
The method starts with the obtaining 451 of two or more simultaneous source images of the 3D objects or of the scene volume recorded by the source cameras. The source images 13 are for instance retrieved from the reception buffer of the communication interface 302.
Although the source images may have different sizes from one source camera to the other, it is assumed they have the same size for illustration purposes. In any case, some source images may be resized to reach such a situation. This resizing is not mandatory but helps simplify the description.
From each of these source images 13i, one or more 2D skeletons 2D-SKij 403 are determined at step 452. Such determination is based on the 3D model or models 20, so as to detect each of their parts (or at least a maximum number of such parts) within each source image. Several occurrences of the same model can be detected within the same source image, meaning several 3D real world objects are present in the scene captured by the cameras.
In the example of Figure 2, the detected 2D skeletons are made of up to thirteen parts with up to twelve connecting elements.
Known techniques can be used to produce these 2D skeletons from the source images 13.
One technique is described in the publication “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields” by Zhe Cao et al. (2016). This technique calculates confidence maps for part detection and part affinity fields for part association. A confidence map for a given part bears probabilities, for respective pixels of the source image, that these pixels correspond to said part of the 3D model 20. Each part affinity field, defined for a connecting element (or limb) between two adjacent parts, provides affinity vectors for respective pixels of the source image, the magnitude and direction of each affinity vector representing the estimated probability and orientation of the limb connecting, according to the 3D model, two occurrences of said adjacent parts at the respective pixel in the source image.
The part maps and part affinity fields may have a different size/resolution from the source images (e.g. they are sub-sampled compared to the size of the source image). In such a case, the intrinsic parameters of the cameras can be modified taking into account the subsampling factor. In a variant, the part maps or part affinity fields may be interpolated in order to match the genuine size of the source images. In such a case, a bilinear interpolation is preferred over a nearest-neighbor or bi-cubic interpolation.
The part maps and part affinity fields are then processed to respectively obtain part candidates for each part type and limb candidates for each limb type. The limb candidates that share the same part candidates are then assembled into full-body poses, i.e. into 2D skeletons compliant with the 3D model 20.
In this process, each part candidate can be provided with a part probability while each limb candidate can be provided with a pairwise (or part affinity) probability, for instance based on the modulus of the affinity vectors between the two part candidates. As a result, in embodiments of the present invention, each constructed 2D skeleton may be associated with a robustness score based on the part probabilities of its parts and the pairwise probabilities of its limbs.
Another technique is described in publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” by Eldar Insafutdinov et al. (2016) or publication “Deep-Cut: Joint Subset Partition and Labelling for Multi Person Pose Estimation" by Leonid Pishchulin et al. (2016).
Although this technique is different from “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, part candidates with associated part probabilities and pairwise terms (similar to limb candidates) with pairwise probabilities between all parts are determined. A clustering of the part candidates and pairwise terms belonging to one and the same 3D object (person) is performed. From each cluster, one or more 2D skeletons are identified using the probabilities.
Again, in embodiments of the present invention, a robustness score can be obtained for each 2D skeleton generated based on the corresponding probabilities.
More generally, a convolutional neural network (CNN) can be used which is configured based on a learning library of pictures in which a matching with each part of the models has been made. The CNN detects parts with associated part probabilities and provides pairwise (or part affinity) probabilities between detected parts. Pairwise probabilities may be obtained by different means. For example, in the publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” by Eldar Insafutdinov et al. (2016), a logistic regression algorithm is used. A graph solver is then used to build the 2D skeleton from the probabilities. The graph is made of nodes formed by the parts and links between the nodes. The nodes are weighted with the corresponding part probabilities while the links are weighted with the corresponding pairwise (or part affinity) probabilities. Different graph solvers can be used. For example, a bipartite solving of the graph reduces to a maximum weight bipartite graph matching problem as explained for instance in “Introduction to graph theory, volume 2” by D. B. West et al. (2001). Graph clustering algorithms can also be used as described in “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” by Eldar Insafutdinov et al. (2016). The optimal associations between the parts give the 2D skeletons.
An advantage of the CNNs is that the same running of the CNN can identify, within an input image, parts from different models, provided the CNN has learnt using learning pictures embedding the various models to be searched.
Typically, the part probabilities generated are unary, i.e. set between 0 and 1.
It turns out that step 452 generates a plurality of sets of 2D skeletons 2D-SKij (where “i” identifies a source image and “j” indexes the 2D skeletons detected in source image i).
Step 452 as described above operates directly on each source image 13i as a whole. Embodiments may optimize such step.
These known techniques for step 452 depend on the set of learning pictures used to train the CNN. Such learning pictures usually provide exemplary objects that have bounded sizes. These techniques are therefore poorly suited to detecting objects whose size is not of the same order of magnitude as in the learning pictures. Indeed, 3D objects can be sometimes big, sometimes tiny. This is for instance the case during sport events, where players move from very close to the camera to very far away.
In first embodiments seeking to increase robustness, it is proposed to use scaling of the source image to find a better scaled version (if any) of the source image from which the 2D skeletons can be better detected.
To that end, one or more scaled versions of a given source image 13 are obtained at step 453.
For instance, a half-sized image (scale 0.5) is generated (through down-sampling) as well as a double-sized image (scale 2 - through up-sampling). Known scaling techniques can be used.
Of course, other scaling values can be used. In this example, at least one up-scaled version and one downscaled version of the source image are obtained and used. In variants, only up-scaled versions or only downscaled versions are used.
Next, part probabilities are determined at step 454 for respective pixels of the source image or of its scaled versions (possibly each pixel, if the part map has the same dimensions as the images), representing the probabilities that the respective pixels in the source image or scaled version correspond to a part (any one) of the 3D real world object.
Pixels of the source image or of its scaled versions are examples of “samples” forming an image. For ease of illustration, it is made reference below to pixels, while the invention may apply to any sample. A sample may be for instance a pixel in the source image, a color component of a pixel in the source image, a group of pixels in the source image, a group of pixel color components in the source image, etc.
The determination of part probabilities may be done using a CNN as described above. For instance the part maps generated by such CNN convey such part probabilities. The merger (or superimposition) of several (e.g. the thirteen ones in the case of Figure 2) part maps gives an “image” of part probabilities for each of the source image or scaled version.
Next, the part probabilities so generated can be used to determine from which one of the source image and its scaled versions the set of one or more 2D skeletons should be determined. Thus, only one of the source image and its scaled versions is selected at step 455, based on their part probabilities.
For instance the scaled version (including the source image 13) that maximizes the response of the CNN is selected. As an example, the response of the CNN may be defined as the number of samples/pixels associated with a part probability above a predefined threshold (whatever the part concerned). For instance the predefined threshold may be set to 0.9 in case of unary probabilities. Of course, refinements of this approach may be contemplated. For instance different thresholds may be used for different parts of the model considered.
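By way of a non-authoritative sketch, the selection criterion of step 455 (counting samples whose part probability exceeds a threshold, 0.9 here, as in the example above) could be implemented as follows; the part_maps array layout and the versions dictionary are assumptions:

```python
import numpy as np

def cnn_response(part_maps, threshold=0.9):
    """Number of samples whose best part probability exceeds the threshold,
    over the merged (superimposed) part maps of one image version.

    part_maps : (n_parts, H, W) array of unary part probabilities.
    """
    merged = part_maps.max(axis=0)       # best part probability per sample
    return int((merged > threshold).sum())

def select_scale(versions):
    """versions maps a scale factor (e.g. 0.5, 1.0, 2.0) to the part maps
    computed on that scaled version; the version maximizing the CNN
    response is selected, as in step 455."""
    return max(versions, key=lambda scale: cnn_response(versions[scale]))
```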
The selected source image 13 or scaled version is then used for the actual 2D skeleton determination 456 in order to obtain the 2D skeletons 2D-SKij.
This is repeated for each source image, meaning that some source images can be selected for 2D skeleton determination 456 while up-scaled versions of other source images can be selected for the same operation 456 and down-scaled versions of yet other source images can also be selected for their own step 456.
Optimization of step 452 may also seek to reduce the amount of samples (pixels) to be processed simultaneously in the source images (or their selected scaled versions if any). To do so, relevant subparts of the selected image (source image or scaled version) are identified and selected. The determination 456 of the 2D skeletons can thus be performed independently on each relevant subpart. This substantially reduces calculation complexity and memory consumption.
Implementing such an approach, second embodiments thus provide: clustering 457 samples of the selected image into clusters; determining 458, for one or more clusters (preferably each cluster), a cropping area encompassing (i.e. including) the cluster in the image; and determining 459 one or more 2D skeletons from each cropping area independently. These steps are repeated for each selected image (source images 13i or their scaled versions).
To perform the clustering 457, the selected image (source image or a scaled version) is used to build a graph made of nodes formed by detected parts and links between the nodes. The nodes are weighted with corresponding part probabilities while the links between the nodes are set (e.g. depending on their weights: for instance no link is set when the weight is too low) and weighted with corresponding pairwise (or part affinity) probabilities.
Preferably, a graph/tree is built including all part candidates (i.e. for all the parts). A conventional graph clustering, preferably without transitivity constraint, makes it possible to create clusters of part candidates (with their links between them). This clustering makes it possible to separate quite distant objects in the image.
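Purely as an illustrative stand-in for the conventional graph clustering mentioned above (without transitivity constraint), one could threshold the pairwise probabilities and take connected components; the matrix layout and the threshold value are assumptions, not the patent's prescribed algorithm:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_part_candidates(pairwise, link_threshold=0.5):
    """Group part candidates into clusters (step 457): keep only links
    whose pairwise (part affinity) probability is high enough, then take
    connected components of the resulting graph.

    pairwise : (n, n) symmetric matrix of pairwise probabilities between
               the n part candidates of the selected image.
    Returns one cluster label per part candidate.
    """
    adjacency = csr_matrix(pairwise > link_threshold)
    _, labels = connected_components(adjacency, directed=False)
    return labels
```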
Transitivity constraints guarantee the consistency of the graph clustering. For example, if a neck is connected to a right shoulder and if the right shoulder is connected to a right elbow, transitivity constraints guarantee that the neck will be connected to the right elbow. These constraints are introduced into the graph clustering algorithm (e.g. by using an Integer Linear Programming algorithm) to obtain the most coherent solution. Solving the graph clustering without transitivity constraints is less optimal but faster.
Once the clusters are known, each cluster is successively considered. For a given cluster, the selected image (or optionally the corresponding source image, in which case rescaling of the part candidates is performed beforehand) can be cropped 458 around the cluster. This defines cropping areas in the selected image.
The cropping may select the smallest (square or rectangle) portion of the image that includes all the part candidates of the cluster. Optionally, a guard margin may be kept around the part candidates.
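A minimal sketch of step 458, under the assumption that part candidates are given as (x, y) pixel positions and that a 10-pixel guard margin is wanted, might be:

```python
import numpy as np

def cropping_area(candidates, margin=10, image_shape=None):
    """Smallest axis-aligned rectangle enclosing all part candidates of a
    cluster, optionally enlarged by a guard margin and clipped to the image.

    candidates  : (n, 2) array of (x, y) part candidate positions.
    image_shape : optional (height, width) of the image for clipping.
    Returns the (x0, y0, x1, y1) corners of the cropping area.
    """
    x0, y0 = candidates.min(axis=0) - margin
    x1, y1 = candidates.max(axis=0) + margin
    if image_shape is not None:
        h, w = image_shape
        x0, y0 = max(0, int(x0)), max(0, int(y0))
        x1, y1 = min(w - 1, int(x1)), min(h - 1, int(y1))
    return x0, y0, x1, y1
```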
Next, the 2D skeletons can be determined 459 independently from each cropping area.
To fully take advantage of the cropping, a new selection of the best scaling factor of the cropping area is performed.
This requires that the portions corresponding (given the scaling) to the cropping area in the source image 13 and in its scaled versions be compared using the approach described above (steps 454-455), so as to select the cropping area, in the source image or in one of its scaled versions, that maximizes the response of the CNN.
More generally: one or more scaled versions of the cropping area in the image are obtained; part probabilities are determined for respective samples of the cropping area in the image and of its scaled versions, representing the probabilities that the respective samples correspond to a part of the 3D real world object; and one (preferably only one) of the cropping area in the image and its scaled versions is selected based on these part probabilities, the set of one or more 2D skeletons being determined from the selected cropping area or scaled version. The criteria described above for the selection can be reused. A cropping area from the source image or from one of its scaled versions is finally selected (this is done for each cluster determined at step 457 and for each source image). 2D skeletons can then be determined 459 from this selected cropping area. This may merely rely on graph solving as introduced above.
The selected cropping area (from the source image or one of its scaled versions) is then used to build a graph made of nodes formed by parts detected in it and links between the nodes. The nodes are weighted with corresponding part probabilities while the links may be set depending on the pairwise (or part affinity) probabilities (for instance no link is set in case the corresponding probability is too low) and weighted with corresponding pairwise (or part affinity) probabilities. A graph clustering without transitivity constraint can then be performed on this graph, which makes it possible to create new clusters of part candidates (with their links between them). This helps to further separate slightly distant objects. A second graph clustering but with transitivity constraint can next be performed on each sub-graph corresponding to one of the clusters so created, which makes it possible to create new sub-clusters of part candidates (with their links between them).
At this stage a connected component algorithm may be used to connect the part candidates within each sub-cluster according to the 3D model 20. This step builds numerous connections between the part candidates.
At this stage, several positions may exist for the same part: for example, several head positions, several neck positions, and so on. Therefore, several different 2D skeletons of the same object may exist.
So, the best 2D skeleton among the potential ones may be extracted from each sub-cluster.
To determine the best 2D skeleton, a shortest tree path detection algorithm may be performed on one (or more) graph/tree built based on the sub-cluster.
The ending parts of the 3D model are considered, and paths from one ending part to each other ending part are defined. Generally, a 3D model has P ending parts; in that case, P-1 paths can be defined from a given ending part. In the example of Figure 2, five ending parts are defined: head, left hand, right hand, left foot and right foot. Four paths can then be defined: the head-to-right-hand path, the head-to-left-hand path, the head-to-right-foot path and the head-to-left-foot path. Each path includes intermediate parts (e.g. neck, right shoulder and right elbow for the head-to-right-hand path). A complete tree is built for each path between the corresponding ending part candidates of the sub-cluster. For instance, a complete graph/tree between each head candidate and each right-hand candidate, passing through the neck candidates, right shoulder candidates and right elbow candidates of the sub-cluster, is built. Weights are associated with each edge of the tree (a link between two part candidates corresponding to different adjacent parts). For example, the weights can be the pairwise probabilities between the two nodes, or a combination of node part probabilities and pairwise probabilities. The tree can then be segmented into independent sub-trees, each sub-tree defining a unique path between adjacent parts. The construction of the sub-trees can be viewed as a graph segmentation. A global solution of this segmentation is the one (i.e. the path of part candidates) that maximizes the total weight of the independent sub-trees. This is the solving of the tree.
For instance, when several part1 (e.g. head) candidates and/or part2 (e.g. neck) candidates exist, the various end-to-end paths may use different links between a part1 candidate and a part2 candidate. It means that, among the various end-to-end paths determined in the sub-graph, some part1-to-part2 links are more or less used. Preferably, the solving outputs the part candidates of the links that are most often used, for instance the head and neck candidates forming the link most often selected by the end-to-end shortest path solver. To illustrate this, if a pair of head and neck candidates is selected by three end-to-end paths and another pair of different head and neck candidates is selected by a single end-to-end path, the final pair of head and neck candidates is the one associated with the three paths. In case of equality, the pair with the highest edge weight can be selected.
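As a simplified, non-authoritative sketch of solving one end-to-end path, a Viterbi-style dynamic program can pick one candidate per part along a chain so as to maximize the sum of edge weights; the tree solving with link voting described above is more general, and the names below are hypothetical:

```python
import numpy as np

def best_chain(layers, edge_weight):
    """Pick one candidate per part along a chain of adjacent parts so that
    the sum of edge weights (e.g. pairwise probabilities) is maximal.

    layers      : list of candidate lists, one list per part on the path.
    edge_weight : function (candidate_prev, candidate_cur) -> weight.
    """
    # score[k] = best total weight of a chain ending at candidate k
    score = np.zeros(len(layers[0]))
    back = []
    for i in range(1, len(layers)):
        w = np.array([[edge_weight(a, b) for a in layers[i - 1]]
                      for b in layers[i]])          # shape (cur, prev)
        totals = w + score[None, :]
        back.append(totals.argmax(axis=1))          # best predecessor index
        score = totals.max(axis=1)
    # Backtrack the winning chain of candidate indices.
    idx = int(score.argmax())
    chain = [idx]
    for pointers in reversed(back):
        idx = int(pointers[idx])
        chain.append(idx)
    chain.reverse()
    return [layers[i][k] for i, k in enumerate(chain)]
```

For the head-to-right-hand path of Figure 2, layers would hold the head, neck, right shoulder, right elbow and right hand candidates of the sub-cluster.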
As a result, a 2D skeleton 2D-SKij (complete with thirteen parts in the example, or partial) is obtained.
This process of segmentation/subtree generation is repeated for all paths of the sub-cluster, and then for all sub-clusters.
It results that one or more 2D skeletons 2D-SKij 403 are generated for each cluster of part candidates, and thus for each original source image.
Preferably, the obtained 2D skeletons are rescaled to the original scaling of their source images.
This ends the determination 452 of the 2D skeletons 2D-SKij from each source image i. A matching between the generated 2D skeletons is then sought. This is step 460: matching, for one or more pairs (and preferably each pair) of the simultaneous source images, each of one or more 2D skeletons of one of the two corresponding sets (preferably each skeleton of the set) with at most one respective skeleton of the other set, if any.
An implementation of such matching 460 includes first determining 461 a skeleton distance between two 2D skeletons (taken from two sets), and matching 462 the 2D skeletons together depending on the skeleton distance. This is repeated for all pairs of 2D skeletons (one from the first set and the other from the second set).
As described below, since such a skeleton distance is determined in the source images, the latter are preferably rescaled, if necessary, so that they are all at the same resolution and have homogeneous image coordinates.
In embodiments, a part distance is determined between two corresponding parts of the 2D skeletons considered. Preferably, this determining is repeated for other parts (preferably all parts) composing the 2D skeletons, and the determined part distances are summed. The final sum may be the skeleton distance.
Figure 6 schematically illustrates a way to compute a part distance ρδ between the same parts of two 2D skeletons 2D-SK1 and 2D-SK2 provided by two source images 13₁ and 13₂. The parts forming the 2D skeletons are shown with black stars. Figure 7 illustrates, using a flowchart, the corresponding operations.
The extrinsic and intrinsic parameters 402 of the corresponding cameras 12₁ and 12₂ are known and used to calculate the two fundamental matrices 404: M1-2 from camera 12₁ to camera 12₂ and M2-1 from camera 12₂ to camera 12₁. In epipolar geometry, it is known that the fundamental matrix projects a point of a first view onto a line (an epipolar line) in the other view. Concretely, the epipolar line is the line Δ seen from the other camera. Two directions may thus be processed, meaning for instance that the part distance ρδ may be built from a first directional part distance ρδ1-2 and a second directional part distance ρδ2-1.
The top half of Figure 6 illustrates the computation of the first directional part distance ρδ1-2 while the bottom half illustrates the computation of the second directional part distance ρδ2-1.
As shown, a part (the head in the example) is selected at step 700. This part of the first 2D skeleton 2D-SK1, determined from the first source image 13₁, is projected 701 as a first epipolar line Δ1-2 on the second source image 13₂. Next, a first directional part distance ρδ1-2 is computed 702 between the same part (the head in the example) of the second 2D skeleton 2D-SK2, determined from the second source image 13₂, and the first epipolar line Δ1-2. The distance may merely be the orthogonal distance between the part and the line (e.g. in number of pixels).
Symmetrically, the part (the head in the example) of the second 2D skeleton 2D-SK2 can be projected 703 as a second epipolar line Δ2-1 on the first source image 13₁, and a second directional part distance ρδ2-1 can be computed 704 between the same part (the head in the example) of the first 2D skeleton 2D-SK1 and the second epipolar line Δ2-1.
The part distance ρδ between the head parts of the two 2D skeletons may then be selected 705 as the maximum of the first and second directional part distances: ρδ = max{ρδ1-2 ; ρδ2-1}. In a variant, the mean value of the two directional part distances can be selected.
Of course, to simplify the process, only one directional part distance can be computed and kept as part distance ρδ.
Optional step 706 may discard distances that are evaluated as too high to reflect 2D skeletons belonging to the same object. In this context, step 706 may comprise comparing the part distance ρδ with a predefined threshold (e.g. 20 pixels); if the part distance is above the predefined threshold, the part distance ρδ is discarded from the determination of the distance between the 2D skeletons. Discarding the part distance merely means it is not taken into account for the next steps described below.
Next at step 707, the skeleton distance δ between the two 2D skeletons (initially set to 0) is incremented by the obtained part distance ρδ (if not discarded). This step progressively computes skeleton distance δ.
At the same step, a part counter pc (initially set to 0) is incremented by 1 (if ρδ is not discarded) to count the number of parts taken into account in the calculation of δ.
These operations are repeated for each part of the 2D skeletons (i.e. up to thirteen parts in the example of Figure 2) by looping back to step 700 to select a new part.
When all the parts have been processed (test 708), the value δ is output 709 as the final skeleton distance between the two 2D skeletons. It means that skeleton distance δ is associated with the pair formed of the two 2D skeletons considered.
Optionally, at step 709, counter pc may be compared to a second predefined threshold (e.g. 8 for 13 parts in the model 20) to determine whether the two 2D skeletons are close enough. If pc is below the second predefined threshold, no skeleton distance is associated with the pair of 2D skeletons. For instance, skeleton distance δ is set to an infinite value.
By using this algorithm, a skeleton distance δ is computed for each pair of 2D skeletons coming from two different source images. For instance, all skeleton distances δ(2D-SK1j, 2D-SK2k) between a 2D skeleton 2D-SK1j determined from source image 13₁ and a 2D skeleton 2D-SK2k determined from source image 13₂ are known at the end of step 461.
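To make steps 700-709 concrete, here is a hedged Python sketch; F_12 and F_21 stand for the fundamental matrices M1-2 and M2-1, parts are assumed to be given as (x, y) pixel positions in the same order in both skeletons, and the thresholds (20 pixels, 8 parts) are the examples given above:

```python
import numpy as np

def directional_part_distance(F_12, p1, p2):
    """ρδ1-2: pixel distance between part position p2 in image 2 and the
    epipolar line obtained by projecting p1 with the fundamental matrix
    F_12 (image 1 -> image 2)."""
    line = F_12 @ np.array([p1[0], p1[1], 1.0])     # a*x + b*y + c = 0
    return abs(line @ np.array([p2[0], p2[1], 1.0])) / np.hypot(line[0], line[1])

def skeleton_distance(parts1, parts2, F_12, F_21,
                      part_threshold=20.0, min_parts=8):
    """Skeleton distance δ between two 2D skeletons (steps 700-709)."""
    delta, pc = 0.0, 0
    for p1, p2 in zip(parts1, parts2):              # same part order assumed
        pd = max(directional_part_distance(F_12, p1, p2),
                 directional_part_distance(F_21, p2, p1))   # step 705
        if pd <= part_threshold:                    # step 706
            delta += pd                             # step 707
            pc += 1
    return delta if pc >= min_parts else np.inf     # step 709
```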
The next step is step 462, which consists in determining matchings between pairs of 2D skeletons. This determination is based on these skeleton distances.
In embodiments, this is done using a graph, in order to obtain one or more one-to-one associations between a 2D skeleton 2D-SK1ⱼ determined from the first source image 13₁ and a 2D skeleton 2D-SK2ₖ determined from the second source image 13₂.
The graph may be built with nodes corresponding to the 2D skeletons of the two sets and with weighted links between nodes that are set based on the determined skeleton distances between the corresponding 2D skeletons. In this graph, a node (i.e. a 2D skeleton of a first set) is linked to a plurality of other nodes (2D skeletons of the other set). No link is set between nodes corresponding to 2D skeletons of the same set. A bipartite solving of this graph as introduced above outputs optimal one-to-one associations between 2D skeletons, meaning that a 2D skeleton 2D-SK1ⱼ of the first set is ultimately linked (i.e. matched) to at most one 2D skeleton 2D-SK2ₖ of the other set.
The bipartite solving may be based on the link weights only, meaning the one-to-one matchings correspond to the minimums of the sum of the link weights in the graph. Optionally, the nodes may be weighted using the robustness scores indicated above (in which case an appropriate formula between the node weights and the link weights is used).
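By way of illustration, such a bipartite solving based on the link weights only can rely on the Hungarian algorithm available in SciPy; the handling of infinite skeleton distances through a large sentinel cost is an implementation choice, not part of the described method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_skeletons(distances):
    """One-to-one matching of 2D skeletons between two views (step 462).
    distances[j, k] holds δ(2D-SK1j, 2D-SK2k); np.inf marks pairs for which
    no skeleton distance was associated."""
    cost = np.where(np.isfinite(distances), distances, 1e9)  # the solver needs finite costs
    rows, cols = linear_sum_assignment(cost)                 # minimises the sum of link weights
    return [(j, k) for j, k in zip(rows, cols)
            if np.isfinite(distances[j, k])]                 # drop forbidden assignments
```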
Once the matched 2D skeletons 405 (i.e. pairs of 2D skeletons) are known, a weak 3D skeleton W3D-SK is generated from each pair of matched 2D skeletons. This is step 463. This is to obtain plenty of weak 3D skeletons in the volume.
The 3D skeletons built at step 463 are said to be “weak” because they are not the final ones.
Step 463 of forming the 3D skeletons uses inter-view 2D triangulation to convert two matched 2D skeletons into a weak 3D skeleton. The 2D triangulation is performed part by part.
An exemplary implementation of this step is illustrated in Figure 8. It is made of three main sub-steps. The first sub-step consists in projecting a part (for instance a foot in the shown example) of a first 2D skeleton 2D-SK1ⱼ of the matching pair as a first line Δ₁ in 3D space (e.g. volume V representing the scene volume when it is defined). This projection corresponds, for instance, to the line shown in Figure 5 from a part in the source image; it is a geometrical operation based on the extrinsic parameters of the corresponding camera (here camera 12₁). The second sub-step consists in projecting the same part of the second 2D skeleton 2D-SK2ₖ of the matching pair as a second line Δ₂ in the 3D space. The third sub-step consists in determining a 3D position (e.g. a voxel V(X,Y,Z)) locating the part (here the foot) for the weak 3D skeleton W3D-SK, based on the first and second lines.
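A possible sketch of the two projection sub-steps (back-projecting a 2D part as a line, i.e. a ray, in the 3D space) is given below, assuming a pinhole model x ~ K(RX + t) with intrinsic matrix K and extrinsic parameters R, t; the names are illustrative.

```python
import numpy as np

def part_ray(part_xy, K, R, t):
    """Back-project a 2D part into a 3D ray (centre, unit direction) in
    world coordinates, as for lines Δ1 and Δ2."""
    x = np.array([part_xy[0], part_xy[1], 1.0])  # homogeneous pixel
    centre = -R.T @ t                            # camera centre in world coordinates
    direction = R.T @ np.linalg.solve(K, x)      # viewing direction in world coordinates
    return centre, direction / np.linalg.norm(direction)
```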
The two lines Δ₁ and Δ₂ rarely intersect each other at the same 3D position or the same voxel. If they intersect, the intersecting point or voxel is elected as representing the part considered. Otherwise, the 3D point or voxel closest to the two lines is preferably selected. The closeness can be evaluated using a least-squares distance approach.
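The least-squares selection of the point closest to the two lines can for instance take the midpoint of their common perpendicular, as sketched below (rays given as NumPy centre and unit direction pairs, e.g. as returned by part_ray above):

```python
def closest_point_to_rays(c1, d1, c2, d2):
    """3D point minimising the squared distance to two rays ci + s*di:
    the midpoint of their common perpendicular when they do not intersect."""
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = c1 - c2
    denom = a * c - b * b
    if abs(denom) < 1e-12:                 # near-parallel rays: no unique closest point
        return 0.5 * (c1 + c2)
    s = (b * (d2 @ w) - c * (d1 @ w)) / denom
    u = (a * (d2 @ w) - b * (d1 @ w)) / denom
    return 0.5 * ((c1 + s * d1) + (c2 + u * d2))
```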
These steps can be repeated for all the parts composing the 2D skeletons of the matching pair. As a result, a plurality of 3D positions locating, in the 3D space, a plurality of parts forming the weak 3D skeleton is obtained. The weak 3D skeleton W3D-SK is thus formed.
Step 463 is performed for each matching pair of 2D skeletons. The result is a plurality of weak 3D skeletons 406 built in the 3D space. Several weak 3D skeletons correspond to the same 3D object.
That is why these various weak 3D skeletons are then converted into one or more (final) 3D skeletons 3D-SK. This is step 465. Beforehand, a clustering of the weak 3D skeletons can be performed at step 464 in order to reduce the complexity of next step 465.
The clustering may be based on the 2D skeletons from which the weak 3D skeletons are built. For instance, the weak 3D skeletons sharing a common 2D skeleton can be grouped into the same cluster. Such a clustering approach aims at grouping the weak 3D skeletons liable to represent the same 3D object.
For illustrative purposes, it is assumed that a first weak 3D skeleton is built from source images 13₁ and 13₂ based on a matching pair between 2D-SK1₁ (from 13₁) and 2D-SK2₁ (from 13₂); a second weak 3D skeleton is built from source images 13₂ and 13₃ (not shown in the Figures) based on a matching pair between 2D-SK2₁ (from 13₂) and 2D-SK3₄ (from 13₃). The two weak 3D skeletons are grouped into the same cluster because they share the same 2D skeleton, namely 2D-SK2₁.
It is also assumed that a third weak 3D skeleton is built from source images 13₁ and 13₃ based on a matching pair between 2D-SK1₁ (from 13₁) and 2D-SK3₄ (from 13₃). This third weak 3D skeleton shares the same 2D skeletons as the first two weak 3D skeletons. In this context, the three weak 3D skeletons are coherent and can thus be grouped into the same cluster.
However, if the third weak 3D skeleton were built from a matching pair between 2D-SK1₁ (from 13₁) and 2D-SK3₆ (from 13₃) [thus no longer from 2D-SK3₄], it would also be grouped in the same cluster, as it shares 2D-SK1₁ with the first weak 3D skeleton. The remainder of the process as described below should be appropriate to limit the effect of this incoherent third weak 3D skeleton in the building of the final 3D skeleton.
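One way to sketch this clustering is a union-find over the identifiers of the source 2D skeletons, so that weak 3D skeletons transitively sharing a 2D skeleton fall into the same cluster; the sk2d_ids attribute and the identifier format are assumptions for illustration.

```python
def cluster_weak_skeletons(weak_skeletons):
    """Group weak 3D skeletons sharing a 2D skeleton (step 464). Each weak
    skeleton is assumed to carry the identifiers of its two source 2D
    skeletons, e.g. w.sk2d_ids == ('13_1/SK1', '13_2/SK1')."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x
    def union(x, y):
        parent[find(x)] = find(y)
    for w in weak_skeletons:
        union(w.sk2d_ids[0], w.sk2d_ids[1])  # link the two source 2D skeletons
    clusters = {}
    for w in weak_skeletons:
        clusters.setdefault(find(w.sk2d_ids[0]), []).append(w)
    return list(clusters.values())
```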
The combining of the weak 3D skeletons of a cluster into a “robust” 3D skeleton can be based on a spatial closeness criterion and performed part by part.
For instance, the 3D positions (or voxels) of the weak 3D skeletons locating/representing the same part can be converted into a unique 3D position (or voxel) for that part, i.e. a part forming the “robust” 3D skeleton 3D-SK. As the same process is repeated for each part, the parts forming 3D-SK are progressively built.
The conversion can be based on spatial closeness. For instance a RANSAC (RANdom SAmple Consensus) algorithm with a local/global fitting model can be applied. This is illustrated in Figure 9.
Let us consider variable N as the number (less 1) of parts forming the model 20 (at the end, N is 12 in the case of Figure 2), variable index_part as indexing each part type, table F_Position as defining the final 3D positions of the N parts, and table F_Inliers as defining the numbers of inliers for the N parts respectively.
Starting from the cluster 900 of weak 3D skeletons, step 901 initializes N to 0 and tables F_Position and F_Inliers to empty tables.
At step 902, two interim tables I_Position and I_Inliers are initialized to 0 for the current iteration.
At step 903, the 3D positions corresponding to the same part in the weak 3D skeletons are selected. This part is indexed by variable index_part 904.
For example, if ten weak 3D skeletons are considered, at most ten 3D positions corresponding to the heads of these skeletons are obtained. Of course, some weak 3D skeletons may be partial and not comprise the part currently considered. A RANSAC average 3D position is then calculated at step 905 from the selected 3D positions.
The RANSAC approach calculates a robust average 3D position as the average of selected inliers, i.e. of selected 3D positions. A 3D position is accepted as an inlier for the computation if its distance to the candidate average 3D position is below a threshold.
The number of inliers N_Inliers (i.e. the number of voxels that are close to the average 3D position calculated by the RANSAC algorithm) is then calculated. This is a functionality of the RANSAC algorithm.
If this number N_Inliers is higher than a given threshold (e.g. 5) and higher than the number of inliers already stored in F_Inliers[index_part], then the calculated average 3D position is accepted at step 906. This triggers an interim table updating step 907 during which the temporary position for the current part, i.e. I_Position[index_part], is set to the calculated average 3D position, and the temporary number of inliers for the current part, i.e. I_Inliers[index_part], is set to N_Inliers. This is to memorize, throughout the iterations, the 3D position calculated from the maximum number of inliers for each part. The next step is step 908.
Otherwise, if N_Inliers is less than the threshold or less than F_Inliers[index_part], no update is made and the process goes directly to step 908.
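A minimal sketch of the RANSAC average of step 905 follows; the iteration count and the inlier radius are illustrative parameters, not values from the described method.

```python
import numpy as np

def ransac_average_position(positions, n_iter=50, inlier_radius=0.1):
    """Robust average 3D position for one part: positions closer than
    inlier_radius (in scene units) to a randomly drawn candidate are
    inliers; the largest consensus set is averaged."""
    pts = np.asarray(positions, dtype=float)
    rng = np.random.default_rng()
    best_avg, best_n = None, 0
    for _ in range(n_iter):
        candidate = pts[rng.integers(len(pts))]        # random sample
        dist = np.linalg.norm(pts - candidate, axis=1)
        inliers = pts[dist < inlier_radius]
        if len(inliers) > best_n:                      # keep the largest consensus
            best_n, best_avg = len(inliers), inliers.mean(axis=0)
    return best_avg, best_n                            # average position and N_Inliers
```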
At step 908, the next part is considered and the process loops back to step 903 to consider all 3D positions corresponding to this next part.
When all the parts have been processed, a set of 3D positions for all parts is available and stored in table I_Position. The next step 909 then consists in checking whether the calculated 3D positions meet some morphological constraints defined by the 3D model 20.
These can be based on distances between 3D positions of parts.
The constraints may vary from one part to another. For instance, a common head-neck distance is higher than 10 cm but less than 40 cm, a common pelvis-knee distance is higher than 20 cm but less than 80 cm, and so on. A check may thus be performed on a part-to-part basis.
Alternatively it may merely be checked that all the 3D positions are comprised within the same constrained volume, e.g. a sphere with a radius of 2 meters for human beings.
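The part-to-part variant of the check of step 909 may be sketched as follows, with the distance limits (in centimetres) taken from the examples above; the table format and part names are assumptions for illustration.

```python
import numpy as np

def meets_morphological_constraints(positions, limits_cm):
    """Step 909 check: part-to-part distances must stay within the model
    limits. positions maps part names to 3D points (in cm); limits_cm maps
    part pairs to (min, max) distances."""
    for (p, q), (lo, hi) in limits_cm.items():
        if p in positions and q in positions:
            d = np.linalg.norm(np.asarray(positions[p]) - np.asarray(positions[q]))
            if not lo <= d <= hi:
                return False               # at least one constraint is violated
    return True

# Example limits using the values given above (illustrative):
LIMITS_CM = {('head', 'neck'): (10, 40), ('pelvis', 'knee'): (20, 80)}
```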
In the affirmative of step 909, the final positions and inliers tables 901 can be updated 910 with the interim tables: F_Position = I_Position and F_Inliers = I_Inliers for each part meeting the constraint of step 909. The next step is step 911.
In the negative, next step is step 911.
At step 911, a next iteration is started by looping back to step 902. The number of iterations can be predefined.
At the end (when test 912 of the last iteration is negative), F_Position defines 913 the final 3D positions of the parts forming the 3D model. This is the final “robust” 3D skeleton for the current cluster.
Figure 10 illustrates the result of such operations on three clusters of weak 3D skeletons W3D-SK to obtain, each time, a final robust 3D skeleton 3D-SK. The three bundles of weak 3D skeletons W3D-SK are shown on the left side of the Figure, while the three final robust 3D skeletons 3D-SK are shown on the right side.
Figure 10(A) shows a case where the weak 3D skeletons (of the same cluster) have low variability and are located at roughly the ‘same’ 3D positions. This situation mirrors an efficient matching between the 2D skeletons between the source images as well as a stable and accurate detection of the 2D skeletons from the source images.
Figure 10(B) shows a case where the weak 3D skeletons have higher variability but are located at roughly the ‘same’ 3D positions. This situation mirrors an efficient matching between the 2D skeletons from the source images, but an unstable detection of the 2D skeletons (or some parts of the 2D skeletons) from the source images.
Last, Figure 10(C) shows a case where the weak 3D skeletons have high variability and are not located at the ‘same’ 3D positions. This mirrors an inefficient matching between the 2D skeletons from the source images. Some generated weak 3D skeletons are even false 3D skeletons. However, despite these false 3D skeletons, the majority of the weak 3D skeletons are at roughly the right position.
All of this makes it possible to obtain a final and robust 3D skeleton 3D-SK as shown on the right part of Figure 10(A), Figure 10(B) and Figure 10(C).
Back to Figure 4, step 465 is thus performed on each cluster of weak 3D skeletons, thereby generating a plurality of robust 3D skeletons. A final, and optional, step 466 may consist in deleting therefrom duplicates or redundant 3D skeletons, i.e. allegedly robust 3D skeletons that correspond to the same 3D object.
The first sub-step consists in detecting such duplicates. Two approaches are proposed for illustrative purposes.
In one approach, a gravity center of each 3D skeleton 3D-SK is first computed, for instance as the iso-barycenter of all 3D positions (or voxels) of parts forming the 3D skeleton 3D-SK. Two 3D skeletons may be considered as duplicates or redundant if the distance between their gravity centers is below a predefined threshold.
In another approach, an average 3D distance between each pair of 3D skeletons is first computed. The average 3D distance may be the sum of part distances between the same (existing) parts of the two 3D skeletons. Two 3D skeletons may be considered as duplicates or redundant if their average 3D distance is below a predefined threshold.
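Both detection approaches reduce to little code; the gravity-centre variant is sketched below, the threshold value and the dictionary representation being assumptions for illustration.

```python
import numpy as np

def are_duplicates(sk_a, sk_b, centre_threshold=0.5):
    """Step 466 test: two 3D skeletons (part name -> 3D position) are
    duplicates when their iso-barycentres are closer than centre_threshold
    (in scene units)."""
    g_a = np.mean(list(sk_a.values()), axis=0)  # iso-barycentre of all parts
    g_b = np.mean(list(sk_b.values()), axis=0)
    return np.linalg.norm(g_a - g_b) < centre_threshold
```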
The next sub-step thus consists in selecting one of the 3D skeleton duplicates. For instance, the 3D skeleton having the highest number of parts is selected.
Some applications may require that the 3D skeleton or skeletons obtained at step 465 or 466 (thus generated using the process of the Figure) be displayed, for instance using the display screen 15. A 2D or 3D image of the 3D object or objects can thus be generated using the obtained 3D skeleton or skeletons.
Figure 11 illustrates, using a flowchart, such a process 1100 for displaying a 3D skeleton of one or more 3D real world objects observed by source cameras. This is an exemplary application using the generated 3D skeleton.
Step 1101 corresponds to generating a 3D skeleton of the 3D real world object using the teachings of the invention, e.g. using the process of Figure 4.
Step 1102 consists in selecting a virtual camera 12v. Such a camera does not actually exist; it is defined by a set of extrinsic and intrinsic parameters chosen by the user. These parameters define from which viewpoint, at which distance and with which focal length (i.e. zoom) the user wishes to view the scene.
Using these parameters of the virtual camera, the virtual image 13v can be computed at step 1103. This step merely consists in projecting the 3D skeleton or skeletons located in the 3D space onto a virtual empty image defined by the parameters of the virtual camera.
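A sketch of this projection, under the same pinhole model as above, is given below; lens distortion and points behind the virtual camera are ignored for simplicity, and the names are illustrative.

```python
import numpy as np

def project_to_virtual_image(points_3d, K, R, t):
    """Project 3D skeleton parts onto the virtual image (step 1103), K, R
    and t being the user-chosen parameters of the virtual camera."""
    pts = np.asarray(points_3d, dtype=float)
    cam = (R @ pts.T).T + t                # world -> camera coordinates
    uv = (K @ cam.T).T                     # camera -> homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]          # perspective division
```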
Next, the built virtual image 13v is displayed on the display screen 15 at step 1104.
Steps 1103 and 1104 ensure the display on a display screen of the generated 3D skeleton from the viewpoint of the virtual camera.
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a skilled person in the art which lie within the scope of the present invention.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (20)

1. A method for generating a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the following steps performed by a computer system: determining a set of one or more 2D skeletons of the 3D real world object or objects in each of simultaneous images of the 3D real world objects recorded by the cameras; for one or more pairs of the simultaneous images, matching each of one or more 2D skeletons of one of the two corresponding sets with at most one respective skeleton of the other set; and generating one 3D skeleton from the pairs of matched 2D skeletons.
2. The method of Claim 1, wherein determining a set of one or more 2D skeletons in an image includes: obtaining one or more scaled versions of the image, determining part probabilities for respective samples of the image and its scaled versions representing probabilities that the respective samples in the image or scaled version correspond to a part of the 3D real world object, and selecting only one of the image and its scaled versions based on their part probabilities from which selected image or scaled version the set of one or more 2D skeletons is determined.
3. The method of Claim 1, wherein determining a set of one or more 2D skeletons in an image includes: clustering samples of an image into clusters; determining, for one or more clusters, a cropping area encompassing the cluster in the image; obtaining one or more scaled versions of the cropping area in the image, determining part probabilities for respective samples of the cropping area in the image and its scaled versions representing probabilities that the respective samples in the cropping area or scaled version correspond to a part of the 3D real world object, and selecting only one from the cropping area in the image and the scaled versions of the cropping area based on their part probabilities from which selected cropping area or scaled version the set of one or more 2D skeletons is determined.
4. The method of Claim 1, wherein matching 2D skeletons of two images includes: determining a skeleton distance between the 2D skeletons and matching the 2D skeletons together depending on the skeleton distance.
5. The method of Claim 4, wherein determining a skeleton distance between the 2D skeletons includes determining a part distance between two corresponding parts of the 2D skeletons.
6. The method of Claim 5, further comprising repeating determining a part distance for other parts composing the 2D skeletons, and summing the determined part distances.
7. The method of Claim 5, wherein determining a part distance between two corresponding parts of the 2D skeletons includes: projecting a part of a first 2D skeleton determined from a first image as a first epipolar line on the second image, and calculating a first distance between the same part of a second 2D skeleton determined from the second image and the first epipolar line.
8. The method of Claim 7, wherein determining a part distance between two corresponding parts of the 2D skeletons further includes: projecting the part of the second 2D skeleton as a second epipolar line on the first image, and calculating a second distance between the same part of the first 2D skeleton and the second epipolar line.
9. The method of Claim 8, wherein the part distance between the two corresponding parts is the maximum distance between the first and second distances.
10. The method of Claim 5, wherein a part distance above a predefined threshold is discarded from the determining of the skeleton distance between the 2D skeletons.
11. The method of Claim 4, wherein matching the 2D skeletons includes using a graph to obtain one or more one-to-one associations between a 2D skeleton determined from a first image and a 2D skeleton determined from the second image, wherein nodes of the graph correspond to the 2D skeletons of the two sets and weighted links between nodes are set based on the determined distances between the corresponding 2D skeletons.
12. The method of Claim 1, wherein generating one 3D skeleton from the pairs of matched 2D skeletons includes: generating a weak 3D skeleton from each pair of matched 2D skeletons; and determining one or more 3D skeletons from the generated weak 3D skeletons.
13. The method of Claim 12, wherein generating a weak 3D skeleton from a pair of matched 2D skeletons includes: projecting a part of a first 2D skeleton of the pair as a first line in a 3D space; projecting the same part of the second 2D skeleton of the pair as a second line in the 3D space; and determining a 3D position locating the part for the weak 3D skeleton, based on the first and second lines.
14. The method of Claim 13, further comprising repeating the two steps of projecting and the step of determining for all parts composing the 2D skeletons of the pair, thereby obtaining a plurality of 3D positions locating, in the 3D space, a plurality of parts forming the weak 3D skeleton.
15. The method of Claim 12, wherein determining one or more 3D skeletons from the generated weak 3D skeletons includes converting 3D positions of the weak 3D skeletons locating the same part of the weak 3D skeletons into a unique 3D position for the part.
16. The method of Claim 15, wherein converting 3D positions includes applying a Random sample consensus algorithm.
17. A method for displaying a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the following steps performed by a computer system: generating a 3D skeleton of a 3D real world object using the method of Claim 1, selecting a viewpoint in 3D space, and displaying, on a display screen, the generated 3D skeleton from the viewpoint.
18. A non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method of Claim 1 or 17.
19. A system for generating a 3D skeleton of one or more 3D real world objects observed by cameras, comprising at least one microprocessor configured for carrying out the steps of: determining a set of one or more 2D skeletons of the 3D real world object or objects in each of simultaneous images of the 3D real world objects recorded by the cameras; for one or more pairs of the simultaneous images, matching each of one or more 2D skeletons of one of the two corresponding sets with at most one respective skeleton of the other set; and generating one 3D skeleton from the pairs of matched 2D skeletons.
20. A system for displaying a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the generating system of Claim 18 connected to a display screen, wherein the microprocessor is further configured for carrying out the steps of: selecting a viewpoint in 3D space, and displaying, on the display screen, the generated 3D skeleton from the viewpoint.
GB1806949.2A 2018-02-23 2018-04-27 3D Skeleton reconstruction from images using matching 2D skeletons Active GB2573170B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1806949.2A GB2573170B (en) 2018-04-27 2018-04-27 3D Skeleton reconstruction from images using matching 2D skeletons
US16/280,854 US11127189B2 (en) 2018-02-23 2019-02-20 3D skeleton reconstruction from images using volumic probability data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1806949.2A GB2573170B (en) 2018-04-27 2018-04-27 3D Skeleton reconstruction from images using matching 2D skeletons

Publications (3)

Publication Number Publication Date
GB201806949D0 GB201806949D0 (en) 2018-06-13
GB2573170A true GB2573170A (en) 2019-10-30
GB2573170B GB2573170B (en) 2021-12-29

Family

ID=62494933

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1806949.2A Active GB2573170B (en) 2018-02-23 2018-04-27 3D Skeleton reconstruction from images using matching 2D skeletons

Country Status (1)

Country Link
GB (1) GB2573170B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903379A (en) * 2019-03-05 2019-06-18 电子科技大学 A kind of three-dimensional rebuilding method based on spots cloud optimization sampling
CN113033242A (en) * 2019-12-09 2021-06-25 上海幻电信息科技有限公司 Action recognition method and system
CN114870407B (en) * 2022-04-29 2023-02-14 深圳市中视典数字科技有限公司 Digital human body data acquisition system and method based on virtual reality

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110292036A1 (en) * 2010-05-31 2011-12-01 Primesense Ltd. Depth sensor with application interface
WO2012098534A1 (en) * 2011-01-23 2012-07-26 Extreme Reality Ltd. Methods, systems, devices and associated processing logic for generating stereoscopic images and video
US20170213320A1 (en) * 2016-01-21 2017-07-27 Disney Enterprises, Inc. Reconstruction of articulated objects from a moving camera

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11748907B2 (en) 2019-01-22 2023-09-05 Fyusion, Inc. Object pose estimation in visual data
US11989822B2 (en) 2019-01-22 2024-05-21 Fyusion, Inc. Damage detection from multi-view visual data
US11783443B2 (en) 2019-01-22 2023-10-10 Fyusion, Inc. Extraction of standardized images from a single view or multi-view capture
US11727626B2 (en) 2019-01-22 2023-08-15 Fyusion, Inc. Damage detection from multi-view visual data
US11776142B2 (en) 2020-01-16 2023-10-03 Fyusion, Inc. Structuring visual data
US11972556B2 (en) 2020-01-16 2024-04-30 Fyusion, Inc. Mobile multi-camera multi-view capture
US11562474B2 (en) 2020-01-16 2023-01-24 Fyusion, Inc. Mobile multi-camera multi-view capture
AU2020436769B2 (en) * 2020-03-20 2023-09-28 Hinge Health, Inc. Method and system for matching 2D human poses from multiple views
WO2021186225A1 (en) 2020-03-20 2021-09-23 Wrnch Inc. Method and system for matching 2d human poses from multiple views
EP4121940A4 (en) * 2020-03-20 2023-12-27 Hinge Health, Inc. Method and system for matching 2d human poses from multiple views
US12008772B2 (en) 2020-03-20 2024-06-11 Hinge Health, Inc. Method and system for matching 2D human poses from multiple views
US11605151B2 (en) 2021-03-02 2023-03-14 Fyusion, Inc. Vehicle undercarriage imaging
US11893707B2 (en) 2021-03-02 2024-02-06 Fyusion, Inc. Vehicle undercarriage imaging

Also Published As

Publication number Publication date
GB201806949D0 (en) 2018-06-13
GB2573170B (en) 2021-12-29

Similar Documents

Publication Publication Date Title
US11127189B2 (en) 3D skeleton reconstruction from images using volumic probability data
GB2573170A (en) 3D Skeleton reconstruction from images using matching 2D skeletons
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Zhu et al. Vpfnet: Improving 3d object detection with virtual point based lidar and stereo data fusion
Whelan et al. Deformation-based loop closure for large scale dense RGB-D SLAM
WO2019157924A1 (en) Real-time detection method and system for three-dimensional object
Wang et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
KR20180026400A (en) Three-dimensional space modeling
Stier et al. Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion
Chen et al. 3d point cloud processing and learning for autonomous driving
Wang et al. 3d lidar and stereo fusion using stereo matching network with conditional cost volume normalization
Shen A survey of object classification and detection based on 2d/3d data
CN103765479A (en) Image-based multi-view 3D face generation
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
WO2023015409A1 (en) Object pose detection method and apparatus, computer device, and storage medium
GB2571307A (en) 3D skeleton reconstruction from images using volumic probability data
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Han et al. RO-MAP: Real-Time Multi-Object Mapping with Neural Radiance Fields
US11461956B2 (en) 3D representation reconstruction from images using volumic probability data
Yang et al. Monocular camera based real-time dense mapping using generative adversarial network
GB2573172A (en) 3D skeleton reconstruction with 2D processing reducing 3D processing
Jang et al. Two-Phase Approach for Monocular Object Detection and 6-DoF Pose Estimation
Elharrouss et al. 3D Point Cloud for Objects and Scenes Classification, Recognition, Segmentation, and Reconstruction: A Review
Wenzhi et al. FVLoc-NeRF: Fast Vision-Only Localization within Neural Radiation Field
Tung et al. Invariant surface-based shape descriptor for dynamic surface encoding