WO2008107553A2

WO2008107553A2 - Augmented reality method and devices using a real time automatic tracking of marker-free textured planar geometrical objects in a video stream

Info

Publication number: WO2008107553A2
Application number: PCT/FR2008/000068
Authority: WO
Inventors: Valentin Lefevre; Nicolas Livet
Original assignee: Total Immersion
Priority date: 2007-01-22
Filing date: 2008-01-18
Publication date: 2008-09-12
Also published as: US8374394B2; EP2132710A2; US8374395B2; US20100220891A1; JP2010517129A; US20130121531A1; US20130076790A1; FR2911707B1; FR2911707A1; US8315432B2; JP5137970B2; KR101328759B1; US8374396B2; US20120328158A1; US20130004022A1; EP4254333A3; KR20090110357A; US8824736B2; EP4254333A2; WO2008107553A3

Abstract

The invention relates to a method and to devices for the real-time tracking of one or more substantially planar geometrical objects of a real scene in at least two images of a video stream for an augmented-reality application. After receiving a first image of the video stream (300), the first image including the object to be tracked, the position and orientation of the object in the first image are determined from a plurality of previously determined image blocks (320), each image block of said plurality of image blocks being associated with a pose of the object to be tracked. The first image and the position and the orientation of the object to be tracked in the first image define a key image. After receiving a second image from the video stream, the position and orientation of the object to be tracked in the second image are evaluated from the key image (300). The second image and the corresponding position and orientation of the object to be tracked can be stored as a key image. If the position and the orientation of the object to be tracked cannot be found again in the second image from the key image, the position and the orientation of this object in the second image are determined from the plurality of image blocks and the related poses (320).

Description

Augmented reality method and devices using automatic, real-time tracking of textured planar geometric objects without markers in a video stream

The present invention relates to the combination of real and virtual images, also called augmented reality, and more particularly to a method and augmented reality devices using automatic real-time tracking of textured planar geometric objects, without markers, in a video stream. Augmented reality aims to insert one or more virtual objects in the images of a video stream. Depending on the type of application, the position and orientation of these virtual objects can be determined by data related to certain elements of the scene, for example the coordinates of a particular point of the scene such as the hand of a player, or by data external to the scene represented by the images, for example coordinates coming directly from a game scenario. When the position and the orientation are determined by data related to certain elements of the real scene, it may be necessary to track these elements according to the movements of the camera or the movements of these elements themselves in the scene. The operations of tracking elements and of embedding virtual objects in the real images can be executed by separate computers or by the same computer.

There are several ways to track elements in an image stream. Generally, element tracking algorithms, also known as target tracking algorithms, use a marker that can be visual or use other means such as radio frequency or infrared based means. Alternatively, some algorithms use pattern recognition to track a particular element in an image stream.

The Ecole Polytechnique Fédérale de Lausanne has developed a visual tracking algorithm that does not use a marker and whose originality lies in the pairing of particular points between the current image of a video stream and a key image, called a keyframe, obtained at system initialization, as well as a keyframe updated while visual tracking is running.

The objective of this visual tracking algorithm is to find, in a real scene, the pose, that is to say the position and the orientation, of an object whose three-dimensional mesh is available, or to find, by image analysis, the extrinsic position and orientation parameters of a camera filming this motionless object.

The current video image is compared with one or more recorded keyframes in order to find a large number of matches between these pairs of images and thus estimate the pose of the object. To this end, a keyframe is composed of two elements: a captured image of the video stream and a pose (orientation and position) of the real object appearing in this image. A distinction is made between "offline" keyframes and "online" keyframes. Offline keyframes are images extracted from the video stream in which the object to be tracked has been placed manually through the use of a pointing device such as a mouse or using an adjustment tool such as a Pocket Dial marketed by the company Doepfer. The offline keyframes preferably characterize the pose of the same object in several images. They are created and saved "offline", that is to say outside the steady-state operation of the application. Online keyframes are stored dynamically during execution of the tracking program. They are computed when the error, that is the distance between the paired points of interest, is small. Online keyframes replace the offline keyframes used to initialize the application. Their use aims to reduce the offset, also called drift, which can become significant when the camera and the object move too far from their initial relative position. Learning new online keyframes also makes the application more robust to variations in external lighting and in camera colorimetry. However, they have the disadvantage of introducing a "vibration" effect on the pose of the object over time. When a new online keyframe is learned, it replaces the previous keyframe, offline or online. It is then used as the current keyframe.

Each keyframe, offline or online, includes an image in which the object is present and a pose characterizing the location of that object, as well as a number of points of interest that characterize the object in the image. Points of interest are, for example, constructed from a Harris point detector and represent locations with high values of directional gradients in the image.
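As an illustration of what a Harris-type detector computes, the following plain-Python sketch evaluates a corner response on a small grayscale image and returns the strongest points. It is a simplified, unoptimized illustration and not the modified detector used by the tracking application; the function names, the 3x3 window and the constant k = 0.04 are assumptions made for this example.

```python
def harris_response(img, k=0.04):
    """Return a Harris corner response map for a grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # Central-difference gradients; border pixels keep a zero gradient.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            # Accumulate the structure tensor over a 3x3 window (assumed size).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            # Positive at corners, negative along edges, zero in flat areas.
            resp[y][x] = det - k * trace * trace
    return resp


def strongest_points(resp, count):
    """Return the (x, y) coordinates of the `count` highest responses."""
    scored = [(resp[y][x], x, y)
              for y in range(len(resp)) for x in range(len(resp[0]))]
    scored.sort(reverse=True)
    return [(x, y) for _, x, y in scored[:count]]
```

On an image containing a bright square on a dark background, the response is highest at the corner of the square, which is the kind of location retained as a point of interest.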

Before initializing the application, it is necessary to determine one or more offline keyframes. These are generally images extracted from the video stream, which contain the object to be tracked, and which are associated with a position and an orientation of the three-dimensional model of this object. For this, an operator visually matches a wireframe model to the real object. The manual preparation phase thus consists in finding a first estimate of the pose of the object in an image extracted from the video stream, which amounts to formalizing the initial affine transformation $T_{p \to c}$, the transition matrix from the reference frame associated with the tracked object to the reference frame attached to the camera. The initial affine transformation can be decomposed into a first transformation $T_{o \to c}$ relating to an initial position of the object, for example in the center of the screen, that is to say a transformation linked to the change of reference frame between the camera and the object, and a second transformation $T_{p \to o}$ relating to the displacement and rotation of the object from its initial position in the center of the screen to the position and orientation in which the object actually appears in the keyframe, where $T_{p \to c} = T_{p \to o} \cdot T_{o \to c}$. If the values α, β and γ correspond to the translation of the object from its initial position in the center of the image to its position in the keyframe and if the values θ, φ and ψ correspond to the rotation of the object from its initial position in the center of the image to its position in the keyframe along the x, y and z axes, the transformation $T_{p \to o}$ can then be expressed in the form of the following homogeneous matrix,

$$T_{p \to o} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \alpha \\ r_{21} & r_{22} & r_{23} & \beta \\ r_{31} & r_{32} & r_{33} & \gamma \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

where the coefficients $r_{ij}$ are the entries of the rotation matrix, each being a product of sines and cosines of the angles θ, φ and ψ.

The use of this model makes it possible to establish the link between the coordinates of the points of the three-dimensional model of the object expressed in the reference frame of the object and the coordinates of these points in the reference frame of the camera. During the initialization of the application, the offline keyframes are processed to position points of interest according to the parameters chosen at the launch of the application. These parameters are specified empirically for each type of use of the application; they make it possible to tune the match detection core and to obtain a better quality of pose estimation according to the characteristics of the real environment. Then, when the real object in the current image is in a pose that is close to the pose of the same object in one of the offline keyframes, the number of matches becomes large. It is then possible to find the affine transformation that registers the virtual three-dimensional model of the object on the real object.
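The homogeneous pose matrix described above can be sketched in a few lines of Python. This is only an illustration: the order in which the three rotations are composed (x, then y, then z) and the names `rotation_xyz`, `pose_matrix` and `transform_point` are assumptions of this example, not conventions stated in the description.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rotation_xyz(theta, phi, psi):
    """Rotation of theta about x, then phi about y, then psi about z (assumed order)."""
    cx, sx = math.cos(theta), math.sin(theta)
    cy, sy = math.cos(phi), math.sin(phi)
    cz, sz = math.cos(psi), math.sin(psi)
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


def pose_matrix(alpha, beta, gamma, theta, phi, psi):
    """Build the 4x4 matrix [R | (alpha, beta, gamma)] with last row 0 0 0 1."""
    r = rotation_xyz(theta, phi, psi)
    return [r[0] + [alpha], r[1] + [beta], r[2] + [gamma], [0.0, 0.0, 0.0, 1.0]]


def transform_point(t, point):
    """Apply a 4x4 pose matrix to a 3D point given as (x, y, z)."""
    x, y, z = point
    col = matmul(t, [[x], [y], [z], [1.0]])
    return (col[0][0], col[1][0], col[2][0])
```

For instance, `transform_point(pose_matrix(1, 2, 3, 0, 0, 0), (0, 0, 0))` moves the origin of the object to the point (1, 2, 3), which corresponds to a pure translation of the object.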

When such a match has been found, the algorithm goes into steady state. The displacements of the object are tracked from one frame to the next and any drift is compensated using the information contained in the offline keyframe retained during initialization and in the online keyframe computed during the execution of the application.

The tracking application combines two types of algorithms: a detection of points of interest, for example a modified version of Harris point detection, and a technique for reprojecting the points of interest positioned on the three-dimensional model onto the image plane. This reprojection makes it possible to predict the result of a spatial transformation from one frame to the next. These two combined algorithms allow robust tracking of an object with six degrees of freedom.

In general, a point p of the image is the projection of a point P of the real scene, with $p \sim P_I \cdot P_E \cdot T_{p \to c} \cdot P$, where $P_I$ is the matrix of intrinsic parameters of the camera, i.e. its focal length, the center of the image and the offset, $P_E$ is the matrix of the extrinsic parameters of the camera, i.e. the position of the camera in real space, and $T_{p \to c}$ is the transition matrix from the reference frame associated with the tracked object to the reference frame attached to the camera. Only the position of the object relative to the camera is considered here, which amounts to placing the reference frame of the real scene at the optical center of the camera. This results in the relation $p \sim P_I \cdot T_{p \to c} \cdot P$. Since the matrix $P_I$ is known, the tracking problem therefore consists of determining the matrix $T_{p \to c}$, i.e. the position and the orientation of the object with respect to the reference frame of the camera.
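The simplified projection relation, in which the image point is obtained from the intrinsic matrix and the object-to-camera pose alone, can be illustrated as follows. This is a minimal sketch assuming a 3x3 intrinsic matrix and a 4x4 pose matrix stored as nested lists; the function name `project` and the numerical values used in the usage note are hypothetical.

```python
def project(intrinsics, pose, point):
    """Project a 3D point of the object onto the image: p ~ P_I . T . P."""
    x, y, z = point
    # Object frame -> camera frame, using the first three rows of the 4x4 pose.
    xc, yc, zc = (row[0] * x + row[1] * y + row[2] * z + row[3]
                  for row in pose[:3])
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Camera frame -> homogeneous pixel coordinates, using the 3x3 intrinsics.
    u, v, w = (row[0] * xc + row[1] * yc + row[2] * zc for row in intrinsics)
    # The relation holds up to scale, so divide by the last coordinate.
    return (u / w, v / w)
```

With a focal length of 800 pixels, an image center at (320, 240) and an object placed 2 units in front of the camera, the origin of the object projects onto the image center, and a point shifted by 0.5 unit along x projects 200 pixels to the right of it.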

However, it is important to note that when the error measurement becomes too large, that is, when the number of matches between the current keyframe and the current image becomes too small, tracking stalls (the estimate of the pose of the object is considered to be no longer sufficiently consistent) and a new initialization phase is necessary.

The pose of an object is estimated according to the correspondences between the points of interest of the current image from the video stream, the points of interest of the current keyframe and the points of interest of the previous image of the video stream. These operations are called the matching phase. From the most significant correlations, the software calculates the pose of the object that best corresponds to the observations.

Figures 1 and 2 illustrate this tracking application.

Alternatively, during the phase of creation of offline keyframes, the pose of an object can be determined according to the configuration of its characteristic points. For this, image blocks centered on points of interest are generated from offline keyframes according to affine transformations or homographic deformations. These image blocks, as well as the image blocks obtained after transformation, are called patches. A patch can here be defined as an image block comprising a point of interest and which is associated with the pose of the corresponding object. The pose of each patch is calculated according to the transformation performed to obtain the corresponding image block. The patches are preferably arranged according to a decision tree to limit the calculation times during the execution of the tracking application. Thus, for each image from a video stream, the object tracking application determines certain points of interest of this image and compares the image blocks centered on these points with the previously created patches to determine the pose of the object in the image. However, this solution also induces a "vibration" effect on the pose of the object over time.

Proposed object tracking solutions for augmented reality applications often come from research and do not take into account the constraints of implementing commercial systems. In particular, the problems related to the robustness, the ability to quickly launch the application without requiring a manual phase of initialization, the detection of errors of the "stall" type (when the object to follow is "lost") and automatic and real-time reset after such errors are often left out.

The invention solves at least one of the problems discussed above.

The invention thus relates to a method for tracking in real time at least one substantially planar geometric object of a real scene in at least two substantially consecutive images of at least one video stream, in an augmented reality application, this method characterized in that it comprises the following steps,

receiving a first image of said at least one video stream, said first image comprising said at least one object to be tracked; determining the position and the orientation of said at least one object to be tracked in said first image from a plurality of predetermined image blocks, each image block of said plurality of image blocks being associated with a pose of said at least one object to be tracked, said first image and the position and the orientation of said at least one object to be tracked in said first image being called a key image;

receiving a second image of said at least one video stream, said second image comprising said at least one object to follow; and, evaluating the position and orientation of said at least one object to be followed in said second image from said key image.

The method according to the invention thus makes it possible to automate the initialization of an augmented reality application using automatic tracking, in real time, of textured planar geometric objects, without markers, in a video stream. This method also resets the application in case of stall, that is to say when the object to follow is lost.

According to a particular embodiment, the steps of receiving an image and evaluating the position and orientation of said at least one object to be tracked in said received image are repeated for the images of said at least one video stream that follow said second image, in order to track said at least one object in a sequence of images.

Advantageously, the position and orientation of said at least one object to be tracked in the received image are evaluated from said plurality of image blocks when the position and orientation of said at least one object to be tracked cannot be evaluated in said received image from said keyframe, to allow automatic reset when the object to be tracked is lost. Still according to a particular embodiment, the values of said keyframe are replaced by a received image and the evaluated position and orientation of said object to be tracked in said received image, to improve the tracking of said at least one object to be tracked.

Still according to a particular embodiment, the method further comprises a step of evaluating the pose of the image sensor from which said at least one video stream originates in a reference frame linked to said real scene, from the evaluated position of said at least one object to be tracked. Advantageously, the method further comprises a step of determining the movement of said image sensor. This embodiment makes it possible to track the displacement of said image sensor when said at least one object to be tracked is immobile in the scene observed by said image sensor. According to a particular embodiment, said at least one object to be tracked comprises a uniform color area adapted to implement chroma key technology for inserting an element into the uniform color area of the image. Still according to a particular embodiment, said at least one object to be tracked is tracked simultaneously in at least two separate video streams to allow, in particular, the implementation of collaborative applications.

Still according to a particular embodiment, the method further comprises a step of insertion of at least one element into at least one of said received images according to the evaluated position and orientation of said at least one object to be followed in said image received, said at least one element being selected from a list comprising at least one representation of at least one virtual object and at least one second video stream, in order to enrich the image issuing from said image sensor.

The invention also relates to a computer program comprising instructions adapted to the implementation of each of the steps of the method described above.

The invention also relates to a means for storing information, removable or not, partially or completely readable by a computer or a microprocessor comprising code instructions of a computer program for the execution of each of the steps of the previously described method.

The invention also relates to a device for tracking in real time at least one substantially planar geometric object of a real scene in at least two substantially consecutive images of at least one video stream, in an augmented reality application, this device characterized in that it comprises the following means,

means for receiving a first image of said at least one video stream, said first image comprising said at least one object to follow;

means for storing said first image in first storage means; means for determining the position and orientation of said at least one object to be tracked in said first image from a plurality of image blocks previously stored in second storage means, each image block of said plurality of image blocks being associated with a pose of said at least one object to be tracked, said pose being stored in said second storage means, the position and the orientation of said at least one object to be tracked in said first image being stored in said first storage means;

means for receiving a second image of said at least one video stream, said second image comprising said at least one object to be tracked; and,

means for evaluating the position and orientation of said at least one object to be followed in said second image from the data stored in said first storage means.

The device according to the invention thus makes it possible to automate the initialization or the reinitialization of an augmented reality application using an automatic tracking, in real time, of textured planar geometrical objects, without marker, in a video stream.

According to a particular embodiment, the device further comprises means for determining whether the position and orientation of said at least one object to be tracked can be evaluated in said second image from the data stored in said first storage means, said means for determining the position and orientation of said at least one object to be tracked in said first image from the data stored in said second storage means being adapted to evaluate the position and orientation of said at least one object to be tracked in said second image from the data stored in said second storage means. The device according to the invention thus makes it possible to automatically reset the application when the object to be followed is lost.

Still according to a particular embodiment, the device further comprises means for storing said second image and the position and orientation of said at least one object to be tracked in said second image in said first storage means to improve the tracking of said at least one object.

Still according to a particular embodiment, the device further comprises transformation means adapted to determine the pose of said at least one object to be tracked, or of the image sensor from which said at least one video stream originates, in one of the reference frames linked to said real scene, to said at least one object to be tracked or to said image sensor, in order to determine the relative movements of said at least one object to be tracked with respect to said image sensor in said real scene. According to a particular embodiment, the device further comprises means for inserting at least one element into at least one of said received images according to the evaluated position and orientation of said at least one object to be tracked in said received image, said at least one element being chosen from a list comprising at least one representation of at least one virtual object and at least one second video stream in order to enrich the image issuing from said image sensor.

Other advantages, aims and features of the present invention appear from the following detailed description, given by way of non-limiting example, with reference to the accompanying drawings in which:

- Figure 1 schematically represents the essential principles of the object tracking application developed by the Swiss Federal Institute of Technology in Lausanne;

- Figure 2 illustrates certain steps of the method for determining the pose of an object in an image of a video stream from keyframes and the previous image of the video stream;

- Figure 3 illustrates the general diagram of an object tracking algorithm according to the invention;

- Figure 4 shows an example of an apparatus making it possible to implement the invention at least partially;

- Figure 5, comprising Figures 5a and 5b, illustrates two examples of architectures that can be used when a mobile device for capturing and displaying images is used;

- Figure 6 shows an example of use of a mobile device for capturing and displaying images for an augmented reality application with object tracking; and,

- Figure 7 shows how a mobile device for capturing and displaying images can be used as a cursor or a motion sensor with six degrees of freedom.

The method according to the invention particularly relates to the automation of the initialization and reset phases after a stall of the object tracking application on images from a video stream. Figure 3 illustrates the overall diagram of the object tracking application embodying the invention.

As shown in FIG. 3, the object tracking application comprises three phases: a preparation phase (I), an initialization or reset phase (II) and an object tracking phase (III). The preparation phase (I) consists mainly in extracting the characteristic points of the object to be tracked in order to prepare a search tree. After a textured image of the object has been acquired (step 300), the points of interest of the image are found (step 305) according to a conventional algorithm such as, for example, a Harris point detector. A textured image of the object is preferably an image on which the object alone appears, such as a computer-generated image or an image in which the object has been cut out and the background removed.

When the points of interest of the object have been determined, the application extracts the image blocks centered on these points in order to generate patches. Patches are generated (step 310), for example, by random transformations based on translations, rotations, and scale changes. For example, around a point of interest $m_0$ of the image, it is possible to perform the affine transformation defined by the following relation,

$$(n - n_0) = H (m - m_0) + T(t_1, t_2)$$

where the point n is the transform of the point m, $T(t_1, t_2)$ corresponds to a translation around the point $m_0$, $t_1$ being a small vertical translation in the image and $t_2$ being a small horizontal translation, and $H = R_\alpha \cdot R_\beta^{-1} \cdot S(\lambda_1, \lambda_2) \cdot R_\beta$. $R_\alpha$ and $R_\beta^{-1}$ correspond to rotations along two orthogonal axes and $S(\lambda_1, \lambda_2)$ represents the change of scale. The parameters to be randomly varied are therefore $\alpha$, $\beta$, $\lambda_1$, $\lambda_2$, $t_1$ and $t_2$.
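The random affine transformation used to generate patches can be sketched as follows. This is an illustrative sketch only: the sampling ranges in `random_patch_params`, the choice of $n_0$ as the center of the generated patch and all function names are assumptions made for the example.

```python
import math
import random


def rot2(angle):
    """2x2 rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def mul2(a, b):
    """Product of two 2x2 matrices."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]


def affine_h(alpha, beta, lam1, lam2):
    """H = R_alpha . R_beta^-1 . S(lam1, lam2) . R_beta."""
    scale = [[lam1, 0.0], [0.0, lam2]]
    return mul2(rot2(alpha), mul2(rot2(-beta), mul2(scale, rot2(beta))))


def warp_point(m, m0, n0, h, t1, t2):
    """Apply (n - n0) = H (m - m0) + T(t1, t2); points are (x, y) tuples."""
    dx, dy = m[0] - m0[0], m[1] - m0[1]
    # t2 is the horizontal (x) shift and t1 the vertical (y) shift.
    return (n0[0] + h[0][0] * dx + h[0][1] * dy + t2,
            n0[1] + h[1][0] * dx + h[1][1] * dy + t1)


def random_patch_params(rng, max_shift=2.0):
    """Draw one random parameter set; the ranges are illustrative assumptions."""
    return {
        "alpha": rng.uniform(-math.pi, math.pi),
        "beta": rng.uniform(-math.pi / 2, math.pi / 2),
        "lam1": rng.uniform(0.6, 1.5),
        "lam2": rng.uniform(0.6, 1.5),
        "t1": rng.uniform(-max_shift, max_shift),
        "t2": rng.uniform(-max_shift, max_shift),
    }
```

Because the two rotations $R_\beta^{-1}$ and $R_\beta$ cancel in the determinant, the determinant of H equals $\lambda_1 \lambda_2$, which gives a simple sanity check on any generated transformation.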

Each patch is associated with a pose that is calculated according to the transformation applied to the image block to obtain the corresponding patch. A search tree is then constructed from the generated patches (step 315).

During the initialization phase (II), an offline keyframe is created from a first image from the video stream (step 320). This first image from the video stream is stored for use as an offline keyframe for tracking the object in subsequent images from the video stream. To determine the pose of the object in this first image, several points of interest are determined, for example the p points with the strongest directional gradient. The image block defined around each of these points is compared to the patches determined during the preparation phase, according to the search tree. The size of these image blocks is preferably equal to that of the patches. The pose associated with each of the patches most similar to the image blocks is used to determine the pose of the object in the image. The pose of the object can be defined as the average of the poses of each selected patch, that is to say of each patch most similar to each of the image blocks, or according to a voting mechanism. The pose thus determined is associated with the image from the video stream to form the offline keyframe. This offline keyframe is then used to initialize the tracking application (step 325). This process is fast and allows instant initialization.

Once the pose of the object has been determined in the first image and the current keyframe has been selected (the offline keyframe determined during the initialization phase), the tracking application can find the object (phase III) in the successive images of the video stream according to the tracking mechanism mentioned above (step 330). According to this mechanism, the displacements of the object (displacement of the object in the scene or displacement induced by the movement of the camera in the scene) are tracked from one frame to the next and any drift is compensated using the information contained in the offline keyframe retained during initialization and, optionally, in the online keyframe computed while the application is running.

The tracking application advantageously combines algorithms for detecting points of interest and for reprojecting the points of interest positioned on the three-dimensional model onto the image plane, in order to predict the result of a spatial transformation from one frame to the next. The pose of the object is thus estimated according to the correspondences between the points of interest of the current image from the video stream, the points of interest of the current keyframe and the points of interest of the previous image from the video stream, that is to say according to the pairing of the points of interest from these images.
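The pose determination of the initialization phase, in which each image block selects its most similar patch and the pose is chosen by a voting mechanism, can be sketched as follows. In this simplified illustration a linear search over all patches stands in for the search tree, similarity is measured by a sum of squared differences, and poses are represented by hashable labels; these choices and all names are assumptions of the example.

```python
from collections import Counter


def ssd(a, b):
    """Sum of squared differences between two flattened image blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_patch(block, patches):
    """Return the stored patch most similar to `block`.

    A linear search is used here in place of the search tree.
    """
    return min(patches, key=lambda patch: ssd(block, patch["pixels"]))


def vote_pose(blocks, patches):
    """Return the pose selected by the most image blocks, and its vote count."""
    votes = Counter(nearest_patch(block, patches)["pose"] for block in blocks)
    pose, count = votes.most_common(1)[0]
    return pose, count
```

The vote count returned alongside the pose could serve as a confidence measure, for example to reject an initialization supported by too few image blocks; the description itself does not specify such a threshold.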

When the error measurement becomes too large, i.e. when the number of matches between the current keyframe and the current image becomes too small, tracking stalls and a reset phase is necessary. The reset phase is similar to the initialization phase described above (steps 320 and 325). During this phase, the current image of the video stream is used to form the new offline keyframe, whose pose is determined according to its points of interest and the search tree comprising the patches determined during the preparation phase. The offline keyframe is therefore a dynamic offline keyframe that is updated automatically when the tracking application stalls.
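The alternation between initialization, tracking and reset described above can be summarized by a small control loop. This is a sketch under stated assumptions: `init_from_patches` and `track_from_keyframe` are hypothetical callbacks standing for the patch-based pose determination and the keyframe-based tracking, and the threshold `min_matches` is an arbitrary illustrative value.

```python
def run_tracker(frames, init_from_patches, track_from_keyframe, min_matches=10):
    """Process frames and return a list of (phase, pose) tuples.

    init_from_patches(frame) -> pose or None             (hypothetical callback)
    track_from_keyframe(frame, keyframe) -> (pose, n_matches)  (hypothetical)
    """
    keyframe = None  # (image, pose) pair used as the current keyframe
    log = []
    for frame in frames:
        if keyframe is None:
            # Initialization: pose determined from the patches alone.
            pose = init_from_patches(frame)
            if pose is not None:
                keyframe = (frame, pose)
            log.append(("init", pose))
            continue
        pose, n_matches = track_from_keyframe(frame, keyframe)
        if n_matches < min_matches:
            # Stall: the current image forms the new offline keyframe.
            pose = init_from_patches(frame)
            keyframe = (frame, pose) if pose is not None else None
            log.append(("reset", pose))
        else:
            log.append(("track", pose))
    return log
```

In this sketch the keyframe is only replaced on a stall; replacing it by online keyframes during normal tracking, as mentioned earlier in the description, would be an additional branch in the same loop.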

Figure 4 shows schematically an apparatus adapted to implement the invention. The device 400 is for example a microcomputer, a workstation or a game console.

The device 400 preferably comprises a communication bus 402 to which are connected:

a central processing unit or microprocessor 404 (CPU, Central Processing Unit);

a read-only memory 406 (ROM, Read Only Memory) which may include the operating system and programs such as "Prog";

a random access memory or cache memory 408 (RAM, Random Access Memory) with registers adapted to record variables and parameters created and modified during the execution of the aforementioned programs;

a video acquisition card 410 connected to a camera 412; and,

a graphics card 416 connected to a screen or to a projector 418.

Optionally, the apparatus 400 may also have the following elements:

a hard disk 420 which may comprise the aforementioned "Prog" programs and data processed or to be processed according to the invention;

a keyboard 422 and a mouse 424 or any other pointing device such as a light pen, a touch screen or a remote control enabling the user to interact with the programs according to the invention;

a communication interface 426 connected to a distributed communication network 428, for example the Internet network, the interface being able to transmit and receive data;

a data acquisition card 414 connected to a sensor (not shown); and,

a memory card reader (not shown) adapted to read or write data processed or to be processed according to the invention.

The communication bus allows communication and interoperability between the various elements included in the apparatus 400 or connected to it. The representation of the bus is not limiting and, in particular, the central unit is capable of communicating instructions to any element of the apparatus 400 directly or via another element of the apparatus 400. The executable code of each program allowing the programmable apparatus to implement the processes according to the invention can be stored, for example, in the hard disk 420 or in the read-only memory 406.

According to one variant, the executable code of the programs may be received via the communication network 428, via the interface 426, to be stored in the same manner as that described previously.

The memory cards can be replaced by any information medium such as, for example, a compact disc (CD-ROM or DVD). In general, the memory cards can be replaced by information storage means, readable by a computer or by a microprocessor, integrated or not integrated into the apparatus, possibly removable, and adapted to store one or more programs whose execution allows the implementation of the method according to the invention.

More generally, the program or programs may be loaded into one of the storage means of the device 400 before being executed.

The central unit 404 controls and directs the execution of the instructions or portions of software code of the program or programs according to the invention, instructions which are stored in the hard disk 420, in the read-only memory 406 or in the other aforementioned storage elements. On power-up, the program or programs that are stored in a non-volatile memory, for example the hard disk 420 or the read-only memory 406, are transferred into the random access memory 408, which then contains the executable code of the program or programs according to the invention, as well as registers for storing the variables and parameters necessary for the implementation of the invention.

It should be noted that the communication apparatus comprising the device according to the invention can also be a programmed apparatus. This apparatus then contains the code of the computer program or programs, for example fixed in an application-specific integrated circuit (ASIC).

Alternatively, the image from the video card 416 can be transmitted to the screen or the projector 418 through the communication interface 426 and the distributed communication network 428. Similarly, the camera 412 can be connected to a video acquisition card 410', separate from the device 400, so that the images from the camera 412 are transmitted to the device 400 through the distributed communication network 428 and the communication interface 426.

Because the automatic initialization and reinitialization process of the invention simplifies implementation, the object tracking application can be deployed without recourse to a specialist. The tracking application can be used in a standard way to track an object in a sequence of images from a video stream, for example to embed a video sequence on an object of the scene taking into account the position and the orientation of this object, but also to determine the motion of a camera from the analysis of a static object of the scene. In this case the object is part of the scenery, and finding the pose of this object in the scene therefore amounts to finding the pose of the camera relative to it. It then becomes possible to add virtual elements to the scene provided that the geometric transformation between the object and the geometric model of the scene is known, which is the case. This approach therefore makes it possible to augment the real scene (the part of the room that has been modeled) with animated virtual objects that move according to the geometry of the scene.

If Rf is the frame of reference associated with the tracked object, Rs is the frame of reference associated with the scene, Rc is the frame of reference associated with the camera and Rm is the frame of reference of the animated 3D model, it is necessary to define the transform $T_{f \to c}$ from the frame Rf to the frame Rc, using the (known) transform $T_{f \to s}$ from the frame Rf to the frame Rs and the transform $T_{s \to m}$ from the frame Rs to the frame Rm. The affine transformation that passes from the frame Rm associated with the virtual three-dimensional model to the frame Rc of the camera is determined by the following relation,

$$P_{Rc} = T_{f \to c} \cdot T_{f \to s}^{-1} \cdot T_{s \to m}^{-1} \cdot P_{Rm}$$

where $P_{Rc}$ is the transform of the point $P_{Rm}$ defined in the frame Rm of the three-dimensional model and $T_{i \to j}$ is the affine transformation that passes from the frame i to the frame j. The previous relation can be simplified into the following form,

$$P_{Rc} = T_{f \to c} \cdot T_{s \to f} \cdot T_{m \to s} \cdot P_{Rm}$$

that is,

$$P_{Rc} = T_{m \to c} \cdot P_{Rm}$$

The projection of the point $P_{Rm}$ according to the parameters of the camera, which can be expressed in the form of the matrix $P_I$, makes it possible to obtain the point $P'_{Rc}$ defined in the image from the camera. The point $P'_{Rc}$ is thus defined according to the following relation,

$$P'_{Rc} = P_I \cdot T_{m \to c} \cdot P_{Rm}$$

Thus, a synthetic three-dimensional object defined in the frame of the scene can be projected onto the current image of the video stream to augment the video stream with animated virtual objects.
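The chain of frame changes and the final projection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transforms are taken to be rigid 4x4 homogeneous matrices, the camera matrix is assumed to be a 3x4 pinhole matrix, and all function names and numerical values are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """4x4 homogeneous transform with identity rotation (for the example)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def invert_rigid(t: Matrix) -> Matrix:
    """Inverse of a rigid transform [R | d], computed as [R^T | -R^T d]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    d = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[0] + [d[0]], rt[1] + [d[1]], rt[2] + [d[2]],
            [0.0, 0.0, 0.0, 1.0]]


def model_to_camera(t_f_c: Matrix, t_f_s: Matrix, t_s_m: Matrix) -> Matrix:
    """T_{m->c} = T_{f->c} . T_{f->s}^{-1} . T_{s->m}^{-1}."""
    return matmul(t_f_c, matmul(invert_rigid(t_f_s), invert_rigid(t_s_m)))


def project(p_i: Matrix, t_m_c: Matrix,
            p_m: Sequence[float]) -> Tuple[float, float]:
    """Pixel coordinates of a model point: P_I . T_{m->c} . P_Rm."""
    hom = [[p_m[0]], [p_m[1]], [p_m[2]], [1.0]]
    x, y, w = (row[0] for row in matmul(p_i, matmul(t_m_c, hom)))
    return x / w, y / w


# Illustrative values only: pure translations and hypothetical intrinsics.
P_I = [[500.0, 0.0, 320.0, 0.0],
       [0.0, 500.0, 240.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]
T_m_c = model_to_camera(translation(0.0, 0.0, 5.0),   # T_{f->c} from tracking
                        translation(1.0, 0.0, 0.0),   # T_{f->s}, known
                        translation(0.0, 2.0, 0.0))   # T_{s->m}
u, v = project(P_I, T_m_c, (0.0, 0.0, 0.0))           # model origin
```

With these values the model origin lands at pixel (220, 40). In practice the transform from the tracked object to the camera would be refreshed for every image, while the two other transforms stay fixed.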

The tracking application can also be used as an interface with a mobile device such as a PDA (personal digital assistant), a mobile phone or any other device equipped with a video sensor. In particular, the application may consist of finding and then tracking in the image a previously learned textured rectangle, for example stored in an object database, in order to augment the real video stream with virtual models or secondary video streams locked onto the object. The main benefit of this technology is that the camera and the tracked object can be moved freely in the scene. The change-of-frame processing is identical to that described above.

The object tracking application is particularly robust to current low-quality standards such as the H.263 standard, often used for sending and receiving video streams to and from a telecom server. In addition, it is possible to send control and command information by using the keys of the mobile device, for example via DTMF (Dual-Tone Multi-Frequency) signaling over the telecom operator's infrastructure.

In this type of application, the object tracking and/or video stream enrichment processing can be local or remote. FIG. 5, comprising FIGS. 5a and 5b, illustrates two examples of architectures that can be used. FIG. 5a corresponds to a remote tracking technique. The mobile device 500 comprises a transmitter/receiver 505 for transmitting the video stream to the server 510, which comprises a transmitter/receiver 515. The server 510 has an object tracking and video stream enrichment application, so that the server 510 is adapted to receive one or more video streams from one or more mobile devices 500, track an object in the images of this stream, integrate a virtual object or a secondary video stream into these images and transmit the video stream thus modified to the mobile devices 500, which display it. FIG. 5b illustrates an alternative in which the object tracking and video stream enrichment application is integrated in the mobile device 500'. The server 510' comprises an application, for example a game, controlled by the commands of the mobile device 500'. The data exchanged between the server 510' and the mobile device 500' are control and command instructions as well as general information such as the results of the control and command instructions. The video stream does not need to be transmitted between the transmitters/receivers 505' and 515'.

Whatever the architecture used, it should be noted that the server receives information on the type of object to be tracked, the relative position of the camera with respect to the object to be tracked and, preferably, the various actions performed by the user.
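As background to the key-based commands mentioned above, DTMF signaling encodes each key as the sum of one low-group tone and one high-group tone. The sketch below lists the standard frequency grid; the mapping from keys to application commands is purely hypothetical and is not taken from the description.

```python
from typing import Tuple

# Standard DTMF grid: rows are the low-group tones, columns the high-group
# tones, both in hertz.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

# Hypothetical assignment of keys to application commands.
COMMANDS = {"2": "increase cone radius", "8": "decrease cone radius",
            "5": "select", "#": "reset reference position"}


def dtmf_pair(key: str) -> Tuple[int, int]:
    """Return the (low, high) tone pair that encodes a single key."""
    if len(key) == 1:
        for r, row in enumerate(KEYPAD):
            c = row.find(key)
            if c >= 0:
                return ROW_HZ[r], COL_HZ[c]
    raise ValueError(f"not a DTMF key: {key!r}")
```

A server decoding such tones would look the key up in a table like `COMMANDS`; which keys trigger which actions is an application choice.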

FIG. 6 shows an example of the use of a mobile device for capturing and displaying images for an augmented reality application with object tracking, using one of the architectures presented in FIG. 5. The mobile device 600 includes an image sensor (not shown) for acquiring a video stream from a real scene 610 and a screen 605. The real scene 610 includes an object 615 to be tracked, on which an illustration 620 serves as the texture. In this example, the scene containing the tracked object is displayed on the screen 605, and the position of this object in the scene makes it possible to add a virtual object such as the dialog 625 or the animated three-dimensional virtual object 630.

As shown in FIG. 7, the mobile device can also be used as a six-degree-of-freedom cursor or as a six-degree-of-freedom motion sensor according to the relative pose (position and orientation) of the mobile device with respect to the tracked object. This cursor or sensor can be used to control movements. Four modes of displacement can be envisaged.

According to a first, "viewfinder" mode of displacement, the mobile device simulates a pointing tool used to guide actions, to aim at and select zones or objects and possibly to move the selection. It is considered here that the planar texture is placed on a real flat surface such as a table. A target is displayed on the screen on the optical axis of the image sensor of the mobile device. This target is deformed according to the position of the camera because of its virtual projection onto the table. The target object projected onto the table is an ellipse or another two-dimensional geometric object. It is also necessary to determine the intersections between the direction given by the optical axis of the camera and the three-dimensional virtual objects placed on the table in order to be able to perform actions on these three-dimensional objects. Finally, it is important to be able to determine whether the virtual target attached to the table partially overlaps a virtual object in order to select it. The applications concerned by this pointing device are mainly video games, in particular simulation games, racing games and shooting games.

A first step is to express in the plane of the table the equation of the ellipse resulting from the intersection of the cone centered on the optical axis with the plane of the table. The radius of the cone is preferably configurable using the phone keys and is expressed linearly with distance according to the function $f(z_d) = r$, for example $f(z_d) = a z_d$, where $a$ is a parameter modifiable by the user and $z_d$ is the actual distance between the plane of the table and the camera. For the sake of clarity, it is here considered that the frame Rf of the tracked object is identical to that of the table, the xy plane corresponding to the plane of the table and the z axis being directed upwards. The plane of the table thus has the equation $z = 0$.

In the frame Rf, the axis of the cone, i.e. the optical axis, is defined by the position of the camera $P_c = [x_c \; y_c \; z_c]^T$ and by the vector $\vec{t} = [x_t \; y_t \; z_t]^T$. The intersection I of the axis of the cone with the plane of the table is thus determined as follows,

$$x_I = x_c - \frac{z_c}{z_t} \, x_t$$

$$y_I = y_c - \frac{z_c}{z_t} \, y_t$$

$$z_I = 0$$
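The intersection of the optical axis with the table plane z = 0 can be sketched as follows. The function name and the two guard conditions (axis parallel to the table, table behind the camera) are assumptions added for the illustration.

```python
from typing import Sequence, Tuple


def axis_table_intersection(p_c: Sequence[float], t: Sequence[float],
                            eps: float = 1e-9) -> Tuple[float, float, float]:
    """Intersect the ray p_c + lam * t with the plane z = 0 of the frame Rf."""
    x_c, y_c, z_c = p_c
    x_t, y_t, z_t = t
    if abs(z_t) < eps:
        raise ValueError("optical axis is parallel to the table")
    lam = -z_c / z_t  # ray parameter at which the z coordinate vanishes
    if lam <= 0:
        raise ValueError("the table is not in front of the camera")
    # Same as x_c - (z_c / z_t) * x_t and y_c - (z_c / z_t) * y_t.
    return (x_c + lam * x_t, y_c + lam * y_t, 0.0)


# Camera 4 units above the table, looking down and slightly along +x.
I = axis_table_intersection((1.0, 2.0, 4.0), (0.5, 0.0, -1.0))
```

Here `I` evaluates to (3.0, 2.0, 0.0).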

Knowing this point of intersection I, it is possible to deduce the distance between this point and the camera and thus to determine the radius b of the circle belonging to the cone whose center is this point of intersection I. It is then possible to deduce the following equation of the ellipse in the plane of the table, expressed in the frame of the tracked object,

$$\frac{(x \cos\gamma + y \sin\gamma - x_I)^2}{a^2} + \frac{(y \cos\gamma - x \sin\gamma - y_I)^2}{b^2} = 1$$

where γ represents the angle between the projections, in the plane of the table, of the y axes of the frames tied to the camera and to the tracked object. This relation makes it possible to represent the ellipse in the image and to determine whether an element of the image belongs to the ellipse, that is to say whether this element is selectable or not.
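A minimal selection test based on the ellipse relation might look as follows. It assumes that the radius b is obtained by applying the linear cone function to the distance between the camera and the intersection point, and it takes the other semi-axis a as a given parameter; the names `cone_radius`, `in_ellipse` and `slope` are hypothetical.

```python
import math
from typing import Sequence


def cone_radius(p_c: Sequence[float], i: Sequence[float],
                slope: float) -> float:
    """Radius of the cone at the intersection point: slope * |p_c - i|."""
    return slope * math.dist(p_c, i)


def in_ellipse(x: float, y: float, x_i: float, y_i: float,
               a: float, b: float, gamma: float) -> bool:
    """True if the table point (x, y) lies inside or on the target ellipse."""
    u = x * math.cos(gamma) + y * math.sin(gamma) - x_i
    v = y * math.cos(gamma) - x * math.sin(gamma) - y_i
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0


# Illustrative values: ellipse centered on (3, 2), semi-axes a = 2 and b = 1.
b = cone_radius((0.0, 0.0, 5.0), (0.0, 0.0, 0.0), slope=0.2)
selectable = in_ellipse(4.0, 2.0, 3.0, 2.0, a=2.0, b=b, gamma=0.0)
rejected = in_ellipse(3.0, 3.5, 3.0, 2.0, a=2.0, b=b, gamma=0.0)
```

A virtual object placed on the table would be treated as selectable when at least one of its footprint points passes this test.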

A second mode of movement allows the user to move instinctively in a virtual environment as if the camera was placed at the level of his own eyes. This mode of travel is particularly suitable for first person shooter type games and virtual museum visit type applications. Motion capture is performed from any reference position in three-dimensional space. This reference position can be changed at any time by a simple command. Small movements of the user with respect to this reference position are captured and transmitted to the application. This approach makes it possible to make displacements according to 6 degrees of freedom in a virtual environment.

The list of movements associated with these displacements with respect to the reference point can be defined as follows,

- the "forward" movement, respectively "backward", is identified when the camera moves toward, respectively away from, the tracked object along the optical axis of the camera;

- a lateral translation is identified when the camera is moved to the left or to the right along the horizontal axis perpendicular to the optical axis;

- elevation movements in the virtual scene are identified by an upward or downward translation of the camera;

- looking to the left or right is identified by a rotation of the camera about the vertical axis;

- looking up or down is identified by a rotation of the camera about the horizontal axis perpendicular to the optical axis; and,

- tilting the head to the left or right is identified when the camera rotates about the optical axis.
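The movement list above can be sketched as a simple classifier of the pose difference with respect to the reference position. The axis conventions, sign conventions, thresholds and movement labels below are all assumptions chosen for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoseDelta:
    """Camera pose minus reference pose, expressed along the camera axes."""
    tx: float = 0.0     # horizontal axis perpendicular to the optical axis
    ty: float = 0.0     # vertical axis
    tz: float = 0.0     # optical axis, positive toward the tracked object
    yaw: float = 0.0    # rotation about the vertical axis, radians
    pitch: float = 0.0  # rotation about the horizontal axis, radians
    roll: float = 0.0   # rotation about the optical axis, radians


def classify(d: PoseDelta, t_min: float = 0.02,
             r_min: float = 0.05) -> List[str]:
    """Name every movement whose magnitude exceeds its threshold."""
    rules = [
        (d.tz, t_min, "forward", "backward"),
        (d.tx, t_min, "step right", "step left"),
        (d.ty, t_min, "move up", "move down"),
        (d.yaw, r_min, "look right", "look left"),
        (d.pitch, r_min, "look up", "look down"),
        (d.roll, r_min, "tilt right", "tilt left"),
    ]
    moves = []
    for value, threshold, positive, negative in rules:
        if value > threshold:
            moves.append(positive)
        elif value < -threshold:
            moves.append(negative)
    return moves


moves = classify(PoseDelta(tz=0.1, yaw=-0.2))
```

The thresholds act as a dead zone so that small hand tremors around the reference position do not trigger movements.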

Naturally these movements are valid only if the tracked object is present in the field of the camera. In the opposite case, the last position is kept until the tracked object enters the field of the camera again.
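Keeping the last position while the tracked object is out of view can be sketched with a small holder object. The class name and the pose representation (a six-value tuple) are assumptions; a failed tracking result is represented here by `None`.

```python
from typing import Optional, Tuple

Pose = Tuple[float, float, float, float, float, float]


class PoseHolder:
    """Returns the latest valid pose, holding it while tracking fails."""

    def __init__(self, initial: Pose) -> None:
        self.last = initial

    def update(self, measured: Optional[Pose]) -> Pose:
        # None means the tracked object is not in the field of the camera.
        if measured is not None:
            self.last = measured
        return self.last


holder = PoseHolder((0.0, 0.0, 1.0, 0.0, 0.0, 0.0))
first = holder.update((0.1, 0.0, 1.0, 0.0, 0.0, 0.0))
held = holder.update(None)  # object lost: the previous pose is kept
```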

A third mode of movement allows the user to control the movement of a virtual object or character seen by the user. Such a mode of movement is particularly suitable for video games and exploration games. The motion capture is performed according to the difference in pose between the frame Rf of the tracked object and the frame Rc of the camera. The movements of the virtual object or character are defined as follows,

- the optical axis of the camera represents the scene as it is perceived by the virtual object or character;

- a translation along the horizontal axis perpendicular to the optical axis allows a lateral movement of the virtual object or character; and,

- the magnification of the scene (zoom on the virtual objects) is determined by the distance between the camera and the tracked object.
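The three rules of this third mode can be sketched as a mapping from the camera pose, expressed in the frame Rf of the tracked object, to a command for the virtual character. The inverse-distance zoom law, the heading convention and all names are assumptions; the description only states that the magnification depends on the distance.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CharacterCommand:
    heading: float  # direction of the optical axis in the table plane, radians
    lateral: float  # lateral displacement applied to the character
    zoom: float     # magnification of the scene


def character_command(p_c: Sequence[float], optical_axis: Sequence[float],
                      lateral_shift: float,
                      ref_distance: float = 1.0) -> CharacterCommand:
    """Derive a character command from the camera pose in the frame Rf."""
    distance = math.dist(p_c, (0.0, 0.0, 0.0))  # tracked object at the origin
    if distance == 0.0:
        raise ValueError("camera coincides with the tracked object")
    heading = math.atan2(optical_axis[1], optical_axis[0])
    return CharacterCommand(heading, lateral_shift, ref_distance / distance)


cmd = character_command((0.0, 0.0, 2.0), (0.0, 1.0, -1.0), lateral_shift=0.3)
```

With the camera twice as far as the reference distance, this hypothetical law halves the magnification.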

The rotations about the optical axis and the translations along the vertical axis do not, a priori, have a pre-established function and may correspond to a particular use depending on the type of application targeted.

The fourth mode of displacement is that in which the motion capture is derived directly from the pose difference between the frame Rc of the camera and the frame Rf of the tracked object. This mode aims to inspect a virtual object or virtual scene. It is thus possible to move around this element and to move closer to or away from it. This mode of displacement being very useful and very intuitive, it is particularly intended for educational applications, demonstrations and video games.

As mentioned, the system described makes it possible to improve the interactivity of many applications, especially in the field of games. The enrichment of the video stream combined with the controls and the cursor function makes it possible, for example, to create an interface adapted to each application. By way of illustration, the following example relates to a Tamagotchi-type game comprising different modes such as training, feeding and communication. The animal can sit up and beg when the user approaches it, become dizzy when the user circles quickly around it and follow the user when the user circles slowly around it. It can be beaten, for example to punish it, by quickly moving the camera from one side to the other. It can be congratulated by tapping on its head with an up-and-down movement. A key on the keypad is used to select a food, while a camera movement toward the animal throws the synthetic 3D object corresponding to this food. Different foods are available. It is also possible to pet the animal after moving a little closer to it. Different movements of the camera make it react differently. The animal can ask questions and the user can then answer yes or no (no from top to bottom, yes from right to left), and the answers are recorded. A scenario can be created to make the discussion more intelligent.

Another type of application concerns audiovisual presentations in which virtual objects, animated or not, or video streams are added in real time. Such applications are used in particular in the field of "broadcast" or "stand up cabinet". According to the invention, the presenter may, during the presentation, freely manipulate a board, in orientation and position, and allow the display of a video stream or virtual information on this board. This board preferably comprises one or more areas of uniform color corresponding to the location where the secondary video stream must be inserted, for example according to the chromakey technique. Virtual objects are inserted relative to the current pose of the board. To facilitate the tracking of this board, it can also contain a textured area, for example on the outside or in the center. The presenter is then able to launch a report from this board: a technician behind the scenes triggers the display of a video stream in the board and the presenter "launches" the subject by moving the board toward the camera. The window displaying the video stream then leaves the board and replaces the current video stream. An important aspect of this application is that it allows the presenter to move a hand in front of the board and partially occlude the video stream. The presenter can thus point to an element of the video stream presented on the board. To improve the robustness of the system, and in particular to avoid the jitter generally present in board tracking applications, the image dimensions of the secondary video stream are here greater than those of the area or areas of uniform color to be covered. Thus, a slight offset of the secondary video stream presented on the board does not reveal a uniform color area.
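The insertion of the secondary video stream into the uniform color area can be sketched as a per-pixel chroma key. The sketch assumes that the secondary image has already been warped to the current pose of the board and resampled to the size of the camera image; the key color, the tolerance and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def is_key(pixel: Pixel, key: Pixel, tol: int) -> bool:
    """True if every channel is within tol of the key color."""
    return all(abs(c - k) <= tol for c, k in zip(pixel, key))


def chroma_key(frame: Image, video: Image,
               key: Pixel = (0, 255, 0), tol: int = 40) -> Image:
    """Replace key-colored pixels of frame with the matching video pixels."""
    return [[v if is_key(f, key, tol) else f
             for f, v in zip(frame_row, video_row)]
            for frame_row, video_row in zip(frame, video)]


# One image row: key color, a skin-tone pixel (the hand), near-key color.
frame = [[(0, 255, 0), (200, 150, 120), (10, 250, 5)]]
video = [[(1, 1, 1), (2, 2, 2), (3, 3, 3)]]
composite = chroma_key(frame, video)
```

Because only key-colored pixels are replaced, a hand placed in front of the board stays visible over the video. Warping a secondary image that extends beyond the colored area means that a small tracking offset still finds video pixels under every key-colored pixel.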

Other applications such as virtual tours of a museum or building can also be easily implemented.

Although the previous examples are based on the use of a single camera, it is possible to use several cameras simultaneously and thus allow, for example, the cooperation of several users located in the same real environment. It is then necessary to work in a common frame of reference, such as the frame of one of the cameras or the frame of the tracked object, which is the same for the different cameras, and to perform a projection into the image from each camera according to the frame associated with each of the cameras.

It is necessary to determine the transformation that gives the relative position and orientation of the users. The following relation transforms the coordinates of a point expressed in the frame of camera n into coordinates expressed in the frame of camera 1,

$$P_{R_1} = T_{f \to c_1} \cdot T_{c_n \to f} \cdot P_{R_n}$$

where $P_{R_1}$ represents the coordinates of the point P in the frame of camera 1, $P_{R_n}$ represents the coordinates of the point P in the frame of camera n and $T_{i \to j}$ is the affine transformation that passes from the frame i to the frame j. The previous transformation can then be written in the following simplified form,

$$P_{R_1} = T_{c_n \to c_1} \cdot P_{R_n}$$

It then only remains to perform the projection according to the camera parameters (matrix $P_I$) in order to find the coordinates of the point $P_{R_n}$ in the image, which leads to the following relation,

$$P'_{R_1} = P_I \cdot T_{c_n \to c_1} \cdot P_{R_n}$$
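Passing through the frame of the tracked object, which is common to all cameras, the change of frame from camera n to camera 1 can be sketched as follows. The transforms are assumed to be rigid 4x4 homogeneous matrices returned by the tracking of the same object in each camera; the function names and numerical values are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def invert_rigid(t: Matrix) -> Matrix:
    """Inverse of a rigid transform [R | d], computed as [R^T | -R^T d]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    d = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[0] + [d[0]], rt[1] + [d[1]], rt[2] + [d[2]],
            [0.0, 0.0, 0.0, 1.0]]


def camera_n_to_camera_1(t_f_c1: Matrix, t_f_cn: Matrix) -> Matrix:
    """T_{cn->c1} = T_{f->c1} . T_{f->cn}^{-1}, via the tracked object."""
    return matmul(t_f_c1, invert_rigid(t_f_cn))


def transform(t: Matrix, p: Sequence[float]) -> Tuple[float, ...]:
    """Apply a 4x4 homogeneous transform to a 3D point."""
    hom = (p[0], p[1], p[2], 1.0)
    return tuple(sum(t[i][k] * hom[k] for k in range(4)) for i in range(3))


# Camera 1 sees the object 2 units ahead; camera n sees it 3 units ahead,
# rotated by a quarter turn about the optical axis (illustrative values).
T_f_c1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 0.0, 1.0]]
T_f_cn = [[0.0, -1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 3.0], [0.0, 0.0, 0.0, 1.0]]
p_c1 = transform(camera_n_to_camera_1(T_f_c1, T_f_cn), (1.0, 0.0, 3.0))
```

With these values the point (1, 0, 3) of camera n maps to (0, -1, 2) in the frame of camera 1. Projecting this result with the camera matrix then gives its position in the image of camera 1.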

The use of several cameras makes it possible, for example, to implement applications that let users move simultaneously while tracking the same object. Examples of such applications are games, for example car, airplane or motorcycle racing games. The control principle used for a racing game is that of the first mode of displacement, of "viewfinder" type, or of the third mode, which allows users to control the movement of the virtual objects they see. It allows several players located in the same real environment to compete on a three-dimensional circuit positioned relative to the object tracked in the real scene. Each user controls a vehicle, and the collaborative mode allows the different vehicles to interact within the application. It is also possible to use this feature for board games or parlor games. From a downloaded and printed board, a whole range of collaborative applications can be considered. Board games that use a board and pawns can be simulated by three-dimensional virtual objects. The keys of the keypad can then be used to act on the configuration of the game and to confront other players live. Similarly, sports games, for example of tennis type, can be implemented. For this type of application, the mode of displacement used is preferably the third mode, which allows users to control the movement of the virtual objects they see. The orientation of the line of sight of the camera thus gives the orientation of the character in the game. A key on the keypad advantageously "hits" the ball when it comes near the player. The reference position makes it possible to move forward and backward simply, by translating the camera along its line of sight. For lateral movements, the lateral displacements of the user are taken into account.

Naturally, to meet specific needs, a person skilled in the field of the invention may apply modifications to the foregoing description.

Claims

1. A method for real-time tracking of at least one substantially planar geometrical object of a real scene in at least two substantially consecutive images of at least one video stream, in an augmented reality application, the initialization of the method being automatic, this method being characterized in that it comprises the following steps,

- receiving a first image of said at least one video stream (300), said first image comprising said at least one object to be tracked;

- determining the position and orientation of said at least one object to be tracked in said first image (320) from a plurality of previously determined image blocks, each image block of said plurality of image blocks being associated with a pose of said at least one object to be tracked;

- creating a key image comprising said first image and the position and orientation of said at least one object to be tracked in said first image;

- receiving a second image of said at least one video stream (330), said second image comprising said at least one object to be tracked; and,

- evaluating the position and orientation of said at least one object to be tracked in said second image from said key image (330).

2. The method of claim 1 further comprising a prior step of determining said plurality of image blocks and said associated poses from at least one textured image of said object to be tracked.

3. Method according to claim 1 or claim 2 characterized in that the steps of receiving an image and evaluating the position and orientation of said at least one object to be tracked in said received image are repeated for images of said at least one video stream following said second image.

4. Method according to any one of claims 1 to 3 characterized in that if the position and the orientation of said at least one object to be tracked cannot be evaluated in a received image from said key image, the position and orientation of said at least one object to be tracked in said received image are evaluated from said plurality of image blocks (320).

5. Method according to any one of the preceding claims, characterized in that the values of said key image are replaced by a received image and the evaluated position and orientation of said object to be tracked in said received image.

6. Method according to any one of the preceding claims, characterized in that it further comprises a step of evaluating the pose of the image sensor from which said at least one video stream originates in a frame of reference tied to said real scene, from the evaluated position of said at least one object to be tracked.

7. The method of claim 6 characterized in that it further comprises a step of determining the movement of said image sensor.

8. Method according to any one of the preceding claims characterized in that said at least one object to be tracked comprises a uniform color area adapted to implement chromakey technology.

9. The method as claimed in claim 1, wherein said at least one object to be tracked is simultaneously tracked in said at least one video stream and in at least one other video stream, said at least one video stream and said at least one other video stream coming from at least two different image sensors, said method further comprising a step of estimating the relative position of said at least two image sensors.

10. Method according to any one of the preceding claims, characterized in that it further comprises a step of inserting at least one element into at least one of said received images according to the evaluated position and orientation of said at least one object to be tracked in said received image, said at least one element being selected from a list comprising at least one representation of at least one virtual object and at least one second video stream.

11. The method of any one of the preceding claims further comprising a step of determining at least one control command, said at least one control command being determined according to the variation of the pose of said object to be tracked.

12. Method according to any one of the preceding claims further comprising a step of transmitting at least one indication relating to the variation of the pose of said object to be tracked.

13. The method of any of claims 1 to 11, further comprising a step of acquiring said first and second images and a step of displaying at least a portion of said first and second images, said acquiring and displaying steps being implemented in a device separate from the device implementing said other steps.

14. Computer program comprising instructions adapted to the implementation of each of the steps of the method according to any one of claims 1 to 12.

15. Device for tracking in real time at least one substantially planar geometric object of a real scene in at least two substantially consecutive images of at least one video stream, in an augmented reality application, the initialization of said tracking being automatic, said device being characterized in that it comprises the following means,

- means for receiving a first image of said at least one video stream, said first image comprising said at least one object to be tracked;

- means for storing said first image in first storage means;

- means for determining the position and orientation of said at least one object to be tracked in said first image from a plurality of image blocks previously stored in second storage means, each image block of said plurality of image blocks being associated with a pose of said at least one object to be tracked, said pose being stored in said second storage means, the position and the orientation of said at least one object to be tracked in said first image being stored in said first storage means;

- means for receiving a second image of said at least one video stream, said second image comprising said at least one object to be tracked; and,

16. Device according to claim 15 characterized in that it further comprises means for determining whether the position and the orientation of said at least one object to be tracked can be evaluated in said second image from the data stored in said first storage means, said means for determining the position and orientation of said at least one object to be tracked in said first image from the data stored in said second storage means being adapted to evaluate the position and orientation of said at least one object to be tracked in said second image from the data stored in said second storage means.

17. Device according to claim 15 or claim 16 characterized in that it further comprises means for storing said second image and the position and orientation of said at least one object to be tracked in said second image in said first storage means.

18. Device according to any one of claims 15 to 17 characterized in that it further comprises transformation means adapted to determine the pose of said at least one object to be tracked or of the image sensor from which said at least one video stream originates, in one of the frames of reference tied to said real scene, to said at least one object to be tracked or to said image sensor.

19. Device according to any one of claims 15 to 18 characterized in that it further comprises means for inserting at least one element into at least one of said received images according to the evaluated position and orientation of said at least one object to be tracked in said received image, said at least one element being selected from a list comprising at least one representation of at least one virtual object and at least one second video stream.

20. Device according to any one of claims 15 to 19 further comprising means for receiving said first and second images from at least one mobile phone.