WO2009036831A1

WO2009036831A1 - Device and method for aligning a 3d object in an image corresponding to a field of view of a recording device

Info

Publication number: WO2009036831A1
Application number: PCT/EP2008/005782
Authority: WO
Inventors: Peter Eisert; Philipp Fechteler; Jürgen Rurainsky
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2007-09-14
Filing date: 2008-07-15
Publication date: 2009-03-26
Also published as: DE102007043836B3

Abstract

The invention relates to a device (20) for aligning a 3D object in an image corresponding to a field of view of a recording device (12), comprising a unit (21) for segmenting the image (22) into foreground and background in order to obtain a first silhouette image (23), a unit (24) for synthesizing a second silhouette image (25) of the 3D object in a starting position, and a unit (26) for estimating alignment parameters (27) for aligning the 3D object from the starting position based on deviations between the first and second silhouette images.

Description

Device and method for aligning a 3D object in an image corresponding to a field of view of a recording device

The present invention relates to methods and apparatus for image analysis and synthesis, and more particularly to methods and apparatus for aligning and determining the orientation of a 3D object in an image corresponding to a field of view of a recording apparatus.

For a virtual fitting of, for example, individualized shoes, a computer-aided extension of the perception of reality (augmented reality) is used, among other things, which combines real images or video with virtual 3D objects represented by 3D computer graphics models.

To give a person an idea of what a garment, such as a shoe, will look like after it has been made, the person can, in existing systems, step in front of a so-called virtual mirror. A recording device, such as a camera, records the person wearing special try-on shoes of a standard design. A reproduction device, for example a monitor, replaces a real mirror and outputs a horizontally mirrored camera image. The monitor is mounted such that the person, or parts of the person's body, appear at least approximately where the person would expect to see them in a real mirror. To enhance the virtual impression, the background of the recorded images is separated from the image foreground and replaced by a synthetic environment. For the virtual fitting, the position and orientation of the relevant body parts are estimated. Once these are known, computer graphics models (e.g., of garments) are rendered and integrated into the video sequence so that real garments are replaced by the corresponding virtual garments. In the following, rendering refers to the generation of a digital image from an image description. For the person to be able to move freely during the virtual fitting, their movement must be estimated, and the virtual objects or garments must be positioned and oriented in the video sequence according to the estimated movement.

It is therefore the object of the present invention to provide an improved concept for aligning a 3D graphics object in a video image.

The object is achieved by a device according to claim 1, a method according to claim 13 and a computer program according to claim 14.

The finding of the present invention is that a 3D graphics object can be aligned in a video image or video sequence by generating silhouette images both from an image synthesized from the 3D graphics object and from the foreground of the recorded image or video sequence. By superimposing the individual silhouette images and determining their deviations from one another, the silhouette of the 3D object can be adapted to the silhouette of the real image, at least in a subregion of interest. According to embodiments, this is done by means of a gradient-based approach that uses the so-called optical flow equation. For a region of interest in which a first silhouette image of the real image and a second silhouette image of the image synthesized from the 3D object are to be matched, it is determined how pixels of the synthesized image must be displaced in order to align the respective silhouettes in the region of interest.

Thus, with embodiments of the present invention in a video image, for example, virtual shoes can be placed over real existing shoes, thus effecting a virtual fitting of the virtual shoes. In this case, a person can move freely in front of a recording device. A virtual fitting of other clothing or accessories, jewelry, hairstyles is of course also possible.

To this end, the present invention provides a device for aligning a 3D object in a recorded image corresponding to a field of view of a recording device, with means for segmenting the recorded image into foreground and background to obtain a first silhouette image, means for synthesizing a second silhouette image of the 3D object in a starting position, and means for estimating alignment parameters for aligning the 3D object from the starting position based on deviations between the first and second silhouette images.

According to embodiments, the recording device is a camera for two-dimensional recording of video sequences with a predetermined resolution in the horizontal and vertical directions. The 3D object in embodiments of the present invention is a 3D model of a shoe, in particular a sports shoe. Thus, embodiments of the present invention may serve to enable a virtual fitting of shoes, in particular sports shoes. Methods for aligning the 3D object according to embodiments are implemented such that they align the 3D object in the recorded image in real time, in order to keep pace with the movement of a person in front of the recording device. Real time here refers to the time that processes take in the "real world".

To make the alignment of the 3D object, i.e., the matching of the first and second silhouette images in the image area of interest, as reliable as possible, the two silhouette images are each filtered with a low-pass filter in embodiments of the present invention, in order to transform abrupt silhouette edges into linear ramps with constant intensity gradients.

An advantage of the present invention is that the movement of body parts can be estimated with low complexity and transferred to computer graphics models. The low complexity allows body movements and 3D object movements to be matched in real time.

Preferred embodiments of the present invention are explained below with reference to the accompanying drawings, in which:

Fig. 1 shows a schematic representation of a virtual mirror as a possible application of exemplary embodiments of the present invention;

Fig. 2 shows a block diagram of a device for aligning a 3D object according to an embodiment of the present invention;

Fig. 3a shows a schematic representation of a silhouette image of two legs and shoes according to an embodiment of the present invention;

Fig. 3b shows a schematic representation of a vertical intensity histogram according to an embodiment of the present invention;

Fig. 3c shows a schematic representation of a horizontal intensity histogram according to an embodiment of the present invention;

Fig. 4 shows a superposition of a first silhouette image and a second silhouette image in a starting position according to an exemplary embodiment of the present invention;

Fig. 5 shows a diagram for explaining the principle of alignment parameter estimation according to an embodiment of the present invention;

Fig. 6 shows a perspective projection in which 3D coordinates of a 3D object point are projected into an image plane;

Figs. 7a, b show two examples of a shoe rendering with some removed shoe parts according to an embodiment of the present invention; and

Fig. 8 shows recorded real images and the corresponding virtually augmented images with individualized shoes.

With regard to the following description, it should be noted that in the different embodiments, identical or equivalent functional elements have the same reference numerals and thus the descriptions of these functional elements in the embodiments illustrated below are interchangeable.

Fig. 1 shows schematically a system 10 for the realization of a virtual mirror, in which embodiments of the present invention may find application.

The system 10 comprises a camera 12, a device 14 for processing images recorded with the camera 12, and an output device 16 for outputting a virtual mirror image derived from an image recorded with the camera 12. The virtual mirror image is augmented by computer, for example with virtual garments such as shoes.

The camera 12, which may be, for example, an XGA (Extended Graphics Array) FireWire camera (FireWire = i.Link or IEEE 1394), is mounted close to the monitor 16. For use in the shoe-fitting system 10, the camera 12 is directed downward to record the feet of a person standing on a floor 18 in front of the system 10. The person's legs, which belong to the foreground of the real image recorded by the camera 12, are separated from the background of the recorded image in the device 14 for processing and are reproduced on the monitor 16 after the recorded image has been horizontally mirrored. The position of the monitor 16 and the viewing direction of the camera 12 are chosen such that an average person sees approximately the same on the monitor 16 as they would in a real mirror mounted in the same position as the monitor 16.

For example, the floor 18 in front of the camera 12 is kept green or blue so that so-called chroma keying techniques can be applied, which facilitates the segmentation of foreground and background under changing lighting and with arbitrary clothing colors. In film and television technology, chroma keying refers to processes that make it possible to subsequently place objects or persons in front of a background that can contain either a real film recording or computer graphics. An additional light source below the camera 12 can reduce effects caused by shadows. Image processing methods, motion tracking, rendering and augmented reality are implemented in the processing device 14. In embodiments of the present invention, the processing device 14 may be, for example, a personal computer. In embodiments of the present invention, the processing device 14 includes a server that allows control of the system 10 and interfaces with a configuration database.

The device 14 comprises, according to exemplary embodiments, a device 20 for aligning a 3D object in an image corresponding to a field of view of the camera 12, which is shown schematically in FIG. 2.

The device 20 comprises a device 21 for segmenting the camera image 22 recorded by the camera 12 into foreground and background in order to obtain a first silhouette image 23. Furthermore, the device 20 comprises a device 24 for synthesizing a second silhouette image 25 of the 3D object in a starting position. The first silhouette image 23 and the second silhouette image 25 form the inputs of a device 26 for estimating alignment parameters 27 for aligning the 3D object from the starting position based on deviations between the first silhouette image 23 and the second silhouette image 25.

The (calibrated) camera 12 continuously records the space in front of the system 10 and transmits the recorded camera images 22, for example with a resolution of 1024 × 768 pixels, to the means 21 for segmentation. All automatic camera controls are switched off in order to avoid unexpected behavior, for example after a change in lighting. To avoid interference with artificial ambient lighting, the shutter speed of the camera 12 is synchronized with the flicker frequency of the ambient lighting. According to one embodiment, whenever no one is near the camera 12, the exposure is recalculated and the camera gain is adjusted accordingly, so that the camera 12 adapts to changing illumination.

An idle state of the system 10 is determined by a change detector that uses information about spatio-temporal variations in the video signal 22 provided by the camera 12. After the camera exposure has been adjusted to the current ambient light situation, a background image is calculated in exemplary embodiments by, for example, averaging ten consecutive video images. This background image is used by the segmentation device 21 to separate the mainly green or blue background from the shoes and legs in the foreground of the recorded camera image 22.
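The background averaging described above can be sketched as follows. This is only an illustration under assumptions: frames are represented as nested lists of (r, g, b) tuples, and the function name `average_background` is hypothetical, not taken from the patent.

```python
def average_background(frames):
    """Pixel-wise mean of equally sized RGB frames (rows of (r, g, b) tuples)."""
    count = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [
            # Average each colour channel over all frames at this pixel.
            tuple(sum(frame[y][x][c] for frame in frames) / count for c in range(3))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

In the setting of the text, `frames` would hold ten consecutive images of the empty scene recorded while the system is idle.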

In order to account for real-time constraints, the means 21 for segmentation is adapted to scale an image resolution of the recorded camera images 22. As a result, the image signal processing can take place in a so-called image pyramid. For this purpose, the recorded camera image 22 is filtered and, for example, downscaled four times in succession by a factor of 2, until a resolution of, for example, 64 × 48 pixels is achieved. Other scaling factors and resolutions are of course also conceivable.
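A minimal sketch of such an image pyramid is given below. It assumes a grayscale image stored as a list of rows and uses 2 × 2 block averaging as the filter before decimation; the patent does not specify the filter, so this choice and the function names are assumptions.

```python
def downscale_by_two(image):
    """Halve width and height by averaging each 2x2 block (simple low-pass plus decimation)."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def build_pyramid(image, levels=4):
    """Return [full resolution, 1/2, 1/4, ...] containing `levels` downscaled images."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downscale_by_two(pyramid[-1]))
    return pyramid
```

With four levels, a 1024 × 768 input ends at 64 × 48 pixels, which matches the example resolution given in the text.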

According to embodiments of the present invention, the means 21 for segmentation is adapted to separate the foreground and background of the recorded camera image 22 by first separating the background from the foreground in an image downscaled relative to the camera image 22, based on background information and knowledge of the background color and possible shadow influences, to obtain a low-resolution silhouette image, and by then detecting the silhouette edges of the first silhouette image 23 at the resolution of the camera image 22 based on the low-resolution silhouette image and the background information. In other words, the segmentation begins, for example, with an image scaled down to 64 × 48 pixels, in which all pixel colors of the recorded image are compared with the corresponding pixel colors of the previously calculated background image. To decide whether a pixel belongs to the foreground or the background, an RGB color table (RGB = red, green, blue) with 64³ entries is used according to embodiments. The RGB color space can be represented schematically as a cube. This color cube is adaptively filled with the green background pixels. To handle shadows and reflections on the floor 18, the resulting shape of the background pixels in the RGB color cube is extended by cylinder- and cone-like models. After the pixels have been classified as foreground or background, small holes are filled and small areas are removed until only the two legs with the shoes remain. The resulting silhouette image, or segmentation mask, is then passed on to the higher resolution levels of the image pyramid. There, only those image areas are segmented into foreground and background that originate from the edge areas of the silhouette image of the next lower resolution level of the image pyramid. Here, edge area means the border area between image foreground and image background. This procedure is repeated until the original resolution (1024 × 768) is reached, so that segmentation masks, i.e., first silhouette images, are obtained for each resolution level of the image pyramid.
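The color-table lookup can be illustrated with the following sketch. It assumes 8-bit RGB input quantized to 6 bits per channel, which gives 64³ table cells; the cylinder- and cone-like extensions for shadows and reflections, the hole filling and the coarse-to-fine refinement are deliberately left out. All names are hypothetical.

```python
BITS = 6                      # 2**6 = 64 bins per channel, i.e. 64**3 table entries
SIZE = 1 << BITS


def table_index(r, g, b):
    """Map an 8-bit RGB colour to its cell in the 64 x 64 x 64 colour cube."""
    shift = 8 - BITS
    return ((r >> shift) * SIZE + (g >> shift)) * SIZE + (b >> shift)


def build_background_table(background_pixels):
    """Mark every cube cell that is hit by an observed background colour."""
    table = bytearray(SIZE ** 3)
    for r, g, b in background_pixels:
        table[table_index(r, g, b)] = 1
    return table


def segment(image, table):
    """Return a binary silhouette: 1 = foreground, 0 = background."""
    return [[0 if table[table_index(*pixel)] else 1 for pixel in row] for row in image]
```

A table lookup of this kind costs one array access per pixel, which is one plausible reason for preferring it over a per-pixel distance computation under real-time constraints.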

For the sake of clarity, only a first and a second silhouette image are discussed below; these can refer to any resolution level of the image pyramid.

According to exemplary embodiments of the present invention, the device 21 comprises a device for determining an area in the first silhouette image 23 at which the 3D object is to be aligned. For this purpose, in embodiments, the means for determining the area is adapted to determine intensity distributions in the horizontal and vertical dimension in the first silhouette image in order to obtain coordinates for the starting position of the 3D object therefrom. For this purpose, horizontal and vertical intensity histograms can be calculated, which can also be used to determine if a person has entered the field of view of the camera 12.

A silhouette image of two legs and shoes is shown schematically in Fig. 3a. Fig. 3b schematically shows the vertical intensity histogram that results from the silhouette image of Fig. 3a. Accordingly, Fig. 3c shows the horizontal intensity histogram resulting from the silhouette image of Fig. 3a.

From the vertical histogram shown in Fig. 3b, an onset of intensity values can be recognized at a y-coordinate $y_1$, which thus serves as an indication of the position of the feet, which in this example stand at a common vertical height.

From the horizontal histogram shown in Fig. 3c, two areas $x_1$ to $x_2$ and $x_3$ to $x_4$ with increased intensity can be made out. These two areas correspond to the two legs and feet. Thus, the left toe can be determined from the coordinates $(x_1, y_1)$ and the right toe from the coordinates $(x_4, y_1)$. That is, according to embodiments, the means for determining the area is adapted to determine the coordinate $y_1$ for the starting position of the 3D object in the vertical direction from an abrupt increase or decrease in intensity in the vertical direction in a lower portion of the first silhouette image 23, and a coordinate $x_1$ or $x_4$ for the starting position of the 3D object in the horizontal direction from an abrupt increase or decrease in intensity in the horizontal direction in the first silhouette image 23. Alternatively, two separate vertical histograms can be calculated for the areas $x_1$ to $x_2$ and $x_3$ to $x_4$ to account for feet that are not at a common vertical height. Second silhouette images, synthesized by the device 24 from suitably aligned 3D objects (e.g., shoe models), can now be placed at the starting coordinates thus determined. This situation is shown schematically in Fig. 4.
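The histogram-based initialization can be sketched as below. The sketch assumes a binary silhouette (1 = foreground) stored as a list of rows, an image y-axis that grows downwards so that the feet are the lowest occupied rows, and a simple threshold of zero for detecting occupied bins; none of these details are fixed by the text, and the function names are hypothetical.

```python
def column_and_row_sums(silhouette):
    """Intensity histograms of a binary silhouette along x (per column) and y (per row)."""
    horizontal = [sum(column) for column in zip(*silhouette)]
    vertical = [sum(row) for row in silhouette]
    return horizontal, vertical


def occupied_runs(histogram, threshold=0):
    """Return (start, end) index pairs of consecutive bins above the threshold."""
    runs, start = [], None
    for index, value in enumerate(histogram):
        if value > threshold and start is None:
            start = index
        elif value <= threshold and start is not None:
            runs.append((start, index - 1))
            start = None
    if start is not None:
        runs.append((start, len(histogram) - 1))
    return runs


def starting_positions(silhouette):
    """Estimate the toe coordinates (x_1, y_1) and (x_4, y_1) for two legs in the image."""
    horizontal, vertical = column_and_row_sums(silhouette)
    legs = occupied_runs(horizontal)
    rows = occupied_runs(vertical)
    if len(legs) < 2 or not rows:
        return None                      # nobody (or only one leg) in view
    x1, x4 = legs[0][0], legs[-1][1]
    y1 = rows[-1][1]                     # lowest occupied row (y grows downwards)
    return (x1, y1), (x4, y1)
```

Returning `None` for an empty silhouette mirrors the remark that the same histograms can indicate whether a person has entered the field of view.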

Fig. 4 shows a first silhouette image 23 of a shoe with a leg and a second, synthesized silhouette image 25 of a 3D object (corresponding to a shoe, for example) in a starting position. The starting position is determined by the start coordinates obtained from the histograms and an initial orientation (e.g., vertical) of the 3D object.

The estimation means 26 estimates the alignment parameters for the 3D object using a first silhouette image 23 that corresponds to a single frame and has been derived from a camera image 22 recorded by the camera 12.

Rather than tracking a certain number of feature points in the recorded camera image 22, the entire recorded camera image 22 is exploited for robust alignment parameter estimation. The principle of the alignment parameter estimation is briefly explained below with reference to Fig. 5, which shows a first silhouette image 23 of a leg with a shoe and a second silhouette image 25 of a synthesized shoe in a starting position. Motion or alignment parameters for the 3D object of the synthesized shoe are now to be estimated such that the 3D object aligned according to these parameters, i.e., the resulting second silhouette image 25, comes to lie over the silhouette of the shoe in the first silhouette image 23. The synthetic shoe corresponding to the second silhouette image 25 can thereby be overlaid on the real shoe corresponding to the first silhouette image 23, creating the impression that the person is wearing the synthesized shoe.

The second silhouette image 25 of the 3D object is compared with the first silhouette image 23 of the recorded image. All motion or alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ are optimized to obtain as good a match as possible between the first and second silhouette images. Here, $R_x$, $R_y$ and $R_z$ are rotation angles (e.g., Euler angles), and $t_x$, $t_y$ and $t_z$ are the components of the displacement or translation vector $[t_x\ t_y\ t_z]^T$ of the 3D object.

Using the silhouette images 23, 25 as input to the means 26 for estimating yields robust alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ even for highly specular garment materials, in particular sports shoes that may carry reflectors. However, according to embodiments, texture and color information may (additionally) be provided to the means 26 for estimating the alignment parameters. That is, according to embodiments, the means 26 for estimating the alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ is designed to use, in addition to the silhouette images 23, 25, texture information from the video image 22 or image information derived by image signal processing, such as the detection of horizontal and/or vertical edges.

Tracking corresponds to finding those 3D alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ that lead to an optimal alignment of the two-dimensional silhouette images 23, 25 (and/or color information). An exhaustive search in six-dimensional space (or twelve-dimensional space for a pair of shoes) would be very inefficient. Therefore, according to embodiments, the alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ are calculated directly using a gradient-based technique.

For this purpose, the means 26 for estimating is adapted to filter the first and second silhouette images 23, 25 each with a low-pass filter in order to smooth the intensity values or gray levels at the silhouette edges of the first and second silhouette images. According to embodiments, this is achieved by a two-dimensional convolution with a separable moving-average filter (box filter) with a plurality of coefficients in each dimension. The number of coefficients in the x and y dimensions may be seven, for example, or may be chosen differently depending on the resolution level. This filtering operation transforms the binary silhouette edges into linear ramps with constant intensity gradients. Thus, within such a ramp, the closer a pixel is to the silhouette object, the higher the intensity value $I(x, y)$ of the pixel at location $(x, y)$.
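The effect of the box filter on a binary edge can be reproduced with a short sketch. It assumes images stored as lists of rows and replicates the border value for samples outside the image; the border handling and the function names are assumptions, since the text does not specify them.

```python
def box_filter_1d(values, taps=7):
    """Moving average with `taps` coefficients; samples beyond the border repeat the edge value."""
    half = taps // 2
    last = len(values) - 1
    return [
        sum(values[min(max(i + k, 0), last)] for k in range(-half, half + 1)) / taps
        for i in range(len(values))
    ]


def box_filter_2d(image, taps=7):
    """Separable 2D box filter: filter every row, then every column."""
    rows = [box_filter_1d(row, taps) for row in image]
    columns = [box_filter_1d(list(column), taps) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

Applied to a step from 0 to 1, a seven-tap filter produces a ramp that rises by 1/7 per pixel over seven pixels, which is the constant intensity gradient referred to above.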

According to embodiments, the means 26 for estimating is configured to estimate the alignment parameters 27 based on deviations of intensity values in edge regions of the first and second silhouette images. For this purpose, a system of equations can be set up and solved that depends on a difference $(I^2(x, y) - I^1(x, y))$ formed from the first and second silhouette images, on spatial derivatives $I_x(x, y)$, $I_y(x, y)$ of a constructive superposition formed from the first and second silhouette images, and on parameters defining the field of view of the recording device. According to embodiments, this is done based on the optical flow equation

$$I_x(x, y) \cdot d_x + I_y(x, y) \cdot d_y = I^2(x, y) - I^1(x, y) \qquad (1)$$

and a second relation, Eq. (2) (not reproduced here),

where $I_x(x, y)$ is an averaged intensity gradient in the x direction, $I_y(x, y)$ an averaged intensity gradient in the y direction, $(I^2(x, y) - I^1(x, y))$ an intensity difference between the filtered second silhouette image 25 and the filtered first silhouette image 23, and $d_x$, $d_y$ describe two-dimensional displacement parameters in the x and y directions. According to Eq. (2), the two-dimensional displacement parameters $d_x$, $d_y$ are functionally related to the motion parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$. Eq. (2) contains information about a rigid-body motion model and knowledge of the parameters of the camera 12. In addition, Eq. (2) contains, for each pixel, information about the distance z between the camera and the associated object point of the synthesized image 25, which can be determined efficiently, for example, from the z-buffer of the graphics card.

A camera model describes the relationship between the virtual 3D world and the 2D video images of the camera 12 and is needed for both rendering and alignment parameter estimation. A perspective projection, in which the 3D coordinates $[x, y, z]^T$ of a 3D object point are projected into an image plane 60, is shown by way of example in Fig. 6.

The 3D coordinates $[x, y, z]^T$ are projected into the image plane 60 according to

$$X = X_0 - f_x \frac{x}{z}, \qquad Y = Y_0 - f_y \frac{y}{z} \qquad (3)$$

Here, $f_x$ and $f_y$ denote the focal length of the camera 12 multiplied by scaling factors in the x and y directions. These scaled focal lengths transform the 3D object coordinates $[x, y, z]^T$ into 2D pixel coordinates X and Y. In addition, they allow for non-square pixel geometries. The two parameters $X_0$ and $Y_0$ describe the image center and its displacement from the optical axis of the camera 12 due to inaccurate placement of the CCD (charge-coupled device) sensor of the camera 12. The four parameters $f_x$, $f_y$, $X_0$ and $Y_0$ can be obtained, for example, from a camera calibration.
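As an illustration of the camera model, the following sketch projects a 3D point into pixel coordinates. It assumes the pinhole form X = X_0 - f_x * x / z and Y = Y_0 - f_y * y / z; the sign convention and the function name `project` are assumptions, and the numerical values in the usage note are made up for the example.

```python
def project(point, fx, fy, x0, y0):
    """Project a 3D point (x, y, z) to pixel coordinates (X, Y) with a pinhole model.

    Assumed form: X = x0 - fx * x / z and Y = y0 - fy * y / z.
    """
    x, y, z = point
    if z == 0:
        raise ValueError("point lies in the camera plane (z = 0)")
    return x0 - fx * x / z, y0 - fy * y / z
```

For example, with f_x = f_y = 1000 and an image center of (512, 384), a point on the optical axis maps to the image center regardless of its distance.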

The averaged intensity gradients $I_x(x, y)$, $I_y(x, y)$ can be determined, for example, by a constructive superposition according to

$$I_x(x, y) = \tfrac{1}{2}\left(I_x^1(x, y) + I_x^2(x, y)\right), \qquad I_y(x, y) = \tfrac{1}{2}\left(I_y^1(x, y) + I_y^2(x, y)\right) \qquad (4)$$

where $I_x^n(x, y)$ ($n = 1, 2$) can be determined, for example, according to

$$I_x^n(x, y) = \tfrac{1}{2}\left[\left(I^n(x, y) - I^n(x-1, y)\right) + \left(I^n(x, y-1) - I^n(x-1, y-1)\right)\right] \qquad (5)$$

and $I_y^n(x, y)$ ($n = 1, 2$), for example, according to

$$I_y^n(x, y) = \tfrac{1}{2}\left[\left(I^n(x, y) - I^n(x, y-1)\right) + \left(I^n(x-1, y) - I^n(x-1, y-1)\right)\right] \qquad (6)$$

Here, $I_x^1(x, y)$ corresponds to the intensity gradient of the first filtered silhouette image 23 in the x direction and $I_y^1(x, y)$ to the intensity gradient of the first filtered silhouette image 23 in the y direction. The same applies to $I_x^2(x, y)$ and $I_y^2(x, y)$ for the second filtered silhouette image 25. $I^1(x, y)$ and $I^2(x, y)$ correspond to the intensities of the first and second filtered silhouette images, respectively, at the point $(x, y)$. Of course, other rules for determining the partial intensity derivatives or intensity gradients $I_x(x, y)$, $I_y(x, y)$ are also possible.
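One possible way to compute such gradients is sketched below. It assumes finite differences between neighboring pixels, averaged over the two adjacent rows (for the x direction) or columns (for the y direction), followed by averaging the results of both filtered silhouette images. The factor 1/2 and the function names are assumptions of this sketch.

```python
def gradients(image):
    """Finite-difference gradients at inter-pixel positions.

    Returns (gx, gy), each of size (height - 1) x (width - 1); entry [y - 1][x - 1]
    belongs to the position between pixels (x - 1, y - 1) and (x, y).
    """
    height, width = len(image), len(image[0])
    gx = [[0.0] * (width - 1) for _ in range(height - 1)]
    gy = [[0.0] * (width - 1) for _ in range(height - 1)]
    for y in range(1, height):
        for x in range(1, width):
            a, b = image[y - 1][x - 1], image[y - 1][x]
            c, d = image[y][x - 1], image[y][x]
            gx[y - 1][x - 1] = 0.5 * ((d - c) + (b - a))   # x differences of both rows
            gy[y - 1][x - 1] = 0.5 * ((d - b) + (c - a))   # y differences of both columns
    return gx, gy


def mean_of(p, q):
    """Element-wise mean of two equally sized 2D lists."""
    return [[0.5 * (u + v) for u, v in zip(r, s)] for r, s in zip(p, q)]


def averaged_gradients(first, second):
    """Average the gradients of the two filtered silhouette images."""
    gx1, gy1 = gradients(first)
    gx2, gy2 = gradients(second)
    return mean_of(gx1, gx2), mean_of(gy1, gy2)
```

Averaging the gradients of both images, rather than using only one of them, makes the linearization symmetric with respect to the two silhouettes.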

Eq. (1) can be set up for each pixel (x, y) or each inter-pixel position of the silhouette images 23, 25. However, in preferred embodiments of the present invention, it is set up only for those points for which the right-hand side of Eq. (1) is different from zero.

Combining Eq. (1) and Eq. (2), similarly to P. Eisert and B. Girod, "Analyzing Facial Expressions for Virtual Conferencing", IEEE Computer Graphics and Applications, pp. 70-78, Sep. 1998, yields one further equation for each pixel (x, y) near the silhouette edges of the silhouette images 23, 25 for which the right-hand side of Eq. (1) is different from zero. This results in an overdetermined linear system of equations, which can be solved efficiently by the least-squares method to obtain the alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$. Remaining errors in the alignment parameter set $(R_x, R_y, R_z, t_x, t_y, t_z)$ can be reduced, for example, by applying the motion tracking iteratively.
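The least-squares step can be illustrated in a strongly simplified form. The sketch below is not the six-parameter estimation of the text: because the coupling between displacement and rigid-body motion is not available here, it assumes a single image-plane shift (d_x, d_y) shared by all pixels and solves Eq. (1) for that shift only. The inputs are assumed to be already low-pass filtered, and the function name `estimate_shift` is hypothetical.

```python
def estimate_shift(first, second):
    """Least-squares 2D shift (dx, dy) that moves `second` onto `first`.

    Each inter-pixel position with a non-zero intensity difference contributes one
    equation gx * dx + gy * dy = I2 - I1; the 2x2 normal equations are solved directly.
    """
    sxx = sxy = syy = bx = by = 0.0
    for y in range(1, len(first)):
        for x in range(1, len(first[0])):
            gx = gy = diff = 0.0
            for img, sign in ((first, -1.0), (second, 1.0)):
                a, b = img[y - 1][x - 1], img[y - 1][x]
                c, d = img[y][x - 1], img[y][x]
                gx += 0.25 * ((d - c) + (b - a))        # mean gradient of both images
                gy += 0.25 * ((d - b) + (c - a))
                diff += sign * 0.25 * (a + b + c + d)   # I2 - I1 at the inter-pixel position
            if diff == 0.0:
                continue                                # no information at this position
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
            bx += gx * diff
            by += gy * diff
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients do not constrain both shift components")
    return (syy * bx - sxy * by) / det, (sxx * by - sxy * bx) / det
```

In the full method, each equation would instead be linear in the six motion parameters, giving a 6 × 6 system of normal equations; the structure of accumulating sums over edge pixels and solving once stays the same.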

The optical flow constraint of Eq. (1) is based on the assumption of a relatively small motion offset between the first silhouette image 23 and the second silhouette image 25. To overcome this limitation, a hierarchical image pyramid approach is followed according to embodiments, as already described above. First, a rough estimate of the alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ is obtained based on downscaled and low-pass-filtered silhouette images, for which the linearity assumption holds over a larger image area. The alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ for the 3D object are then re-estimated, and remaining errors are reduced, based on higher-resolution silhouette images 23, 25.

Once the alignment parameters $(R_x, R_y, R_z, t_x, t_y, t_z)$ for aligning the 3D object have been determined, for example for each shoe of a pair, 3D computer graphics models of individualized shoes can be rendered at the current image position of the real shoes, so that the person's real shoes in the field of view of the camera 12 are replaced or overlaid by the 3D computer graphics models.

The 3D models can be individually configured by, for example, selecting a base model and then choosing between different sole types, materials and colors. In addition, individual embroideries, e.g., flags or text, can be added. Based on these configuration data, an individual 3D model is assembled. To do this, the geometry, texture and colors of the 3D models are modified to represent the selected design. Each 3D shoe model consists of various 3D subobjects composed of triangle meshes. These 3D subobjects can be exchanged to obtain different geometries.

To model different surface materials, individualized textures can be selected from a database. In addition, the textures can be assigned colors to individualize individual parts of the shoes. In this way, a person can choose between many models and assemble a shoe according to their personal preferences.

The 3D object or objects can be rendered with common 3D software tools at the position of the real shoes and with the orientation determined by the means 26 for estimating. In the rendering and augmented reality process, a background is rendered first. This can consist, for example, of real and/or synthetic videos, animations or individual images. Thereafter, the original video sequence is rendered using the corresponding silhouette image sequence as the alpha channel of the RGBA texture map. Using intermediate values of the alpha channel at the object edges may improve the embedding of the segmented video sequence in the background. The alpha channel (α channel) is an additional color channel in digital images that, in addition to the color information coded in a color space, stores the transparency of the individual pixels. Finally, the 3D objects corresponding to the virtual shoes are overlaid, covering the original shoes in the segmented video.
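The role of the alpha channel in the second rendering step can be sketched with a standard per-pixel blend. The sketch assumes images stored as rows of (r, g, b) tuples and alpha values between 0 and 1 taken from the (possibly smoothed) silhouette image; the function name `composite` is hypothetical.

```python
def composite(foreground, background, alpha):
    """Blend two RGB images pixel by pixel: alpha = 1 keeps foreground, 0 keeps background."""
    return [
        [
            tuple(a * f + (1.0 - a) * b for f, b in zip(fg_px, bg_px))
            for fg_px, bg_px, a in zip(fg_row, bg_row, a_row)
        ]
        for fg_row, bg_row, a_row in zip(foreground, background, alpha)
    ]
```

Intermediate alpha values at the silhouette edges mix video and background colors, which is what softens the transition mentioned above.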

However, the legs in the original 2D video should also cover some parts of the synthesized shoes. By adding a transparent, invisible leg model, the Z-buffer of a graphics card can be manipulated so that all occlusions are detected correctly and the 3D model is inserted into the 2D video accordingly. Z-buffering is used in computer graphics to determine hidden surfaces in a 3D scene. Using the information in the Z-buffer, the method determines pixel by pixel which elements of a scene must be drawn and which are hidden. Today's graphics cards support Z-buffering in hardware as the standard technique for solving the visibility problem. When an object is rendered by a 3D graphics card, the depth information (the z-coordinate) of the generated pixels is stored in the so-called Z-buffer. This buffer, usually organized as a two-dimensional array (with indices X and Y), contains a depth value for each point of the object that is visible on the screen. If another object is to be displayed at the same pixel, the rendering algorithm compares the depth values of both objects and assigns the pixel the color value of the object closest to the observer. The depth information of the selected object is then stored in the Z-buffer and replaces the old value. The Z-buffer thus allows the graphics card to simulate natural depth perception: a nearby object hides a distant one. Moreover, the per-pixel depth values of the Z-buffer resulting from the synthesis can be used to efficiently determine the distance information of the object points needed in Eq. (2).
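The depth test and the effect of an invisible leg model can be illustrated in software as follows. This is a sketch under assumptions: smaller depth values are taken to be closer to the viewer, fragments are given as (x, y, depth, colour) tuples, and a colour of `None` stands for the invisible leg model, meaning that the underlying video pixel stays visible. A real graphics card performs this test in hardware; the function name is hypothetical.

```python
def render_with_depth_test(width, height, fragments, far=float("inf")):
    """Keep, per pixel, the colour of the fragment closest to the viewer.

    `fragments` is an iterable of (x, y, depth, colour). A colour of None models the
    invisible leg proxy: it wins the depth test but leaves the video pixel visible.
    """
    depth = [[far] * width for _ in range(height)]
    colour = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:          # closer than what is stored so far
            depth[y][x] = z
            colour[y][x] = c
    return colour, depth
```

Where the leg proxy is in front of the shoe, the resulting colour is `None`, so the original video (the real leg) shows through; elsewhere the virtual shoe is drawn. The returned depth array corresponds to the per-pixel distance information that the text proposes to reuse for the parameter estimation.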

For this purpose, FIGS. 7a and 7b show two examples of a shoe rendering with some removed shoe parts which are later covered by the legs.

In the following some results of the estimation of the alignment parameters 27 and the rendering are presented. For this purpose, four different shoe models were configured and the virtual mirror system 10 was started.

A camera 12 records a scene with a resolution of 1024 x 768 pixels. A person enters the green area 18 in front of the system 10.

In all cases, the shoes were correctly detected, segmented and tracked in their motion. FIG. 8 shows various examples of the output of the virtual mirror system. The upper row shows some images of the original scene captured with the camera 12. The corresponding results, as output on the monitor 16, are shown in the lower row. It can be seen that the 3D computer models correctly follow the 3D movement of the original shoes, even for fairly extreme foot positions. Since the entire system should behave like a real mirror, real-time signal processing is needed. All algorithms are therefore optimized for speed. Image processing algorithms are applied in an image pyramid, and motion tracking is also computed at a lower resolution.
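Processing at reduced resolution can be sketched as follows. The 2x2 block averaging, the number of levels and the function names `downsample2` and `build_pyramid` are assumptions for illustration; the text does not specify which filter or how many levels the system uses.

```python
import numpy as np

def downsample2(img):
    """Halve the resolution by averaging each 2x2 block (assumed filter)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w].astype(np.float64)
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img, levels):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid

# Hypothetical 1024 x 768 silhouette image with a rectangular foreground.
frame = np.zeros((768, 1024))
frame[300:500, 400:700] = 1.0
pyramid = build_pyramid(frame, levels=3)
shapes = [level.shape for level in pyramid]
```

Each level holds a quarter of the pixels of the level before it, so an operation run on the coarsest of these three levels touches one sixteenth of the original pixel count.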

In summary, embodiments of the present invention provide a concept for real-time 3D motion tracking of objects, particularly shoes, in a virtual mirror environment. From images of a single camera 12, alignment parameters corresponding to the motion of body parts are estimated using low-complexity linear optimization methods. Motion tracking is not limited to footwear models but can also be applied to other objects if a corresponding three-dimensional geometry description is available. The motion information or alignment parameters are then used to render customized athletic shoes into the real scene so that a person can observe himself or herself wearing the new shoes.
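The linear character of the estimation can be illustrated with a simplified sketch. It solves the optical flow constraint Ī_x·d_x + Ī_y·d_y = I² − I¹ in the least-squares sense for a single 2D displacement (d_x, d_y) only. The full method estimates six alignment parameters (R_x, R_y, R_z, t_x, t_y, t_z), which is not reproduced here. The Gaussian blobs standing in for low-pass-filtered silhouette images and the function names are assumptions.

```python
import numpy as np

def gaussian_blob(shape, center, sigma):
    """Smooth stand-in for a low-pass-filtered silhouette image."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def estimate_shift(i1, i2):
    """Least-squares (d_x, d_y) from Ix*d_x + Iy*d_y = I2 - I1."""
    gy1, gx1 = np.gradient(i1)   # np.gradient returns (d/dy, d/dx)
    gy2, gx2 = np.gradient(i2)
    gx = 0.5 * (gx1 + gx2)       # gradients averaged over both images
    gy = 0.5 * (gy1 + gy2)
    a = np.stack([gx.ravel(), gy.ravel()], axis=1)  # one equation per pixel
    b = (i2 - i1).ravel()
    d, *_ = np.linalg.lstsq(a, b, rcond=None)
    return d

# I2: synthesized silhouette in the starting position (hypothetical data).
i2 = gaussian_blob((64, 64), center=(30.0, 32.0), sigma=8.0)
# I1: segmented silhouette, i.e. I2 displaced by (+1.0, -0.5) pixels.
i1 = gaussian_blob((64, 64), center=(31.0, 31.5), sigma=8.0)
d = estimate_shift(i1, i2)
```

With this sign convention the estimate `d` approximates the displacement that moves the synthesized silhouette onto the segmented one. Because every pixel contributes one linear equation, the system is heavily overdetermined and can be solved in a single step without iterative search.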

It should be understood that the present invention is not limited to the particular components of the device or the illustrated approach, as these components and methods may vary. The terms used herein are intended only to describe particular embodiments and are not intended to be limiting. When the singular or indefinite article is used in the specification and claims, it also refers to the plural of these elements unless the overall context clearly indicates otherwise. The same applies in the opposite direction.

Depending on the circumstances, the methods according to the invention can be implemented in hardware or in software. The implementation may take place on a digital storage medium, in particular a floppy disk, CD or DVD, with electronically readable control signals that can interact with a programmable computer system so that the respective method is executed. In general, the invention thus also consists in a computer program product with program code stored on a machine-readable carrier for carrying out the method according to the invention when the computer program product runs on a computer. In other words, the present invention is therefore also a computer program with program code for carrying out the method for aligning when the computer program runs on a computer and/or microcontroller.

Claims


1. A device (20) for aligning a 3D object in a recording device image (22) corresponding to a field of view of a recording device (12), having the following features:

means (21) for segmenting the recording device image (22) into a foreground and a background to obtain a first silhouette image (23);

means (24) for synthesizing a second silhouette image (25) of the 3D object, corresponding to the field of view, in a starting position; and

means (26) for estimating alignment parameters (27) for aligning the 3D object from the starting position based on deviations between the first and second silhouette images.

2. Device according to claim 1, wherein the recording device (12) comprises a camera.

3. Device according to claim 1 or 2, wherein the 3D object is a 3D model of a shoe.

4. A device according to any one of the preceding claims, wherein the device (20) further comprises means for determining an area in the first silhouette image (23) to which the 3D object is to be aligned.

5. The apparatus of claim 4, wherein the means for determining the area is adapted to determine in the first silhouette image intensity distributions in horizontal and vertical dimensions, and to obtain coordinates for the starting position of the 3D object.

6. The apparatus according to claim 5, wherein the means for determining the area is adapted to obtain a coordinate for the starting position of the 3D object in the vertical direction from an abrupt increase or decrease in intensity in the vertical direction in a lower area of the first silhouette image (23).

7. The apparatus according to claim 5, wherein the means for determining the area is adapted to obtain a coordinate for the starting position of the 3D object in the horizontal direction from an abrupt increase or decrease in intensity in the horizontal direction in the first silhouette image.

8. Apparatus according to any one of the preceding claims, wherein the means (21) for segmenting is adapted to separate the foreground and the background by first separating the background from the foreground, based on background information, for an image scaled down in resolution relative to the recording device image (22) in order to obtain a low-resolution silhouette image, and then obtaining silhouette edges of the first silhouette image (23) at the resolution of the recording device image (22) based on the low-resolution silhouette image and the background information.

9. Apparatus according to any one of the preceding claims, wherein the means (26) for estimating is adapted to low-pass filter the first and second silhouette images in order to smooth silhouette edges of the first and second silhouette images.

10. The apparatus according to claim 1, wherein the means for estimating is configured to set up and solve a system of equations according to an optical flow equation that depends on a difference formed from the first and second silhouette images, on local derivatives of a constructive superposition formed from the first and second silhouette images, and on parameters that define the field of view of the recording device.

11. The device according to claim 10, wherein the means (26) is configured to estimate the alignment parameters (27) based on a combination of

$$\bar{I}_x(x, y) \cdot d_x + \bar{I}_y(x, y) \cdot d_y = I^2(x, y) - I^1(x, y)$$

and

[equation missing in source]

where $f(\cdot)$ is a function specification, $(R_x, R_y, R_z, t_x, t_y, t_z)$ are the alignment parameters (27), $\bar{I}_x(x, y)$ is an averaged intensity gradient in the x direction, $\bar{I}_y(x, y)$ is an averaged intensity gradient in the y direction, $(I^2(x, y) - I^1(x, y))$ is an intensity difference between the filtered second silhouette image (25) and the filtered first silhouette image (23), and $d_x$, $d_y$ denote two-dimensional displacement parameters in the x and y directions.

12. An apparatus according to any preceding claim, wherein the means (26) for estimating the alignment parameters (27) for aligning the 3D object uses, in addition to the first and second silhouette images, texture information from the recording device image (22) or image information derived therefrom.

13. A method for aligning a 3D object in a recording device image (22) corresponding to a field of view of a recording device (12), comprising the following steps:

Segmenting the recording device image (22) into a foreground and a background to obtain a first silhouette image (23);

Synthesizing a second silhouette image (25) of the 3D object in a starting position; and

Estimating alignment parameters (27) for aligning the 3D object from the starting position based on deviations between the first and second silhouette images.

14. Computer program for carrying out the method according to claim 13, when the computer program runs on a computer and / or microcontroller and / or graphics card.