WO2009027307A1

WO2009027307A1 - Method for automatically detecting at least the type and/or location of a gesture formed using an appendage, particularly a hand gesture

Info

Publication number: WO2009027307A1
Application number: PCT/EP2008/060934
Authority: WO
Inventors: Markus Schlattmann; Reinhard Klein
Original assignee: Rheinische Friedrich-Wilhelms-Universität
Priority date: 2007-08-31
Filing date: 2008-08-21
Publication date: 2009-03-05
Also published as: DE102007041482A1

Abstract

The invention relates to a method for automatically detecting at least the type and/or the location (position and orientation) of a gesture made using an appendage, particularly a hand gesture, a plurality of two-dimensional images of the appendage being captured simultaneously from different directions. According to the invention, the two-dimensional images are combined into a three-dimensional image, and the three-dimensional image is analyzed with regard to at least one gesture characteristic. In this manner, a method for automatically recognizing gestures is disclosed, which can be performed in a simple and reliable manner in real time, and thus allows complex process controls, such as the control of a vehicle.

Description

Rheinische Friedrich-Wilhelms-Universität Bonn Düsseldorf, 20 August 2008

Our reference: UD 40094 / SAM

Rheinische Friedrich-Wilhelms-Universität

Regina-Pacis-Weg 3

53113 Bonn

Method for the automatic recognition of at least the type and/or the location of a gesture formed with a limb, in particular a hand gesture

The invention relates to a method for automatically detecting at least the type and/or the location of a gesture formed with a limb, in particular a hand gesture, wherein a plurality of two-dimensional images of the limb are captured simultaneously from different directions.

The detection of hand gestures, in particular for controlling processes, is of great interest in various technical fields. In general, it is particularly important to recognize the type of the gesture on the one hand and its location in space on the other. Recognizing the type of the gesture means recognizing which gesture is being made; a hand gesture may be defined, e.g., by whether the individual fingers of the hand are closed or extended. The location of the gesture in space may be given by its position and/or orientation and thus defines, e.g., where a predetermined finger is located and where it points. This means, in particular, that in the present case the term "location" is to be understood, depending on the application, as only the orientation in space, only the position in space, or the orientation and the position in space together.

Reasonably reliable gesture recognition systems that allow recognition of both the type of a gesture and its location have so far either existed only in two-dimensional space or required considerable technical effort and aids, such as markers attached to various points of the limb, e.g. to the fingertips of a hand. As a result, no methods or systems are known with which the type and location of a gesture can be reliably detected in three-dimensional space.

Thus, it is the object of the invention to provide such a method for automatic gesture recognition, which can be carried out in a simple and reliable manner in real time.

Based on the method described above, this object is achieved in that the two-dimensional images are combined into a three-dimensional image and the three-dimensional image is analyzed with regard to at least one gesture feature.

It is therefore an essential point of the invention first to generate two-dimensional images and then to combine these two-dimensional images into a three-dimensional image, which is then used for the analysis. The term "two-dimensional image" here means any image that has at least two dimensions but is not a three-dimensional image. This naturally includes the conventional two-dimensional images that can be captured with conventional, widely used cameras. In addition, however, so-called 2.5-dimensional images are known, which provide depth values in addition to the two-dimensional image; such images are also considered two-dimensional images in the present case. In all these cases, however, the invention provides for the gesture feature to be analyzed in three dimensions.

In principle, the captured two-dimensional images may be subjected to different processing steps before being combined into the three-dimensional image. According to a preferred development of the invention, however, it is provided that the two-dimensional images are at least segmented before being combined into the three-dimensional image, i.e. the region of the limb is separated from the background.

According to a preferred embodiment of the invention, it is further provided that a three-dimensional reconstruction of the limb is performed when the two-dimensional images are combined into the three-dimensional image. This three-dimensional reconstruction of the limb does not have to correspond to a "perfect", i.e. complete, image of the limb; rather, it should be a three-dimensional representation of the limb that allows at least one gesture feature to be analyzed.

In principle, various methods for obtaining a three-dimensional reconstruction of the limb are possible. According to a preferred embodiment of the invention, it is provided that, in the three-dimensional reconstruction of the limb, its visual envelope (visual hull) is determined.

The analysis with regard to at least one gesture feature can be carried out in different ways. According to a preferred embodiment of the invention, however, it is provided that, in the analysis of at least one gesture feature, the three-dimensional reconstruction of the limb is analyzed for protrusions. Protrusions are understood to be prominent, outlying points that represent, as it were, the highest elevations of the analyzed three-dimensional structure. In particular, it is possible in this way to determine the position or orientation of fingertips, which can be an essential prerequisite for determining the type and location of a hand gesture.

The analysis of the protrusions can likewise be carried out in quite different ways. According to a preferred embodiment of the invention, however, it is provided that the protrusions are analyzed to see whether they lie on an approximation of the convex hull of the limb. This also serves to allow a protrusion to be assigned to a fingertip.

To assign a protrusion to, e.g., a fingertip, it may be sufficient that it is found on an approximation of the convex hull of the limb. According to a preferred embodiment of the invention, however, it is provided that the three-dimensional positions of the voxels (three-dimensional pixels) of the protrusions are projected into the two-dimensional images. If they lie on the edge of an image there, it is to be assumed that the corresponding protrusions are probably not formed by fingertips but by artifacts, e.g. by an arm that extends into the image and is cut off obliquely. Accordingly, such positions can be eliminated as fingertip candidates.
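The elimination of candidates at the image edge can be sketched as follows. This is only an illustration under stated assumptions: the names `touches_image_border` and `reject_border_candidates`, the `margin` parameter, the representation of each camera as a callable that maps a 3D point to pixel coordinates, and the rule that a hit in any one image is enough to discard a candidate are all hypothetical and are not specified in the text.

```python
def touches_image_border(pixel, width, height, margin=1):
    """Return True if the pixel (u, v) lies within `margin` pixels of the image edge."""
    u, v = pixel
    return u < margin or v < margin or u >= width - margin or v >= height - margin


def reject_border_candidates(candidates, project_fns, image_size, margin=1):
    """Drop protrusion voxels whose projection lies on the edge of a camera image.

    candidates:  list of 3D voxel positions (x, y, z)
    project_fns: one callable per camera mapping a 3D point to (u, v) pixels
                 (hypothetical stand-in for the calibrated camera projection)
    image_size:  (width, height) shared by all cameras (assumption)
    """
    width, height = image_size
    kept = []
    for voxel in candidates:
        pixels = [project(voxel) for project in project_fns]
        # Assumption: one border hit in any image marks the candidate as an artifact.
        if any(touches_image_border(p, width, height, margin) for p in pixels):
            continue
        kept.append(voxel)
    return kept
```

A candidate produced by an arm entering the picture projects onto the first or last pixel row or column of at least one image and is therefore filtered out, while a fingertip in the middle of the work area is kept.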

The recognition of gestures can in particular be based on characterizing the protrusions according to various criteria. According to a preferred embodiment of the invention, it is provided in this context that the protrusions are characterized at least as a function of their distance from a predetermined point, preferably from the local center of mass. In this way, the "furthest-projecting" protrusions can be determined, in order ultimately to arrive at the determination of fingertips.

As a result, it is provided according to a preferred embodiment of the invention in particular that the protrusions are used to assign the gesture formed by the limb to a predetermined group of gesture types, preferably to exactly one predetermined gesture type. This ultimately constitutes the actual recognition of the type of gesture, so that according to a preferred refinement of the invention it can also be provided that a predetermined control of a process is carried out automatically depending on the predetermined group of gesture types or the predetermined gesture type. It is thus conceivable, e.g., to perform a predetermined type of control depending on the detected type of gesture. If a "show" gesture is detected, it may be provided, for example, to change the viewing direction in a visual simulation or to control a vehicle, i.e. to determine its direction of travel. It is also preferably provided that the process control is carried out as a function of the detected location of the gesture in space. In the example of the "show" gesture, it can thus be provided, for example, that the pointing direction indicates what the viewing direction should be or in which direction the vehicle should travel.

In principle, provision may be made for the detection of the gesture and the control of the method to be carried out with a time delay. According to a preferred development of the invention, however, it is provided that the detection of the gesture and the control of the method take place in real time. In particular, the detection and evaluation of at least 25 images per second can be provided. In this way, sophisticated applications are possible, such as the previously mentioned control of a vehicle.

In principle, it can be provided that the method, and possibly also the control by the detected gestures, requires an initialization. According to a preferred embodiment of the invention, however, it is provided that the gesture recognition, and in particular also the control, is started automatically as soon as a gesture has been detected and assigned to a predetermined group of gestures or to a predetermined gesture. In other words, the above-mentioned method steps can be run through continuously, and the actual gesture recognition starts automatically as soon as a limb can be detected in such a way that a plurality of two-dimensional images of this limb can be captured simultaneously from different directions.

In order to avoid, in particular, an uncontrolled state of the process controlled by means of the gestures, a preferred embodiment of the invention further provides that a warning is issued, preferably as an optical and/or acoustic signal, in the case where the gesture formed by the limb cannot be assigned to any predetermined group of gestures or to any predetermined gesture. In this way, the user is informed that no gesture control is currently possible and that, in order to resume the method, he must, e.g., bring his hand back into the region in which the plurality of two-dimensional images of the limb can be captured from different directions.
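The interplay of automatic start, gesture-dependent control and warning can be illustrated with a minimal sketch. Everything here is hypothetical: the function `control_step`, the dictionary of handlers and the `warn` callback are illustrative names, and the warning is reduced to a text message where the text envisages an optical and/or acoustic signal.

```python
def control_step(gesture, pointing_direction, handlers, warn):
    """Run one control step for the gesture recognised in the current frame.

    gesture:            name of the recognised gesture type, or None if the limb
                        could not be assigned to any predetermined gesture
    pointing_direction: detected direction of the gesture in space (any value)
    handlers:           dict mapping gesture names to control callbacks (assumed)
    warn:               callback that signals the user (assumed)
    Returns True if a control action was carried out, False if a warning was given.
    """
    handler = handlers.get(gesture)
    if handler is None:
        # No predetermined gesture matched: avoid an uncontrolled state, warn instead.
        warn("no gesture recognised, gesture control suspended")
        return False
    # Control starts without explicit initialisation as soon as a gesture matches.
    handler(pointing_direction)
    return True
```

Calling such a step once per evaluated frame means that control resumes by itself as soon as the hand re-enters the region seen by all cameras.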

Finally, the limb may in principle be provided with markers even in the method described here. According to a preferred embodiment of the invention, however, it is provided that the gesture is detected without markers attached to the limb. This can be achieved in particular by the fact that the acquired two-dimensional images are not themselves analyzed; instead, the two-dimensional images are combined into the three-dimensional image before the gesture feature analysis is carried out.

The method described above allows the spatial location, namely the position and the orientation, of e.g. a human hand to be tracked for several different gestures, thereby ensuring natural and efficient human-machine interaction. In particular, this method has the following advantages:

The user needs only his bare hand; it is not necessary to provide the hand with markers.

Initialization can be fully automatic, which means that tracking of the hand can start as soon as the user moves his hand into the work area. No special position or gesture of the hand is therefore required for initialization.

The calculation can be performed in real time, so that the method can be used for direct interaction.

Even if the user changes, no changes to the settings are required.

The acquisition of the two-dimensional images of the limb can be done in different ways. According to a preferred embodiment of the invention, however, three or more cameras are provided which observe the limb from different directions in a special arrangement. To calculate the location and gesture of the limb, as described above, a three-dimensional reconstruction of the limb is first determined from the camera images, the two-dimensionally acquired information being brought into a consistent three-dimensional representation. For this purpose, e.g., the images of all cameras are read out synchronously and each is segmented, i.e. divided into a region that corresponds to the limb and the background.

When all images have been segmented, the regions of the limb are projected from the viewpoint of the respective camera through the three-dimensional space, so that a rough three-dimensional reconstruction of the hand results as the intersection of the three projections. In other words, the three-dimensional reconstruction of the hand consists of all voxels whose projections lie within the respective hand region in all two-dimensional camera images. This is also referred to as reconstruction of the visual envelope or as the "shape-from-silhouettes" technique.
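The voxel criterion just described can be sketched in a few lines. This is a minimal illustration, not the implementation used in the embodiment: the function names, the representation of silhouettes as nested lists of booleans, and the use of 3x4 projection matrices from a prior camera calibration are assumptions made for the example.

```python
def project(P, point):
    """Project a 3D point with a 3x4 matrix P; return integer (u, v) or None."""
    x, y, z = point
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 1e-9:  # point is not in front of the camera
        return None
    return round(u / w), round(v / w)


def carve_visual_hull(silhouettes, projections, voxel_centres):
    """Keep every voxel whose projection lies inside the limb region of all images.

    silhouettes:   one segmented image per camera, as rows of booleans
                   (True = limb, False = background)
    projections:   one 3x4 projection matrix per camera (assumed to be known)
    voxel_centres: iterable of (x, y, z) voxel centre coordinates
    """
    hull = []
    for centre in voxel_centres:
        for mask, P in zip(silhouettes, projections):
            pixel = project(P, centre)
            if pixel is None:
                break
            u, v = pixel
            if not (0 <= v < len(mask) and 0 <= u < len(mask[0])):
                break  # projects outside the image
            if not mask[v][u]:
                break  # projects onto background in this camera
        else:
            hull.append(centre)  # inside the limb region in every image
    return hull
```

Because each voxel is tested independently of all others, this step lends itself to parallel evaluation, which is one reason it is compatible with the real-time requirement stated above.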

Special features can now be searched for in the rough three-dimensional reconstruction of the hand. To limit the number of potential features, the search may, e.g., be restricted to protrusions that can be formed by fingertips and that lie on a k-DOP (discrete oriented polytope), an approximation of the convex hull of the limb. A k-DOP is a bounding volume constructed by moving k planes of fixed orientation in from infinity until they touch the three-dimensional reconstruction. The k-DOP is then the convex polytope resulting from the intersection of the half-spaces delimited by these k planes. For each of these planes there is a voxel belonging to the three-dimensional reconstruction that touches the plane and thus describes its position. In a preferred implementation of the method, a 26-DOP is used, so that there are 26 planes and thus 26 voxels are determined. These 26 voxels form the set of candidates for the extraction of fingertip features. These voxels are then classified by analyzing their local neighborhoods. For a preferred method it is envisaged, for example, to perform a simple analysis in which only the distance to the local center of mass is used for characterization, as stated above. If the distance is very large, the voxel or feature is located on a very prominent part of the three-dimensional reconstruction and thus probably on one of the desired fingertips.
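The extraction of the 26 candidate voxels and their characterization can be sketched as follows. The sketch assumes the common choice of plane normals for a 26-DOP, namely all non-zero vectors with components in {-1, 0, 1}; the text does not list the normals. The neighborhood `radius` that makes the center of mass "local" is likewise a hypothetical parameter, as are the function names.

```python
import math
from itertools import product

# Assumed plane normals of the 26-DOP: 6 axis, 12 edge and 8 corner directions.
DOP26_DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def dop_extreme_voxels(voxels):
    """For each plane normal, return the voxel that the moved-in plane touches first."""
    return {
        d: max(voxels, key=lambda p: p[0] * d[0] + p[1] * d[1] + p[2] * d[2])
        for d in DOP26_DIRECTIONS
    }


def protrusion_measure(voxels, candidate, radius):
    """Distance from a candidate voxel to the centre of mass of its neighbourhood.

    The neighbourhood is every voxel within `radius` of the candidate
    (hypothetical definition of "local").
    """
    neighbours = [p for p in voxels if math.dist(p, candidate) <= radius]
    n = len(neighbours)  # at least 1 when the candidate itself is a voxel
    centroid = tuple(sum(p[i] for p in neighbours) / n for i in range(3))
    return math.dist(candidate, centroid)
```

For a voxel at the end of a thin, finger-like structure, all neighbors lie on one side, so the local center of mass is far from the candidate and the measure is large. For a voxel on a flat part of the surface, the neighbors surround it more evenly and the measure is smaller.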

In the following, the method according to the invention is explained in more detail on the basis of a preferred exemplary embodiment with reference to the drawing. In the drawing:

Fig. 1 shows the four types of gestures recognizable by the presently described method according to a preferred embodiment of the invention,

Fig. 2 shows the visual envelope of a grasping hand, determined by means of three segmented two-dimensional images,

Fig. 3 schematically shows the extraction of DOP points in two dimensions and in three dimensions, and

Fig. 4 shows histograms of the determined measure of protrusion for different types of gestures.

FIG. 1 shows the four types of hand gestures that can be detected by means of the presently described method according to the preferred exemplary embodiment of the invention. From left to right, these are the gestures "palm", "grasp", "show A" and "show B". As can be seen from FIG. 1, each hand gesture can be assigned "furthest-projecting" fingertips, each of which is marked with an arrow in FIG. 1.

These protruding fingertips are of particular interest in the presently described preferred embodiment, since one of the four predetermined gesture types can be unambiguously inferred by detecting the respective fingertip and additionally detecting the direction of the corresponding finger. Furthermore, if the positions of two protruding fingertips relative to the center of mass of the hand are known, the location of the hand, i.e. its position and orientation, can be determined.

For this purpose, an algorithm is used in the present case with which all the information required to recognize both the gesture and its location in space can be extracted. This information is calculated on the basis of a three-dimensional binary voxel grid of the visual envelope, which in turn has been created from the segmented two-dimensional images of the individual cameras.

According to the method described here, three cameras are used, which are arranged in one plane. The angle between the viewing directions of adjacent cameras is 60°, which prevents one camera from appearing in the background of another camera's image. It has been found that this arrangement is sufficient to determine the visual envelope of the hand with adequate accuracy.

After the segmentation of the images, the segmented two-dimensional images (10), as shown schematically in FIG. 2, are combined to form the visual envelope (11) of the hand.

Possible fingertips are defined as the voxels of the visual envelope that touch one of the planes of the surrounding DOP. In the case of a 26-DOP, this yields 26 DOP points of the visual envelope in three dimensions; these are shown on the right in FIG. 3 and, as far as they are visible, are indicated by arrows. The corresponding two-dimensional representation (12) is shown on the left.

The fingertips may be considered endpoints of protruding regions of the voxel grid. In order to assess the potential fingertips, a measure of protrusion must be found. In the present case, the distance of the respective point from the local center of mass is used as this measure.

FIG. 4 shows how this measure of protrusion can be analyzed. For this purpose, the histograms show, for 150 images each of the "show A" gesture (left), the "show B" gesture (center) and a "fist" gesture (right), in which no finger protrudes, the distance of the respective point from the local center of mass as the measure of protrusion. The "show A" gesture clearly exhibits two protrusions, while the "show B" gesture exhibits a single protrusion, and in the "fist" gesture no defined protrusion can be determined any longer.
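The observation from FIG. 4 suggests a simple decision rule based on the number of pronounced protrusions, which can be sketched as follows. The `threshold` on the measure of protrusion is a hypothetical parameter, since no numerical value is given in the text, and the sketch assumes that neighboring DOP points on the same fingertip have already been merged so that each protrusion is counted once.

```python
def count_protrusions(measures, threshold):
    """Count the candidates whose measure of protrusion reaches the threshold."""
    return sum(1 for m in measures if m >= threshold)


def classify_by_protrusions(measures, threshold):
    """Map the number of pronounced protrusions to one of the gestures of FIG. 4.

    Two protrusions -> "show A", one -> "show B", none -> "fist".
    Any other count returns None, i.e. no assignment is made by this rule.
    """
    count = count_protrusions(measures, threshold)
    return {0: "fist", 1: "show B", 2: "show A"}.get(count)
```

A result of None corresponds to the case in which the gesture cannot be assigned by this rule alone and further analysis, or a warning to the user, is needed.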

If this analysis shows that there are two fingertips, it must be determined which of them corresponds to the thumb. The identification of the thumb in the present case is based on the observation that the maximum geodesic distance between the thumb tip and all other possible candidates is less than the correspondingly calculated maximum distance for the other fingertips. Since calculating the exact geodesic distance is currently practically impossible in real-time applications, it is estimated instead. It must then be determined whether a finger not identified as the thumb is the middle finger or the index finger. This is achieved by calculating a covariance matrix locally around the fingertip using a GPU algorithm. The ratio between the largest and the second-largest eigenvalue of the covariance matrix makes it possible to determine the identity of the finger. If it is determined that this finger is the index finger, the direction of the finger can be determined at least approximately.
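The eigenvalue ratio of the local covariance matrix can be sketched on the CPU as follows. This is only an illustration of the quantity involved and not the GPU algorithm mentioned above: the neighborhood `radius`, the function names and the closed-form eigenvalue computation for symmetric 3x3 matrices are choices made for this example, and no decision threshold is given because the text does not state one.

```python
import math


def covariance_matrix(points):
    """3x3 covariance matrix (population form) of a list of 3D points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix in descending order (trigonometric form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # matrix is already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2 * p1
    p = math.sqrt(p2 / 6)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2))  # clamp against rounding errors
    phi = math.acos(r) / 3
    largest = q + 2 * p * math.cos(phi)
    smallest = q + 2 * p * math.cos(phi + 2 * math.pi / 3)
    return [largest, 3 * q - largest - smallest, smallest]


def eigenvalue_ratio(voxels, fingertip, radius):
    """Ratio of largest to second-largest eigenvalue around a fingertip voxel."""
    local = [p for p in voxels if math.dist(p, fingertip) <= radius]
    values = symmetric_eigenvalues(covariance_matrix(local))
    return values[0] / values[1] if values[1] > 0 else math.inf
```

The ratio grows with the elongation of the voxel cloud around the fingertip: a long, thin region yields a large ratio, a short, stubby one a ratio closer to 1.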

As a result, a method for automatically recognizing the type and the location of a gesture formed with a limb, in particular a hand gesture, is provided which can be carried out in a simple and reliable manner in real time and in this way allows sophisticated process controls, such as the control of a vehicle.

Claims

1. A method for automatically detecting at least the type and/or location of a gesture formed with a limb, in particular a hand gesture, wherein a plurality of two-dimensional images of the limb are captured simultaneously from different directions, characterized in that the two-dimensional images are combined into a three-dimensional image and the three-dimensional image is analyzed for at least one gesture feature.

2. The method according to claim 1, characterized in that the two-dimensional images are segmented by separating the region of the limb from the background before they are combined into the three-dimensional image.

3. The method according to claim 1 or 2, characterized in that a three-dimensional reconstruction of the limb is performed when the two-dimensional images are combined into the three-dimensional image.

4. The method according to claim 3, characterized in that, in the three-dimensional reconstruction of the limb, its visual envelope is determined.

5. The method according to claim 3 or 4, characterized in that, in the analysis of at least one gesture feature, the three-dimensional reconstruction of the limb is analyzed for protrusions.

6. The method according to claim 5, characterized in that the protrusions are analyzed to see whether they lie on an approximation of the convex hull of the limb.

7. The method according to claim 5 or 6, characterized in that the three-dimensional positions of the voxels of the protrusions are projected into the two-dimensional images.

8. The method according to any one of claims 5 to 7, characterized in that the protrusions are characterized at least as a function of their distance from a predetermined point, preferably from the local center of mass.

9. The method according to any one of claims 5 to 8, characterized in that the protrusions are used to assign the type of gesture formed by the limb to a predetermined group of gesture types, preferably to exactly one predetermined gesture type.

10. The method according to claim 9, characterized in that a predetermined type of process control is performed automatically in dependence on the predetermined group of gesture types or the predetermined gesture type.

11. The method according to claim 10, characterized in that the process control is performed in dependence on the detected location of the gesture in space.

12. The method according to claim 10 or 11, characterized in that the detection of the type or the location of the gesture and the process control take place in real time, preferably by evaluating at least 25 images per second.

13. The method according to any one of claims 10 to 12, characterized in that the process control is started automatically as soon as a gesture has been detected and assigned to a predetermined group of gestures or to a predetermined gesture.

14. The method according to any one of claims 9 to 13, characterized in that a warning is issued, preferably as an optical and/or acoustic signal, in the case in which the type of gesture formed by means of the limb cannot be assigned to any predetermined group of gesture types or to any predetermined gesture type.

15. The method according to any one of claims 1 to 13, characterized in that the gesture is detected without markers attached to the limb.