WO2008139399A2 - Method of determining motion-related features and method of performing motion classification - Google Patents

Method of determining motion-related features and method of performing motion classification

Info

Publication number
WO2008139399A2
Authority
WO
WIPO (PCT)
Prior art keywords
motion
images
image
features
sequence
Prior art date
Application number
PCT/IB2008/051843
Other languages
French (fr)
Other versions
WO2008139399A3 (en)
Inventor
Olivier Pietquin
Vasanth Philomin
Original Assignee
Philips Intellectual Property & Standards Gmbh
Koninklijke Philips Electronics N. V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Philips Intellectual Property & Standards Gmbh, Koninklijke Philips Electronics N. V. filed Critical Philips Intellectual Property & Standards Gmbh
Publication of WO2008139399A2 publication Critical patent/WO2008139399A2/en
Publication of WO2008139399A3 publication Critical patent/WO2008139399A3/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Definitions

  • The invention relates to a method of determining motion-related features pertaining to the motion of an object, and to a method of performing motion classification for the motion of an object. Furthermore, the invention relates to a system for determining motion-related features pertaining to an object, and to a system for performing motion classification.
  • One of the most natural communication means between humans is that of making movements with the hands or head, also called gestures, as a supplement to spoken communication, or as a means of communication in itself.
  • A simple augmentation to speech can comprise, for example, moving one's hand and arm to point in a particular direction when giving another person directions.
  • An example of communication solely by motion is sign language, based entirely on gestures. Since gesturing is second nature to humans, it follows that gesturing would be a natural and easy means of communication between humans and computers, for example in a home dialog system, in a dialog system in an office environment, for automatic sign language interpretation, etc. Owing to developments in technology in recent years, it is conceivable that the use of various dialog systems will become more and more widespread in the future, and that classification of a user's movements, i.e. gesture recognition, will play an increasingly important role.
  • Several gesture recognition systems have been proposed in the state of the art. Generally, these proposed systems rely on image analysis to locate a user in each of a sequence of images and to analyse successive postures to determine the gesture he is making. Some systems require that the user wear a coloured suit or position sensors, a requirement which would be most undesirable for the average user of a home dialog system, for example. Other systems such as US 6,256,033 B1 propose extracting from an image a number of coordinates corresponding to certain points on the body of a user, and comparing these to coordinates corresponding to known gestures. An obvious disadvantage of such an approach is the very high level of computational effort required to analyse the images, and that the performance of the camera used to obtain the images must be sufficient to ensure sharp images.
  • WO 03/009218 proposes the use of two cameras to obtain a stereo image of a user making a gesture, and proceeds by removing the background in the images and tracking the upper body of the user using a statistical framework for upper-body segmentation of those parts of the images corresponding to the body of the user.
  • Again, the computational effort required for gesture recognition using this proposed system is very high, so that gesture recognition will be slow.
  • A further disadvantage of such systems is the amount of hardware involved, namely a high-resolution stereo camera or multiple cameras.
  • A disadvantage common to all of the conventional gesture recognition systems is that a user must perform a gesture slowly enough to ensure that a certain level of image quality is maintained, i.e. that the images are not blurred.
  • The present invention provides a method of determining motion-related features pertaining to the motion of an object, which method comprises obtaining a sequence of images of the object and processing at least parts of the images to extract a number of first-order features from the images.
  • The term "sequence of images" can include all the images captured - for example by a camera - over a certain period of time, or only a selection of these images.
  • An image of an object for which motion-related features are to be determined can be processed in its entirety, or it may be that only a certain region of the image is processed, for instance, when only a part of the image is relevant.
  • The first-order features that are extracted from the image or image part can be blur-related features, gradient values, or any other real value that can be extracted from the image, such as, for example, a value describing the colour of a pixel.
  • The term "first-order feature" means a feature that can essentially be extracted from the image information in a first stage of computation.
  • For example, pixel colour values can be directly extracted from the image information, and gradient values can be obtained by directly processing the image data.
  • According to the invention, a number of statistical values pertaining to the first-order features is computed in a second stage of computation, and these statistical values are then combined to give one or more histograms. Based on these histograms, motion-related features are determined for the object.
  • These second-order motion-related features can be any information given by or extracted from the histograms, identifying the areas in the image that are of relevance with regard to the motion of the object, for example, blurred areas in an image caused by a moving object.
  • In particular, the histograms themselves can be regarded as a kind of second-order motion-related feature.
  • The term "object" as used here is to be interpreted to mean a user, the body of a user, or part of the user's body, but could also be any moving object that is capable of moving in a pre-defined manner, such as a pendulum, a robot or other type of machine, etc.
  • The terms "object" and "user" will be used interchangeably in the following.
  • The present invention also provides a method of performing motion classification for the motion of an object captured in a sequence of images, which method comprises determining motion-related features for the object captured in the images in the manner described above, and using the motion-related features to classify the motion of the object.
  • An obvious advantage of the method according to the invention is that motion or gesture classification can be achieved without resorting to the complex methods of image analysis required by the state of the art gesture recognition systems, and the algorithms used in the method according to the invention to extract the first-order features from an image are easily available and are well-known to a person skilled in the art, such as established algorithms used in image compression. Since statistical information pertaining to the first-order features is simply tallied or counted in histograms, which can then easily and very quickly be analysed to obtain the motion-related features, the proposed method offers clear advantages over image analysis methods used in the state of the art.
  • The method according to the invention does not require identification or localisation of the object, e.g. a person, in the image, and no modelling of body parts is required.
  • Another advantage is that a simple, low-resolution camera such as a webcam is quite sufficient for generating the images, allowing the system according to the invention to be realised cheaply and effectively. Furthermore, since the actual motion of an object does not need to be tracked directly, a low frame rate for the camera is sufficient, so that a cheap camera suffices.
  • A corresponding system for determining motion-related features pertaining to an object comprises an image source for providing a sequence of images of the object, a processing unit for processing at least parts of the images to extract a number of first-order features from the images, a computation unit for computing a number of statistical values pertaining to the first-order features, and a combining unit for combining the statistical values to give a number of histograms and for determining the motion-related features for the object, based on these histograms.
  • A corresponding system for performing motion classification comprises such a system for determining motion-related features pertaining to the object, and also a classification unit for classifying the motion of the object using the motion-related features.
  • A digital image generated by an image source such as a camera contains various types of characterizing information, such as brightness, colour or grey-scale values, contour or edge sharpness, etc., which, as mentioned above, can be used as first-order features for the method according to the invention.
  • For the purposes of classifying motion of an object captured in an image, colour or grey-scale values might be of interest, for example to track the colour of the object as it appears to move across the image.
  • However, information describing the level of blur in an image or a part of an image can be particularly informative.
  • Therefore, in a particularly preferred embodiment of the invention, the first-order features extracted from the image are blur-related features.
  • Information given by contours or edges in an image is particularly useful for image processing techniques such as object or pattern recognition. Some edges in the image might be sharp (in focus) or blurred (out of focus). For a camera with sufficient depth of field, a blurred part in an image region will indicate that an object in that part of the image was moving while the image was being generated.
  • Extraction of blur-related features from an image is preferably carried out by performing a wavelet transform on image data, for a number of scales, to determine a series of wavelet coefficients for each point in the image.
  • The wavelet coefficients, because their values depend on the discrepancies in value of neighbouring pixels, and the level of discrepancy depends in turn on the level of blur in the image, are blur-related features which ultimately provide motion information about the image.
  • Such a method, using blur-related features as first-order features, is particularly advantageous since the very fact that a moving hand, head etc. will introduce blur into an image is used to positive effect. Blur is no longer an undesirable side-effect, but provides valuable information upon which the motion classification, e.g. gesture recognition, is based.
  • The type of wavelet transform to be used might depend to some extent on the size of the image.
  • For example, the image data of a larger image with sufficient resolution could be used as input to a dyadic wavelet transform (DWT), whereas a continuous wavelet transform (CWT) could be used for images of small size, having only a relatively low resolution, such as those generated by a typical webcam.
  • The wavelet coefficients thus obtained might then be processed to obtain blur-related features for the image.
  • Each pixel or point of the image corresponds to a wavelet coefficient in each scale of the transform, and each scale has the same number of points as the image. Therefore, after carrying out a 10-scale transform, there will be ten coefficients for each point in the image.
  • For a point or pixel on an edge in the image, the modulus of the coefficient associated with that pixel will be a maximum in each scale.
  • The evolution of the ten coefficients across scales associated with the edge pixel provides information about the derivability of the image at that point.
  • Using the wavelet coefficients for a pixel in an image, the image gradient can be computed for that pixel.
  • The direction and the intensity, or magnitude, of the image gradient provide information about the direction of blur and the velocity of an object if the considered pixel belongs to a blurred edge of the moving object. Since the gradient direction is orthogonal to the edge or contour, it provides information about the shape of the object.
  • The slope of the logarithm of the coefficients' evolution across scales provides an estimate of the so-called Lipschitz coefficient, or exponent, for that pixel.
  • the Lipschitz exponent is a measure of the derivability of the image at the considered pixel and therefore a measure of the degree of continuity. If this Lipschitz exponent is positive, the pixel is located on a smooth or blurred edge. Equally, if the Lipschitz exponent is zero or negative, the pixel is located on a sharp edge. Performing the wavelet transform over a number of scales effectively provides a third dimension in addition to the two dimensions of the image, and it is in this third dimension that the evolution of the coefficients is observed.
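  • As an illustration only, the following sketch estimates such an exponent for each pixel of a single row of grey values, using the continuous wavelet transform from the PyWavelets library. The choice of the 'gaus1' wavelet, the range of ten scales and the log-log slope fit are assumptions made for this sketch, not requirements of the patent, and the constant relating the fitted slope to the Lipschitz exponent depends on the wavelet and transform actually used.

      import numpy as np
      import pywt  # PyWavelets

      def lipschitz_slopes(row, scales=np.arange(1, 11)):
          # Continuous wavelet transform of one row of pixels; coefs has
          # shape (number of scales, number of pixels).
          coefs, _ = pywt.cwt(np.asarray(row, dtype=float), scales, 'gaus1')
          log_mod = np.log2(np.abs(coefs) + 1e-12)   # avoid log(0)
          log_scale = np.log2(scales)
          # Per-pixel slope of log-modulus against log-scale: larger slopes
          # indicate smoother (more blurred) transitions, smaller slopes
          # indicate sharper edges; a wavelet-dependent offset must be
          # subtracted to obtain the exponent in the patent's convention.
          return np.polyfit(log_scale, log_mod, 1)[0]

      # A sharp step edge versus the same edge smoothed ("motion blurred"):
      sharp = np.r_[np.zeros(64), np.ones(64)]
      blurred = np.convolve(sharp, np.ones(9) / 9.0, mode='same')
      print(lipschitz_slopes(sharp)[64], lipschitz_slopes(blurred)[64])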
  • A wavelet transform is generally performed for a series of pixels, such as the pixels from a certain row or certain column in a set of image data. Therefore, Lipschitz exponents and the image gradient intensity and direction computed from the resulting wavelet coefficients provide information about the degree of blur for neighbouring pixels in a row or column and the direction of the blur, if applicable. For example, a set of mostly negative Lipschitz exponents for a row of pixels would indicate that this row of pixels comprises well-defined edges. On the other hand, mostly positive-valued Lipschitz exponents would indicate that there are smooth transitions between the pixels in this row, implying a lack of sharp edges and therefore possibly that this row belongs to a blurred region in the image.
  • The wavelet transform might be performed across successive rows of pixels and successive columns of pixels, so that the wavelet transform is performed numerous times for each image or part of the image.
  • First-order features such as the Lipschitz exponents and wavelet coefficient gradients described above are computed for each wavelet transform operation, i.e. for each processed row and column.
  • An effective way of collating the information provided by the first-order features might be to round up the values for the first-order features to certain discrete values, and to simply count the number of occurrences of each value. Therefore, according to the invention, statistical values, such as the number of occurrences of a particular coefficient, pertaining to a certain kind of first-order feature extracted from at least part of an image are combined in a histogram for this first-order feature.
  • To this end, a counter may be assigned to each discrete coefficient value, and, whenever this value occurs, the counter is incremented by one. The values thus accumulated are tallied or collected in the histogram, which can be visualised as a simple bar chart.
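  • A minimal sketch of this counting step is given below; the grid spacing of 0.5 and the value range are illustrative choices, not values taken from the patent.

      import numpy as np

      def feature_histogram(values, step=0.5, lo=-2.0, hi=2.0):
          # Round each feature value (e.g. a Lipschitz exponent) to the nearest
          # discrete grid value and count the occurrences of each grid value -
          # one counter per discrete value, as described above.
          grid = np.arange(lo, hi + step, step)
          rounded = np.clip(np.round(np.asarray(values, float) / step) * step, lo, hi)
          counts = np.array([np.isclose(rounded, g).sum() for g in grid])
          return grid, counts

      grid, counts = feature_histogram([0.12, 0.4, -0.7, 1.9, 0.45])
      print(dict(zip(grid.tolist(), counts.tolist())))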
  • To determine the movement or gesture being made by a user, a number of images (also called frames) might be captured, for example by a camera, following the movement of the user from start to finish.
  • Such a motion or movement is generally referred to as a gesture, and the various stages or positions in the movement are generally referred to as postures, without limiting the scope of the invention in any way.
  • An overall gesture is then given by a sequence of postures, and each posture in turn is captured in a sequence of images.
  • For a gesture or movement in which the user moves his right hand and arm, the images will show the user and a blurred region to the left of the image, which blurred region effectively "moves" across the images over time.
  • If the sequence of images or frames is imagined to be upright and stacked one in front of the other, the resulting three-dimensional block is a virtual volume given by the sequence of images over time. First-order features can be extracted for all of the images in such a volume, and combined to give motion-related information about the entire sequence.
  • In a preferred embodiment of the invention, therefore, first-order features are extracted for the virtual volume, for example by performing the wavelet transform for each of the images in the sequence, and combining statistical values for these features in one or more volume histograms.
  • A sequence of specific areas is identified in a sequence of images, and first-order features are extracted for the sub-volume given by the sequence of specific areas over time. Identification of the specific area of interest in an image is easily achieved by means of the first-order features for that image, for example by analysing the number of occurrences of a first-order feature for various regions of one of the images in a sequence. For example, an entire image can be partitioned or sub-divided into a number of segments or tiles, preferably in such a way that the segments or tiles overlap.
  • A sub-image histogram for a first-order feature can be compiled for each segment or tile.
  • A sub-image histogram of mostly zero or positive values for the Lipschitz coefficients indicates that this region of the image contains blur. Therefore, all regions of the image containing blur can be quickly identified.
  • A selection can be formed comprising these regions, which can be visualised as an imaginary rectangle drawn around a blurred moving arm in an image, and defined, for example, by the image coordinates of this rectangle. These image coordinates can then be used to locate the same specific area in all the images of a sequence.
  • Alternatively, each tile or segment can be analysed by pixel colour value, for instance for a moving object with a colour different from the background colour in the images. Only tiles containing pixels of a certain colour might be regarded as being of interest, so that the remaining tiles might be disregarded.
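  • The sketch below illustrates one possible way of locating such a specific area. It assumes that a per-pixel map of Lipschitz exponents is already available (for example from the row and column transforms sketched earlier); the tile size, overlap and the threshold on the fraction of positive exponents are illustrative values only.

      import numpy as np

      def blurred_region_bbox(exponent_map, tile=32, overlap=16, frac=0.6):
          # Partition the exponent map into overlapping tiles, keep tiles whose
          # values are mostly positive (i.e. mostly blurred pixels), and return
          # a bounding rectangle (x0, y0, x1, y1) around the kept tiles.
          h, w = exponent_map.shape
          step = tile - overlap
          boxes = []
          for y in range(0, max(h - tile, 1), step):
              for x in range(0, max(w - tile, 1), step):
                  patch = exponent_map[y:y + tile, x:x + tile]
                  if (patch > 0).mean() > frac:
                      boxes.append((x, y, x + tile, y + tile))
          if not boxes:
              return None                     # no blurred region found
          xs0, ys0, xs1, ys1 = zip(*boxes)
          return min(xs0), min(ys0), max(xs1), max(ys1)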
  • The step of locating the specific area of interest can be carried out at regular intervals, so that the specific area tracks the moving object in the images. In this way, even motion or gestures involving movement of the user over the entire image can be analysed. Any motion or gesture can be broken down into a series of essentially distinct consecutive stages, which, when taken together, give the overall motion or gesture.
  • The term "position" refers to any stage in the motion of an object or user that is captured in a sequence of images. Histograms of statistical values for one or more first-order features can be compiled for each specific area of a sequence of images. When considered together, therefore, the information in these histograms is characteristic of that sequence of specific areas. Since a sequence of images can be used to determine a position being held or a posture being made, the information of the histograms of a sequence of specific areas is therefore characteristic of the position being held or the posture being made.
  • The statistical values pertaining to a type of first-order feature extracted from the sequence of specific areas in a sequence of images are combined in a volume histogram for that feature.
  • A volume histogram describing a position of an object, a posture or part of a gesture will be used in the motion classification process described in more detail below. While the regions of the image defined by the specific area contain useful information about the type of motion or gesture being made, the remaining regions of the image - the "complementary area" - can also be of use.
  • The complementary area essentially comprises the pixels not included in the specific area.
  • When a specific area in an image is identified, its complementary area is also identified, and, for a sequence of complementary areas, statistical values pertaining to a certain kind of first-order feature extracted from the sequence of complementary areas are combined in a complementary volume histogram for this first-order feature.
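  • A sketch of how such a volume histogram and its complementary volume histogram might be accumulated over a sequence of frames is given below. It reuses the binning of the earlier histogram sketch and assumes one per-pixel feature map per image; the bounding box defining the specific area is simply propagated unchanged through the sequence.

      import numpy as np

      def volume_histograms(feature_maps, bbox, step=0.5, lo=-2.0, hi=2.0):
          # Accumulate counts over the specific area (inside bbox) and over the
          # complementary area (outside bbox) for every frame in the sequence.
          x0, y0, x1, y1 = bbox
          grid = np.arange(lo, hi + step, step)
          vol = np.zeros(len(grid), dtype=int)      # volume histogram
          comp = np.zeros(len(grid), dtype=int)     # complementary volume histogram
          for m in feature_maps:
              mask = np.zeros(m.shape, dtype=bool)
              mask[y0:y1, x0:x1] = True
              for target, values in ((vol, m[mask]), (comp, m[~mask])):
                  rounded = np.clip(np.round(values / step) * step, lo, hi)
                  target += np.array([np.isclose(rounded, g).sum() for g in grid])
          return grid, vol, comp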
  • The specific area and/or the complementary area of an image are preferably identified by means of the first-order features determined for that image, as explained above.
  • An appropriate boosting algorithm can be used to analyse the image data in order to identify the area or areas of interest.
  • Various boosting algorithms are available and will be known to a person skilled in the art.
  • The motion-related features obtained from a sequence of images can be used to classify the motion made by an object and captured in the sequence of images.
  • A motion-related feature might be a "distance" between two histograms. Such a distance might be computed between histograms of a specific area and the corresponding complementary area in an image, or it might be calculated between histograms of successive images in the image sequence.
  • A distance between histograms can be computed using a standard statistical distribution comparison technique, or by comparing the main properties of the histograms, such as the main gradient direction or main Lipschitz coefficient in the case of blur-related first-order features, or any other simple property of the histogram.
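  • One such standard comparison is the chi-square distance between normalised histograms, sketched below purely as an example; the patent does not commit to any particular distance measure.

      import numpy as np

      def histogram_distance(h1, h2, eps=1e-9):
          # Chi-square distance between two histograms defined over the same bins.
          p = np.asarray(h1, dtype=float)
          q = np.asarray(h2, dtype=float)
          p = p / max(p.sum(), eps)
          q = q / max(q.sum(), eps)
          return 0.5 * float(np.sum((p - q) ** 2 / (p + q + eps)))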
  • Sub-volume histograms and complementary volume histograms (collectively referred to simply as "volume histograms" in the following), generated as described above using the first-order features of the image, can ultimately describe or characterise the positions of an object in motion. Therefore, in a preferred method according to the invention, such volume histograms are analysed to obtain position-characteristic features which can be used to classify the motion.
  • Position-characteristic features, such as a ratio of positive histogram values to negative histogram values in a volume histogram for a sequence of images, can give an indication of the position and direction of motion of the object.
  • Another type of position-characteristic information might be a distance between two volume histograms for different image sequences.
  • One way of using a volume histogram might be to compare it, or a derivative of the volume histogram, to previously generated data for various different motion sequences or for a variety of gestures. For example, a volume histogram obtained for a posture tracked by a sequence of images can be compared to a number of prototype postures for a collection of gesture models, such as a collection of state-transition models for a number of gestures.
  • In this way, one or more candidate gestures can be determined which comprise the posture corresponding to the volume histogram.
  • By repeating the comparison for subsequent postures, the number of candidate gestures can be narrowed down until the gesture most likely corresponding to the sequence of postures has been identified.
  • Preferably, the position-characteristic features for a motion are used to identify a state in a generative model which is based on position or posture sub-units for a motion or gesture, e.g. a state-transition model for a motion.
  • In such a model, a state corresponds to a particular position in that motion, for example, to a certain posture of a user or to a certain stage in the motion of an object.
  • A preferred state-transition model, well known to those skilled in the art, is the Hidden Markov Model (HMM), which is used to determine the probability that, in a gesture or movement, a certain position is followed by another position.
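  • The following sketch shows how histogram distances could drive such a model: emission probabilities are obtained from the distance to each state's prototype histogram via exp(-distance) (turning a distance into a number between 0 and 1, as described further below), and a standard forward recursion scores the observed sequence against one candidate gesture model. The prototype histograms, transition matrix and start distribution are assumed to have been learned beforehand; histogram_distance refers to the example sketched above.

      import numpy as np

      def gesture_log_likelihood(observed_histograms, prototypes, transitions, start):
          # Forward algorithm for one discrete-state gesture model: 'prototypes'
          # holds one prototype posture histogram per state, 'transitions' is the
          # state-transition probability matrix and 'start' the initial state
          # distribution. Gestures are short, so no per-step rescaling is done.
          def emission(obs):
              d = np.array([histogram_distance(obs, p) for p in prototypes])
              return np.exp(-d)               # distance -> value in (0, 1]
          alpha = np.asarray(start, float) * emission(observed_histograms[0])
          for obs in observed_histograms[1:]:
              alpha = (alpha @ np.asarray(transitions, float)) * emission(obs)
          return float(np.log(alpha.sum() + 1e-300))

      # The candidate gesture whose model scores highest is the classification:
      # best = max(models, key=lambda g: gesture_log_likelihood(obs, g.prototypes,
      #                                                         g.transitions, g.start))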
  • For example, a distance between a sub-volume histogram and its complementary volume histogram could be calculated.
  • Any appropriate method of statistical distribution comparison for obtaining a single real number from a series of volume histograms can be applied.
  • These single values can be used as input to a suitable algorithm, for example a boosting algorithm, in order to build a discriminative criterion.
  • A distance between a volume histogram, generated as described above using a sequence of specific areas in a sequence of images, and a complementary volume histogram, generated as described above using the corresponding sequence of complementary areas, is calculated to give a position-characteristic feature, which can then be used in the motion classification procedure.
  • Since such position-characteristic features are derived from the histograms (second-order motion-related features), they can also be regarded as a kind of third-order motion-related feature. Using these simple third-order features, a motion model corresponding to the positions in a motion can be identified, provided the prototype positions are characterised by this type of feature.
  • The steps of the methods for obtaining motion-related features and for performing motion classification as described above can be realised in the form of software modules to be run on a processor of a corresponding system for obtaining motion-related features or a system for performing motion classification, respectively.
  • Some of the functions, for example certain image processing steps such as wavelet transforms in the case of extracting blur-related features, might be realised in the form of hardware, for example as a dedicated application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA).
  • Fig. 1 is a schematic representation of a camera capturing an image of a user making a gesture;
  • Fig. 2a is a schematic representation of a sequence of images of a person with associated motion information and blur-related features
  • Fig. 2b is a schematic representation of an image sub-divided into sub-images according to the invention.
  • Fig. 2c is a schematic representation of a specific area in an image according to the invention.
  • Fig. 3a is a schematic representation of a virtual volume given by a sequence of specific areas from a sequence of images, and its associated volume histograms;
  • Fig. 3b is a representation of a frontal view of a virtual volume, showing superimposed specific areas of a sequence of images;
  • Fig. 4 shows a state diagram of a state-transition model for a gesture, and a number of volume histograms;
  • Fig. 5 is a block diagram of a system for gesture recognition according to the invention.
  • Fig. 6 shows a flow chart for the steps of the method of performing gesture recognition according to the invention.
  • Fig. 1 shows a user 1 making a gesture in front of a camera 2, such as a webcam 2.
  • The user 1 is moving his arm in the direction of motion M, as a gesture or as part of a gesture.
  • The webcam generates an image f1 of the user 1, and this is forwarded to a suitable processing unit, not shown in the diagram, such as a personal computer.
  • Any suitable type of camera can be used, but even a low-cost webcam 2 with a typical resolution of 320 x 480 pixels is sufficient for the purpose.
  • The user 1 can make gestures associated with commands commonly occurring in a dialog between the user 1 and a dialog system.
  • The webcam 2 can generate images of the user 1 at more or less regular intervals, giving a sequence of images f1, f2, ..., fn as shown in Fig. 2a. From this sequence of images f1, f2, ..., fn, the first-order features can first be extracted.
  • The first-order features used in the following examples to illustrate the proposed method are blur-related features, chosen because of the advantages described in detail above. Without limiting the invention in any way, it is assumed that a method of gesture classification is being described.
  • In Fig. 2a, various snapshots of the user 1 have been captured in the sequence of images f1, f2, ..., fn. These snapshots of the user, taken together, combine to form a certain position or posture in an overall gesture. Edge detection is then performed on an image f1, by carrying out a wavelet transform on the rows and columns of pixels in the image f1. Depending on the size of the image f1 and the resolution of the camera 2, it might suffice to use every nth pixel, or it may be necessary to use each pixel. A set of wavelet coefficients is obtained in this way for each pixel, containing information about the discrepancies between neighbouring pixels, i.e. motion information given by the level of blur.
  • From these wavelet coefficients, discrete blur-related features are derived, in this case the Lipschitz exponents and the image gradient.
  • The occurrences of each value of a first-order feature are counted, for example the number of times a Lipschitz coefficient of 1.5 or an image gradient value of 0.5 occurs, and collected in a histogram HLC, HG for that first-order feature.
  • The resulting histograms HLC, HG therefore contain blur-related information about the overall image f1, for example the proportion of the image f1 that is blurred, or the level of sharpness of the edges in the image f1.
  • This process can be repeated for each of the remaining images f2, ..., fn in the sequence.
  • Performing edge analysis over the entire image is wasteful of resources, and can easily be circumvented.
  • A type of "sliding window" could be used to virtually travel over the image data, giving smaller sub-windows for which the wavelet transform is carried out.
  • The dimensions of the sliding window could also vary over time.
  • Fig. 2b shows another technique for reducing computational effort, one that is easier to present visually.
  • An image f1 is virtually divided into smaller sections, tiles, or sub-images.
  • Edge detection can be performed for each of the rows and columns in each of the sub-images of the overall image f1.
  • One such sub-image 20 is shown, for which first-order feature histograms sHLC, sHG have been derived. Since the process is carried out for each sub-image in the image f1, such histograms are derived for each of the sub-images. These can then be analysed to decide which of the sub-images are actually of interest. Since the interesting elements of a user making a gesture are his moving limbs, such as an arm or hand, and since moving elements will be accompanied by a certain level of blur, the "interesting" parts of the image f1 can easily be located by examining the first-order feature histograms to determine which of them are characteristic of a blurred sub-image. As mentioned above, these interesting parts of the image can also be located by using the type of "sliding window" on the image data, effectively minimising the computational effort.
  • In Fig. 2c, two such sub-images 21, 22 have been located in the image f1.
  • The image f1 could be sub-divided into many more sub-images than are shown here, and the sub-images could also overlap.
  • Only two sub-images 21, 22 of interest are shown.
  • These sub-images 21, 22 combine to form a specific area A1, indicated in the diagram by a thick rectangle.
  • A user positioned in front of a camera or webcam and making gestures will usually stand more or less in the same place, and only move one or both of his hands and arms. Therefore, the specific area identified in one image or frame can simply be propagated from one image in the sequence to the next in order to define a specific area in each of the following images or frames in a sequence. This saves on computational power and resources, since only the image data in the specific areas actually changes significantly over time.
  • The rest of the image - the "complementary area" - remains to all intents and purposes the same, and can be regarded as stationary. This is illustrated in Fig. 3a, which shows a sequence of images f1, f2, ..., fn stacked vertically one behind the other.
  • A virtual volume V results.
  • The specific area A1 shown in the first frame f1 is propagated through to all the following frames f2, ..., fn, so that these frames f2, ..., fn have their specific areas A2, ..., An in the same relative positions.
  • A virtual sub-volume Vs is created, excluding the complementary areas A1', A2', ..., An' of the images f1, f2, ..., fn.
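  • A minimal sketch of this propagation step: the rectangle defining A1 is simply cropped from every frame to build the sub-volume Vs, while copies of the frames with the specific area blanked out stand in for the complementary areas. The frames are assumed to be 2-D grey-value arrays of equal size.

      import numpy as np

      def extract_sub_volume(frames, bbox):
          # Propagate the specific area (bbox) through all frames f1..fn and
          # stack the crops into the virtual sub-volume Vs; also return the
          # complementary areas with the specific area removed.
          x0, y0, x1, y1 = bbox
          sub_volume = np.stack([f[y0:y1, x0:x1] for f in frames])
          complements = []
          for f in frames:
              c = f.copy()
              c[y0:y1, x0:x1] = 0
              complements.append(c)
          return sub_volume, complements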
  • First-order feature histograms are computed for each of the sub-images in the specific areas A1, A2, ..., An of the frames f1, f2, ..., fn.
  • In Fig. 4, a state-transition model G for a motion or gesture is shown, in this case a discrete Hidden Markov Model (HMM), in which a finite number of states S1, S2, S3, S4 correspond to a number of prototype positions of a motion, or postures of a gesture.
  • This type of model is used to classify a motion, e.g. a gesture made by the user.
  • Transitions are weighted by the probabilities of stepping from one state to another, as shown, for example, by the probabilities P(S3->S4), P(S1->S3), P(S4->S4), which give a measure of the probability of stepping or making the transition from state 3 to state 4, from state 1 to state 3, and from state 4 to state 4, respectively.
  • A volume histogram of a particular posture made by the user and captured in a sequence of images can be compared to a collection of prototype postures associated with various gestures.
  • The diagram shows four prototype posture histograms PH1, PH2, PH3, PH4 for the states S1, S2, S3, S4 of the state-transition model G, but there may be any number of such state-transition models available, each of which can be associated with various prototype postures.
  • These prototype posture histograms PH1, PH2, PH3, PH4 can have been generated previously using a suitable learning algorithm.
  • When comparing a volume histogram of a posture with the prototype postures, it may be found to be most similar to, for example, the prototype posture PH1 associated with state S1 of the state-transition model.
  • The comparison simply involves computing a distance, as already described above, between the volume histogram and the prototype histogram, and converting this distance into a probability, i.e. a number between 0 and 1.
  • Subsequent volume histograms collected over time may be found to correspond to the consecutive states S2, S3, S4, so that the gesture being made by the user would in all likelihood correspond to the gesture modelled by this state-transition model G.
  • Otherwise, the classification procedure can conclude that this state-transition model G does not correspond to a candidate gesture.
  • Fig. 5 shows a block diagram of a system 4 for performing gesture classification, without limiting the scope of the invention in any way, comprising a camera 2 for obtaining a sequence of images f1, f2, ..., fn of a user 1 (in the diagram, only a single image f1 is indicated).
  • Each image is first processed in a processing unit 5 to obtain a number of blur-related features 12.
  • The processing involves carrying out a wavelet transform on the image data to obtain Lipschitz exponents and wavelet coefficient gradients.
  • These blur-related first-order features 12 are then forwarded to a computation unit 6, in which statistical values 14 are computed for the first-order features 12.
  • The values of the first-order features 12 are first rounded up or down, as appropriate, to the closest of a set of pre-defined discrete values, such as -0.5, 0.0, 0.5, 1.0, etc., before the occurrences of each discrete value are counted, or tallied, to give a number of counts 14.
  • The counts 14 are combined to give a number of histograms HLC, HG, one for each type of blur-related feature, in this case two (Lipschitz exponents and wavelet coefficient gradients), which can be regarded as one kind of motion-related feature pertaining to the image.
  • The histograms HLC, HG can be used to derive further motion-related features Fm.
  • The ratio of the total number of positive values to the total number of zero or negative values in a histogram HLC, HG can be supplied as a motion-related feature Fm for that histogram HLC, HG.
  • A histogram HLC, HG itself can be used as a motion-related feature Fm.
  • The combining unit 15 might comprise a separate analysis unit for performing any required histogram analysis, but is shown as a single entity in the illustration.
  • The units and blocks 2, 5, 6, 15 described up to this point constitute a system 3 for determining motion-related features, which forms the front end of the system 4 for performing gesture classification.
  • The histograms HLC, HG are forwarded to a second combining unit 8, where they are collected and combined with previous histograms derived from image data of previous images in the sequence of images, or virtual image 'volume', generated by the camera 2 of the user 1 performing a gesture.
  • The output of the second combining unit 8 is then a set of volume histograms VHLC, VHG, VHLC', VHG', where the volume histograms VHLC, VHG correspond to specific areas of interest identified in the sequence of images generated by the camera 2, and the complementary volume histograms VHLC', VHG' correspond to complementary areas in the sequence of images.
  • The volume histograms VHLC, VHG, VHLC', VHG', corresponding to certain postures made by the user 1 while performing the gesture, are processed to obtain position-characteristic features Fp, such as the distances between the volume histograms VHLC, VHG, or the distances between a volume histogram VHLC, VHG and its complementary volume histogram VHLC', VHG'.
  • The volume histograms VHLC, VHG, VHLC', VHG' and the position-characteristic features Fp are forwarded to a classification unit 7, where they are analysed to classify the gesture made by the user.
  • Prototype position or posture information 13, including prototype posture histograms PH1, PH2, PH3, PH4, is retrieved from a database 9 of candidate gestures.
  • A volume histogram VHLC, VHG, VHLC', VHG' can be compared in some suitable manner to a corresponding prototype histogram PH1, PH2, PH3, PH4 to determine a number of candidate gestures.
  • By comparing successive volume histograms, obtained over time as the user moves through the postures of the gesture he is making, the number of candidate gestures from the database can be narrowed down until the most likely candidate gesture is identified.
  • The result of the classification can be forwarded as a suitable signal 10 to a further processing block 11, for example a gesture interpretation module.
  • While the second combining unit 8 is shown as part of the system 4 for performing motion classification, this second combining unit 8 could also conceivably be integrated in a system 3 for determining motion-related features.
  • Fig. 6 shows a block diagram of the main steps in the method of motion classification according to the invention, for an embodiment based on the analysis of blur-related first-order features.
  • Images 600 are obtained from an image source such as a camera.
  • Edge detection is performed on the images 600 in an edge-detection block 501, for example by performing a wavelet transform on the image data, to give blur-related information 601, such as a set of wavelet coefficients 601.
  • This blur-related information 601 is then passed to a feature extraction block 502 to give a set of first-order features 602 for each set of wavelet coefficients 601.
  • In a histogram compilation block 503, statistical values for the first-order features 602 are computed, and a number of histograms or volume histograms 603 for the first-order features is compiled for an image, a specific area of an image, a sequence of images, or a sequence of specific areas of images.
  • The histograms 603 are input to a histogram analysis block 504, in which motion-related features or position-characteristic features 604 are derived from the histograms 603.
  • These features 604 in turn are input to a motion classification block 505, where they are used to determine the motion to which the corresponding position belongs.
  • An output signal 605 can indicate the identified motion, e.g. a gesture made by a user, or whether the motion classification has failed.
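  • Purely as an illustration, the sketch below strings the earlier example functions together in the order of blocks 501-505. All helper names (lipschitz_map, blurred_region_bbox, volume_histograms, gesture_log_likelihood) refer to the hypothetical sketches above, not to the patent's own implementation, and a real system would compile one volume histogram per posture rather than a single one per sequence.

      def classify_motion(frames, models):
          # 501/502: per-pixel blur-related first-order features for each frame
          # (lipschitz_map is an assumed helper built from the row/column
          # wavelet transform sketched earlier).
          feature_maps = [lipschitz_map(f) for f in frames]
          # 503: locate the specific area in the first frame and propagate it.
          bbox = blurred_region_bbox(feature_maps[0])
          if bbox is None:
              return None                      # classification failed: no motion
          grid, vol, comp = volume_histograms(feature_maps, bbox)
          # 504/505: score the observed volume histogram(s) against each
          # candidate gesture model and return the most likely gesture.
          observations = [vol]                 # one posture only, for brevity
          scores = {name: gesture_log_likelihood(observations, m['prototypes'],
                                                 m['transitions'], m['start'])
                    for name, m in models.items()}
          return max(scores, key=scores.get)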
  • Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention.
  • The invention can be used in any system in which a simple camera and a computer with sufficient computational resources are available.
  • The method and system for gesture recognition according to the invention could be used to recognise sign-language gestures and to convert these to text.
  • Another implementation might be a surveillance system to classify and recognise different types of moving objects such as people, cars, etc.
  • Another possible implementation might be in automated production processes, where the system and method according to the invention could be used to track items being moved and to control different stages of the production process.
  • A “unit” or “module” can comprise a number of units or modules, unless otherwise stated.

Abstract

The invention relates to a method of determining motion-related features (Fm) pertaining to the motion of an object (1), which method comprises obtaining a sequence of images (f1, f2,..., fn) of the object (1), processing at least parts of the images (f1, f2,..., fn) to extract a number of first-order features from the images (f1, f2,..., fn), computing a number of statistical values pertaining to the first-order features, combining the statistical values into a number of histograms (HLC, HG), and determining the motion-related features (Fm) for the object (1) based on the histograms (HLC, HG). Furthermore, the invention relates to a method for performing motion classification for the motion of an object (1) captured in a sequence of images (f1, f2,..., fn). Such a motion classification method comprises determining motion-related features (Fm) for the images (f1, f2,..., fn) using the described method of determining motion-related features (Fm), and using the motion-related features (Fm) to classify the motion of the object (1). The invention also relates to a system (3) for determining motion-related features (Fm) pertaining to an object (1) in an image (f1), and to a system (4) for performing motion classification for the motion of an object (1).

Description

METHOD OF DETERMINING MOTION-RELATED FEATURES AND METHOD OF PERFORMING MOTION CLASSIFICATION
The invention relates to a method of determining motion-related features pertaining to the motion of an object, and to a method of performing motion classification for the motion of an object. Furthermore, the invention relates to a system for determining motion-related features pertaining to an object, and to a system for performing motion classification.
One of the most natural communication means between humans is that of making movements with the hands or head, also called gestures, as a supplement to spoken communication, or as a means of communication in itself. A simple augmentation to speech can comprise, for example, moving one's hand and arm to point in a particular direction when giving another person directions. An example of communication solely by motion is sign language, based entirely on gestures. Since gesturing is second nature to humans, it follows that gesturing would be a natural and easy means of communication between humans and computers, for example in a home dialog system, in a dialog system in an office environment, for automatic sign language interpretation, etc. Owing to developments in technology in recent years, it is conceivable that the use of various dialog systems will become more and more widespread in the future, and that classification of a user's movements, i.e. gesture recognition, will play an increasingly important role.
Several gesture recognition systems have been proposed in the state of the art. Generally, these proposed systems rely on image analysis to locate a user in each of a sequence of images and to analyse successive postures to determine the gesture he is making. Some systems require that the user wear a coloured suit or position sensors, a requirement which would be most undesirable for the average user of a home dialog system, for example. Other systems such as US 6,256,033 B1 propose extracting from an image a number of coordinates corresponding to certain points on the body of a user, and comparing these to coordinates corresponding to known gestures. An obvious disadvantage of such an approach is the very high level of computational effort required to analyse the images, and that the performance of the camera used to obtain the images must be sufficient to ensure sharp images. Other conventional systems attempt to solve the problem of gesture recognition by working with stereo image data obtained by using two or more cameras. For example, WO 03/009218 proposes the use of two cameras to obtain a stereo image of a user making a gesture, and proceeds by removing the background in the images and tracking the upper body of the user using a statistical framework for upper-body segmentation of those parts of the images corresponding to the body of the user. Again, the computational effort required for gesture recognition using this proposed system is very high, so that gesture recognition will be slow. A further disadvantage of such systems is the amount of hardware involved, namely a high-resolution stereo camera or multiple cameras. A disadvantage common to all of the conventional gesture recognition systems is that a user must perform a gesture slowly enough to ensure that a certain level of image quality is maintained, i.e. that the images are not blurred. The solutions suggested by the state of the art are therefore too expensive, too slow, or too unwieldy for gesture recognition, particularly for systems that need to be non-intrusive, user-friendly, and inexpensive. Often, the effort required by these systems simply to determine information about movement of a user, e.g. to determine motion-related features suitable for use in a motion classification process, is just too costly.
It is therefore an object of the invention to provide easy and economical methods of determining motion-related features pertaining to an object in an image and for classifying the motion of an object. Moreover, it is an object of the invention to provide appropriate systems for these tasks. To this end, the present invention provides a method of determining motion-related features pertaining to the motion of an object, which method comprises obtaining a sequence of images of the object and processing at least parts of the images to extract a number of first-order features from the images. Thereby, the term "sequence of images" can include all the images captured - for example by a camera - over a certain period of time, or only a selection of these images. For example, for a camera with a fast shutter speed, it might suffice to take every tenth image, whereas, for a low-cost camera with slow shutter speed, it may be preferable to use all the images. An image of an object for which motion-related features are to be determined can be processed in its entirety, or it may be that only a certain region of the image is processed, for instance, when only a part of the image is relevant. The first-order features that are extracted from the image or image part can be blur-related features, gradient values, or any other real value that can be extracted from the image, such as, for example, a value describing the colour of a pixel. The term "first-order" feature means a feature that can essentially be extracted from the image information in a first stage of computation. For example, pixel colour values can be directly extracted from the image information, and gradient values can be obtained by directly processing the image data. According to the invention, a number of statistical values pertaining to the first-order features is computed in a second stage of computation, and these statistical values are then combined to give one or more histograms. Based on these histograms, motion-related features are determined for the object. These second-order motion-related features can be any information given by or extracted from the histograms, identifying the areas in the image that are of relevance with regard to the motion of the object, for example, blurred areas in an image caused by a moving object. In particular, the histograms themselves can be regarded as a kind of second-order motion-related feature. The term "object" as used here is to be interpreted to mean a user, the body of a user, or part of the user's body, but could also be any moving object that is capable of moving in a pre-defined manner, such as a pendulum, a robot or other type of machine, etc. The terms "object" and "user" will be used interchangeably in the following. The present invention also provides a method of performing motion classification for the motion of an object captured in a sequence of images, which method comprises determining motion-related features for the object captured in the images in the manner described above, and using the motion-related features to classify the motion of the object.
An obvious advantage of the method according to the invention is that motion or gesture classification can be achieved without resorting to the complex methods of image analysis required by the state of the art gesture recognition systems, and the algorithms used in the method according to the invention to extract the first-order features from an image are easily available and are well-known to a person skilled in the art, such as established algorithms used in image compression. Since statistical information pertaining to the first-order features is simply tallied or counted in histograms, which can then easily and very quickly be analysed to obtain the motion-related features, the proposed method offers clear advantages over image analysis methods used in the state of the art. The method according to the invention does not require identification or localisation of the object, e.g. a person, in the image, and no modelling of body parts is required. Another advantage is that a simple, low-resolution camera such as a webcam is quite sufficient for generating the images, allowing the system according to the invention to be realised cheaply and effectively. Furthermore, since the actual motion of an object does not need to be tracked directly, a low frame rate for the camera is sufficient, so that a cheap camera suffices.
A corresponding system for determining motion-related features pertaining to an object comprises an image source for providing a sequence of images of the object, a processing unit for processing at least parts of the images to extract a number of first-order features from the images, a computation unit for computing a number of statistical values pertaining to the first-order features, and a combining unit for combining the statistical values to give a number of histograms and for determining the motion-related features for the object, based on these histograms.
In addition, a corresponding system for performing motion classification comprises such a system for determining motion-related features pertaining to the object, and also a classification unit for classifying the motion of the object using the motion-related features.
The dependent claims and the subsequent description disclose particularly advantageous embodiments and features of the invention. A digital image generated by an image source such as a camera contains various types of characterizing information, such as brightness, colour or grey-scale values, contour or edge sharpness, etc., which, as mentioned above, can be used as first-order features for the method according to the invention. For the purposes of classifying motion of an object captured in an image, colour or grey-scale values might be of interest, for example to track the colour of the object as it appears to move across the image. However, information describing the level of blur in an image or a part of an image can be particularly informative. If a region of an image is blurred, for example, it might be that this region corresponds to, say, the hand or arm of a user making a gesture or other movement in front of the camera. Therefore, in a particularly preferred embodiment of the invention, the first-order features extracted from the image are blur-related features. Information given by contours or edges in an image is particularly useful for image processing techniques such as object or pattern recognition. Some edges in the image might be sharp (in focus) or blurred (out of focus). For a camera with sufficient depth of field, a blurred part in an image region will indicate that an object in that part of the image was moving while the image was being generated. Several techniques are available for identifying the edges in an image, and the technique of wavelet transformation has been developed in recent years for such applications as efficient image compression. The technique will be known to a person skilled in the art, and need not be explained in detail here. It suffices to say that the technique of wavelet transformation can be used to quickly and reliably identify the edges in an image, and to obtain information as to the level of sharpness of these edges.
Therefore, according to a preferred embodiment of the invention, extraction of blur-related features from an image is preferably carried out by performing a wavelet transform on image data, for a number of scales, to determine a series of wavelet coefficients for each point in the image. The wavelet coefficients, because their values depend on the discrepancies in value of neighbouring pixels, and the level of discrepancy depends in turn on the level of blur in the image, are blur-related features which ultimately provide motion information about the image. Such a method, using blur-related features as first-order features, is particularly advantageous since the very fact that a moving hand, head etc. will introduce blur into an image is used to positive effect. Blur is no longer an undesirable side-effect, but provides valuable information upon which the motion classification, e.g. gesture recognition, is based.
The type of wavelet transform to be used might depend to some extent on the size of the image. For example, the image data of a larger image with sufficient resolution could be used as input to a dyadic wavelet transform (DWT), whereas a continuous wavelet transform (CWT) could be used for images of small size, having only a relatively low resolution, such as those generated by a typical webcam. The wavelet coefficients thus obtained might then be processed to obtain blur-related features for the image. Each pixel or point of the image corresponds to a wavelet coefficient in each scale of the transform, and each scale has the same number of points as the image. Therefore, after carrying out a 10-scale transform, there will be ten coefficients for each point in the image. For a point or pixel on an edge in the image, the modulus of the coefficient associated with the pixel will be a maximum in each scale. The evolution of the ten coefficients across scales associated with the edge pixel provides information about the derivability of the image at that point. Using the wavelet coefficients for a pixel in an image, the image gradient can be computed for that pixel. The direction and the intensity, or magnitude, of the image gradient provide information about the direction of blur and the velocity of an object if the considered pixel belongs to a blurred edge of the moving object. Since the gradient direction is orthogonal to the edge or contour, it provides information about the shape of the object. Similarly, low-order gradient values derived from the pixels in a row or column imply diffuse or smooth edges, and high-order gradient values imply sharp or distinct edges. Therefore, histograms of image gradient intensity and direction provide information about shapes and motion field in the considered areas of the images.
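The gradient computation mentioned above can be illustrated with a short sketch. It assumes that horizontal and vertical detail coefficients (wx, wy) at one scale are already available for every pixel, for example from a two-dimensional wavelet transform; how they are obtained is left open here.

    import numpy as np

    def gradient_from_coefficients(wx, wy):
        # Per-pixel gradient magnitude and direction from horizontal (wx) and
        # vertical (wy) wavelet detail coefficients at a single scale. The
        # direction is orthogonal to the local edge or contour, and histograms
        # of magnitude and direction describe shape and motion field.
        magnitude = np.hypot(wx, wy)
        direction = np.arctan2(wy, wx)
        return magnitude, direction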
The slope of the coefficients' evolution across scales, taken on a logarithmic scale, provides an estimate of the so-called Lipschitz coefficient, or exponent, for that pixel. The Lipschitz exponent is a measure of the derivability of the image at the considered pixel and therefore a measure of the degree of continuity. If this Lipschitz exponent is positive, the pixel is located on a smooth or blurred edge. Conversely, if the Lipschitz exponent is zero or negative, the pixel is located on a sharp edge. Performing the wavelet transform over a number of scales effectively provides a third dimension in addition to the two dimensions of the image, and it is in this third dimension that the evolution of the coefficients is observed. A wavelet transform is generally performed for a series of pixels, such as the pixels from a certain row or certain column in a set of image data. Therefore, Lipschitz exponents and the image gradient intensity and direction computed from the resulting wavelet coefficients provide information about the degree of blur for neighbouring pixels in a row or column and the direction of the blur, if applicable. For example, a set of mostly negative Lipschitz exponents for a row of pixels would indicate that this row of pixels comprises well-defined edges. On the other hand, mostly positive-valued Lipschitz exponents would indicate that there are smooth transitions between the pixels in this row, implying a lack of sharp edges and therefore possibly that this row belongs to a blurred region in the image.
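A minimal sketch of this estimation step is given below, under the assumption that the coefficient modulus grows as a power of the scale near an edge. The ten scales, the least-squares fit and the 0.5 normalisation offset are illustrative assumptions, not values taken from the patent text.

```python
# Hedged sketch: estimate a Lipschitz-type exponent for every pixel of one row
# from the slope of log|coefficient| versus log(scale) across a 10-scale CWT.
import numpy as np
import pywt

def lipschitz_estimates(row, scales=np.arange(1, 11), wavelet="gaus1"):
    coef, _ = pywt.cwt(row.astype(float), scales, wavelet)   # shape (10, len(row))
    moduli = np.abs(coef) + 1e-12                             # avoid log(0)
    log_s = np.log(scales)
    # least-squares slope of log|W(s, x)| against log(s), fitted per pixel x
    slope = np.polyfit(log_s, np.log(moduli), deg=1)[0]
    return slope - 0.5   # rough per-pixel Lipschitz exponent (normalisation assumed)
```

Positive values of the returned estimate would then flag smooth or blurred edges, while zero or negative values would flag sharp edges, in line with the interpretation above.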
The wavelet transform might be performed across successive rows of pixels and successive columns of pixels, so that the wavelet transform is performed numerous times for each image or part of the image. First-order features such as the Lipschitz exponents and wavelet coefficient gradients described above are computed for each wavelet transform operation, i.e. for each processed row and column.
An effective way of collating the information provided by the first-order features might be to round the values for the first-order features to certain discrete values, and to simply count the number of occurrences of each value. Therefore, according to the invention, statistical values, such as the number of occurrences of a particular coefficient, pertaining to a certain kind of first-order feature extracted from at least part of an image are combined in a histogram for this first-order feature. To this end, in a system for obtaining motion-related features pertaining to the motion of an object in an image, a counter may be assigned to each discrete coefficient value, and, whenever this value occurs, the counter is incremented by one. The values thus accumulated are tallied or collected in the histogram, which can be visualised as a simple bar chart. To determine the movement or gesture being made by a user, a number of images (also called frames) might be captured, for example by a camera, following the movement of the user from start to finish. Such a motion or movement is generally referred to as a gesture, and the various stages or positions in the movement are generally referred to as postures, without limiting the scope of the invention in any way. An overall gesture is then given by a sequence of postures, and each posture in turn is captured in a sequence of images. For a gesture or movement in which the user moves his right hand and arm, the images will show the user and a blurred region to the left of the image, which blurred region effectively "moves" across the images over time. If a sequence of images or frames is imagined to be upright and stacked one in front of the other, the resulting three-dimensional block is a virtual volume given by the sequence of images over time. First-order features can be extracted for all of the images in such a volume, and combined to give motion-related information about the entire sequence. In a preferred embodiment of the invention, therefore, first-order features are extracted for the virtual volume, for example by performing the wavelet transform for each of the images in the sequence, and combining statistical values for these features in one or more volume histograms.
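As a simple illustration of the counting step (the function name and the bin edges are assumptions made for this example, not part of the claimed method), the discretisation and tallying can be done with a fixed-bin histogram:

```python
# Minimal sketch of the counting step: feature values are rounded onto a fixed
# set of discrete levels and the occurrences of each level are tallied.
import numpy as np

def feature_histogram(values, bin_width=0.5, lo=-4.0, hi=4.0):
    edges = np.arange(lo, hi + bin_width, bin_width)          # discrete levels
    counts, _ = np.histogram(np.clip(values, lo, hi), bins=edges)
    return counts                                             # one counter per level

# A volume histogram for an image sequence is simply the element-wise sum of the
# per-image histograms, which extends the same counters over time:
#   volume_hist = sum(feature_histogram(v) for v in per_image_feature_values)
```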
However, as mentioned above, only a part of the image may actually be of interest. Therefore, to reduce unnecessary computation, preferably only those areas in the images which are actually of interest are processed. For example, if the user is making a gesture with only one hand and arm, as described above, and the region in the image comprising this hand and arm is only a part of the overall image, this region or specific area can be identified in one image, and only the corresponding specific areas of successive images in the sequence of images need be processed. This takes advantage of the fact that a user will generally stay more or less in the same place while making a gesture in front of a camera. If the images or frames are imagined to be upright, stacked one behind the other, and the specific areas of each of the successive frames are imagined to be "connected", the result is a virtual sub-volume. In a particularly preferred embodiment of the invention, therefore, a sequence of specific areas is identified in a sequence of images, and first-order features are extracted for the sub-volume given by the sequence of specific areas over time. Identification of the specific area of interest in an image is easily achieved by means of the first-order features for that image, for example by analysing the number of occurrences of a first-order feature for various regions of one of a sequence of images. For example, an entire image can be partitioned or sub-divided into a number of segments or tiles, preferably in such a way that the segments or tiles overlap. A sub-image histogram for a first-order feature can be compiled for each segment or tile. For example, a sub-image histogram of mostly zero values or positive values for Lipschitz coefficients indicates that this region of the image contains blur. Therefore, all regions of the image containing blur can be quickly identified. A selection comprising these regions can be formed, which can be visualised as an imaginary rectangle drawn around a blurred moving arm in an image and defined, for example, by the image coordinates of this rectangle. These image coordinates can then be used to locate the same specific area in all the images of a sequence. Equally, each tile or segment can be analysed by pixel colour value, for instance for a moving object with a colour different from the background colour in the images. Only tiles containing pixels of a certain colour might be regarded as being of interest, so that the remaining tiles might be disregarded.
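The tiling-and-selection step described above could look roughly like the following sketch, which operates on a per-pixel map of Lipschitz exponents. The tile size, stride and blur threshold are assumptions, and in practice a sliding window or a boosting-based detector could replace this simple rule.

```python
# Sketch of locating the specific area: cut the image into overlapping tiles,
# keep tiles whose Lipschitz exponents are mostly non-negative (i.e. blurred),
# and merge the kept tiles into one bounding rectangle.
import numpy as np

def find_specific_area(lipschitz_map, tile=32, stride=16, blur_ratio=0.6):
    rows, cols = lipschitz_map.shape
    boxes = []
    for y in range(0, rows - tile + 1, stride):
        for x in range(0, cols - tile + 1, stride):
            patch = lipschitz_map[y:y + tile, x:x + tile]
            if np.mean(patch >= 0.0) > blur_ratio:            # mostly smooth edges
                boxes.append((y, x, y + tile, x + tile))
    if not boxes:
        return None
    boxes = np.array(boxes)
    # bounding rectangle around all blurred tiles = the specific area
    return (boxes[:, 0].min(), boxes[:, 1].min(), boxes[:, 2].max(), boxes[:, 3].max())
```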
The step of locating the specific area of interest can be carried out at regular intervals, so that the specific area tracks the moving object in the images. In this way, even motion or gestures involving movement of the user over the entire image can be analysed. Any motion or gesture can be broken down into a series of essentially distinct consecutive stages, which, when taken together, give the overall motion or gesture. In the following, therefore, the term "position" refers to any stage in the motion of an object or user that is captured in a sequence of images. Histograms of statistical values for one or more first-order features can be compiled for each specific area of a sequence of images. When considered together, therefore, the information in these histograms is characteristic of that sequence of specific areas. And, since a sequence of images can be used to determine a position being held or a posture being made, the information of the histograms of a sequence of specific areas is therefore characteristic of the position being held or the posture being made.
Therefore, in a preferred embodiment of the invention, the statistical values pertaining to a type of first-order feature extracted from the sequence of specific areas in a sequence of images are combined in a volume histogram for that feature. Such a volume histogram, describing a position of an object, a posture or part of a gesture, will be used in the motion classification process described in more detail below. While the regions of the image defined by the specific area contain useful information about the type of motion or gesture being made, the remaining regions of the image - the "complementary area" - can also be of use. The complementary area essentially comprises the pixels not included in the specific area. Therefore, in a preferred embodiment of the invention, when a specific area in an image is identified, its complementary area is also identified, and, for a sequence of complementary areas, statistical values pertaining to a certain kind of first-order feature extracted from the sequence of complementary areas are combined in a complementary volume histogram for this first-order feature.
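A hedged sketch of how the per-frame histograms for the specific area and its complementary area might be accumulated into a volume histogram and a complementary volume histogram is given below; `feature_histogram` is the hypothetical helper sketched earlier, and the use of a single fixed bounding box per sequence is an assumption.

```python
# Sketch of accumulating volume histograms over a sequence of feature maps
# (one per frame), splitting each frame into specific and complementary areas.
import numpy as np

def volume_histograms(feature_maps, box):
    y0, x0, y1, x1 = box
    vh = None          # volume histogram (specific areas)
    vh_c = None        # complementary volume histogram
    for fmap in feature_maps:                       # one feature map per frame
        inside = fmap[y0:y1, x0:x1].ravel()
        mask = np.ones(fmap.shape, dtype=bool)
        mask[y0:y1, x0:x1] = False
        outside = fmap[mask]
        h, h_c = feature_histogram(inside), feature_histogram(outside)
        vh = h if vh is None else vh + h
        vh_c = h_c if vh_c is None else vh_c + h_c
    return vh, vh_c
```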
The specific area and/or the complementary area of an image are preferably identified by means of the first-order features determined for that image as explained above. For example, an appropriate boosting algorithm can be used to analyse the image data in order to identify the area or areas of interest. Various boosting algorithms are available and will be known to a person skilled in the art.
The motion-related features obtained from a sequence of images can be used to classify the motion made by an object and captured in the sequence of images. For example, a motion-related feature might be a "distance" between two histograms. Such a distance might be computed between histograms of a specific area and the corresponding complementary area in an image, or it might be calculated between histograms of successive images in the image sequence. A distance between histograms can be computed using a standard statistical distribution technique or by comparing the main properties of the histograms, such as the main gradient direction or main Lipschitz coefficient in the case of blur-related first-order features, or any other simple property of the histogram. These distances are ultimately an indication of the degree of change, with regard to the motion-related features, of the image regions. As described above, a sequence of images can be seen to correspond to a particular position or stationary part of the trajectory in the motion of an object, or to a particular posture of a gesture being made by a user, so that, according to the invention, sub-volume histograms and complementary volume histograms (simply referred to as "volume histograms" collectively in the following), generated as described above using the first-order features of the image, can ultimately describe or characterise the positions of an object in motion. Therefore, in a preferred method according to the invention, such volume histograms are analysed to obtain position-characteristic features which can be used to classify the motion. For example, position-characteristic features such as a ratio of positive histogram values to negative histogram values in a volume histogram for a sequence of images can give an indication of the position and direction of motion of the object. Another type of position-characteristic information might be a distance between two volume histograms for different image sequences.
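Any standard histogram comparison can serve as the distance mentioned here; the chi-squared form below is one common choice and is an assumption for illustration, not the patent's prescribed measure.

```python
# Sketch of a histogram distance usable as a motion-related or
# position-characteristic feature (chi-squared distance on normalised counts).
import numpy as np

def hist_distance(h1, h2, eps=1e-9):
    p = h1 / (h1.sum() + eps)
    q = h2 / (h2.sum() + eps)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```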
One way of using a volume histogram might be to compare it, or a derivative of the volume histogram, to previously generated data for various different motion sequences or for a variety of gestures. For example, a volume histogram obtained for a posture tracked by a sequence of images can be compared to a number of
"prototype postures" for a collection of gesture models, such as a collection of state- transition models for a number of gestures. In this way, one or more candidate gestures can be determined which comprise the posture corresponding to the volume histogram. By identifying successive prototypes for the volume histograms of a sequence of postures, the number of candidate gestures can be narrowed down until the gesture most likely corresponding to the sequence of postures has been identified.
Therefore, in a preferred embodiment of the invention, the position-characteristic features for a motion, obtained from the analysis of a sequence of images, are used to identify a state in a generative model which is based on position or posture sub-units for a motion or gesture, e.g. a state-transition model for a motion. Such a state corresponds to a particular position in that motion, for example, to a certain posture of a user or to a certain stage in the motion of an object. A preferred state-transition model, well known to those skilled in the art, is the Hidden Markov Model (HMM), which is used to determine the probability that, in a gesture or movement, a certain position is followed by another position. However, simpler features can be derived from the histograms. For instance, as described above, a distance between a sub-volume histogram and its complementary volume histogram could be calculated. Essentially, any appropriate method of statistical distribution comparison for obtaining a single real number from a series of volume histograms can be applied. These single values can be used as input to a suitable algorithm, for example a boosting algorithm, in order to build a discriminative criterion.
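The following greedy sketch illustrates, in very simplified form, how a sequence of volume histograms might be scored against a gesture's state-transition model. A real implementation would typically use the HMM forward or Viterbi algorithm; the exp(-distance) likelihood, the prototype list and the transition-matrix layout are all assumptions, and `hist_distance` is the hypothetical helper sketched above.

```python
# Rough sketch: score one candidate gesture model against a sequence of observed
# volume histograms by greedily matching each histogram to the closest prototype
# state and accumulating log-likelihoods and log-transition probabilities.
import numpy as np

def score_gesture(volume_hists, prototypes, transitions):
    """prototypes: per-state prototype histograms; transitions[i][j]: P(state i -> j)."""
    state = None
    log_score = 0.0
    for vh in volume_hists:
        likelihoods = np.array([np.exp(-hist_distance(vh, p)) for p in prototypes])
        nxt = int(np.argmax(likelihoods))
        log_score += np.log(likelihoods[nxt] + 1e-12)
        if state is not None:
            log_score += np.log(transitions[state][nxt] + 1e-12)
        state = nxt
    return log_score   # compare across candidate gesture models and pick the maximum
```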
Therefore, in another particularly preferred embodiment of the invention, a distance between a volume histogram, generated as described above using a sequence of specific areas in a sequence of images, and a complementary volume histogram, generated as described above using the corresponding sequence of complementary areas in a sequence of images, is calculated to give a position-characteristic feature, which feature can then be used in the motion classification procedure.
Since the position-characteristic features are derived from the histograms (second-order motion-related features), they can also be regarded as a kind of third-order motion-related feature. Using these simple third-order features, a motion model corresponding to the positions in a motion can be identified, provided the prototype positions are characterised by this type of feature.
The steps of the methods for obtaining motion-related features and for performing motion classification as described above can be realised in the form of software modules to be run on a processor of a corresponding system for obtaining motion-related features or a system for performing motion classification, respectively. Some of the functions, for example certain image processing steps such as wavelet transforms in the case of extracting blur-related features, might be realised in the form of hardware, for example, as a dedicated application specific integrated circuit (ASIC) or field-programmable gate array (FPGA).
Other objects and features of the present invention will become apparent from the following detailed descriptions considered in conjunction with the accompanying drawing. It is to be understood, however, that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. Fig. 1 is a schematic representation of a camera capturing an image of a user making a gesture; Fig. 2a is a schematic representation of a sequence of images of a person with associated motion information and blur-related features;
Fig. 2b is a schematic representation of an image sub-divided into sub-images according to the invention;
Fig. 2c is a schematic representation of a specific area in an image according to the invention;
Fig. 3a is a schematic representation of a virtual volume given by a sequence of specific areas from a sequence of images, and its associated volume histograms;
Fig. 3b is a representation of a frontal view of a virtual volume, showing superimposed specific areas of a sequence of images; Fig. 4 shows a state diagram of a state-transition model for a gesture, and a number of volume histograms;
Fig. 5 is a block diagram of a system for gesture recognition according to the invention;
Fig. 6 shows a flow chart for the steps of the method of performing gesture recognition according to the invention.
In the diagrams, like numbers refer to like objects throughout. Fig. 1 shows a user 1 making a gesture in front of a camera 2, such as a webcam 2. Here, the user 1 is moving his arm in the direction of motion M, as a gesture or as part of a gesture. The webcam generates an image fi of the user 1, and this is forwarded to a suitable processing unit, not shown in the diagram, such as a personal computer. Any suitable type of camera can be used, but even a low-cost webcam 2 with a typical resolution of 320 x 480 pixels is sufficient for the purpose. The user 1 can make gestures associated with commands commonly occurring in a dialog between the user 1 and a dialog system, e.g. "stop interaction", "wait", "go back", "continue", "help", etc. This obviates the need for speech interaction, for example in noisy environments, or when the user 1 is communicating by means of sign language. The user can also provide additional gesture information to supplement a spoken command, e.g. by pointing in a particular direction and verbally instructing the camera to look in that direction, e.g. by saying "watch this way".
The webcam 2 can generate images of the user 1 at more or less regular intervals, giving a sequence of images fi, f2, ..., fn as shown in Fig. 2a. From this sequence of images fi, f2, ..., fn, the first-order features can first be extracted. The first-order features used in the following examples to illustrate the proposed method are blur-related features, chosen because of the advantages described in detail above. Without limiting the invention in any way, it is assumed that a method of gesture classification is being described.
As illustrated in Fig. 2a, various snapshots of the user 1 have been captured in the sequence of images fi, f2, ..., fn. These snapshots of the user, taken together, combine to form a certain position or posture in an overall gesture. Edge-detection is then performed on an image fi, by carrying out a wavelet transform on the rows and columns of pixels in the image fi. Depending on the size of the image fi and the resolution of the camera 2, it might suffice to use every nth pixel, or it may be necessary to use each pixel. A set of wavelet coefficients is obtained in this way for each pixel, containing information about the discrepancies between neighbouring pixels, i.e. motion information given by the level of blur.
For each set of wavelet coefficients, discrete blur-related features are derived, in this case the Lipschitz exponents and the image gradient. The occurrences of each first-order feature value are counted, for example the number of times a Lipschitz coefficient of 1.5 or an image gradient value of 0.5π occurs, and collected in a histogram HLC, HG for that first-order feature. The resulting histogram HLC, HG therefore contains blur-related information about the overall image fi, for example the proportion of the image fi that is blurred, or the level of sharpness of the edges in the image fi.
This process can be repeated for each image of the remaining images f2, ..., fn in the sequence. However, performing edge analysis over the entire image is wasteful of resources, and can easily be circumvented. For instance, a type of "sliding window" could be used to virtually travel over the image data, giving smaller sub-windows for which the wavelet transform is carried out. The dimensions of the sliding window could also vary over time. For ease of clarification, Fig. 2b shows another technique of reducing computational effort, easier to present visually. Here, an image fi is virtually divided into smaller sections, tiles, or sub-images. Edge detection can be performed for each of the rows and columns in each of the sub-images of the overall image fi. One such sub-image 20 is shown, for which first-order feature histograms sHLC, sHG have been derived. Since the process is carried out for each sub-image in the image fi, such histograms are derived for each of the sub-images. These can then be analysed to decide which of the sub-images are actually of interest. Since the interesting elements of a user making a gesture are his moving limbs such as arm, hand, etc., and since moving elements will be accompanied by a certain level of blur, the "interesting" parts of the image fi can easily be located by examining the first-order feature histograms to determine which of them is characteristic of a blurred sub-image. As mentioned above, these interesting parts of the image can be located easily by using the type of "sliding window" on the image data, effectively minimising the computational effort.
In Fig. 2c, two such sub-images 21, 22 have been located in the image fi. Naturally, the image fi could be sub-divided into many more sub-images than are shown here, and the sub-images could also overlap. For the sake of simplicity, however, only two sub-images 21, 22 of interest are shown. These sub-images 21, 22 combine to form a specific area Ai, indicated in the diagram by a thick rectangle.
A user positioned in front of a camera or webcam and making gestures will usually stand more or less in the same place, and only move one or both of his hands and arms. Therefore, the specific area identified in one image or frame can simply be propagated from one image in the sequence to the next in order to define a specific area in each of the following images or frames in a sequence. This saves on computational power and resources, since only the image data in the specific areas actually changes significantly over time. The rest of the image - the "complementary area" - remains to all intents and purposes the same, and can be regarded as stationary. This is illustrated in Fig. 3a, which shows a sequence of images fi, f2, ..., fn stacked vertically one behind the other. If imaginary lines are drawn to connect the corners of these frames fi, f2, ..., fn, a virtual volume V results. The specific area Ai shown in the first frame fi is propagated through to all the following frames f2, ..., fn, so that these frames f2, ..., fn have their specific areas A2, ..., An in the same relative positions. Again, if imaginary lines are drawn to connect the corners of these specific areas Ai, A2, ..., An, a virtual sub-volume Vs is created, excluding the complementary areas Ai', A2', ..., An' of the images fi, f2, ..., fn. As already described under Fig. 2a and 2b, first-order feature histograms are computed for each of the sub-images in the specific areas Ai, A2, ..., An of the frames fi, f2, ..., fn. By combining these histograms into corresponding volume histograms VHLC, VHG, the information contained in these histograms is extended to include time as well. In other words, the changes in the level of blur in the specific areas of the images are tracked over time as well. This is illustrated graphically in Fig. 3b, in which the image data of the sequence of frames fi, f2, ..., fn have been superimposed in the image 30, and all the "interesting" information generated by the user while making a gesture is found in the region 31, corresponding to the specific areas Ai, A2, ..., An in the sequence of images fi, f2, ..., fn.
In Fig. 4, a state-transition model G for a motion or gesture is shown, in this case a discrete Hidden Markov Model (HMM), in which a finite number of states Si, S2, S3, S4 correspond to a number of prototype positions of a motion, or postures of a gesture. This type of model is used to classify a motion, e.g. a gesture made by the user. Transitions are weighted by the probabilities of stepping from one state to another, as shown, for example, by the probabilities P(S3->S4), P(Si->S3), P(S4->S4), which give a measure of the probability of stepping or making the transition from state 3 to state 4, from state 1 to state 3, and from state 4 to state 4, respectively.
For example, a volume histogram of a particular posture made by the user and captured in a sequence of images can be compared to a collection of prototype postures associated with various gestures. The diagram shows four prototype posture histograms PHi, PH2, PH3, PH4 for the states Si, S2, S3, S4 of the state transition model G, but there may be any number of such state transition models available, each of which can be associated with various prototype postures. These prototype posture histograms PHi, PH2, PH3, PH4 can have been generated previously using a suitable learning algorithm. When comparing a volume histogram of a posture with the prototype postures, it may be found to be most similar to, for example, the prototype posture PHi associated with state Si of the state transition model. The comparison simply involves computing a distance, as already described above, between the volume histogram and the prototype histogram and converting this distance into a probability that is a number between 0 and 1. Subsequent volume histograms collected over time may be found to correspond to the consecutive states S2, S3, S4 so that the gesture being made by the user would in all likelihood correspond to the gesture modelled by this state-transition model G. In the event that a further volume histogram does not correspond to the next state S2, the classification procedure can conclude that this state transition model G does not correspond to a candidate gesture.
Fig. 5 shows a block diagram of a system 4 for performing gesture classification, without limiting the scope of the invention in any way, comprising a camera 2 for obtaining a sequence of images fi, f2, ..., fn of a user 1 (in the diagram, only a single image fi is indicated). Each image is first processed in a processing unit 5 to obtain a number of blur-related features 12. In this embodiment, the processing involves carrying out a wavelet transform on the image data to obtain Lipschitz exponents and wavelet coefficient gradients. These blur-related first-order features 12 are then forwarded to a computation unit 6, in which statistical values 14 are computed for the first-order features 12. In this computation unit 6, the values of the first-order features 12 are first rounded up or down, as appropriate, to the closest of a set of pre-defined discrete values, such as -0.5, 0.0, 0.5, 1.0, etc., before the occurrences of each discrete value are counted, or tallied, to give a number of counts 14. In a combining unit 15, the counts 14 are combined to give a number of histograms HLC, HG, one for each type of blur-related feature, in this case two (Lipschitz exponents and wavelet coefficient gradients), which can be regarded as one kind of motion-related feature pertaining to the image. In this unit 15, the histograms HLC, HG can be used to derive further motion-related features Fm. For instance, the ratio of the total number of positive values to the total number of zero or negative values in a histogram HLC, HG can be supplied as a motion-related feature Fm for that histogram HLC, HG. Equally, a histogram HLC, HG itself can be used as a motion-related feature Fm. The combining unit 15 might comprise a separate analysis unit for performing any required histogram analysis, but is shown as a single entity in the illustration.
The units and blocks 2, 5, 6, 15 described up to this point constitute a system 3 for determining motion-related features, which forms the front end of the system 4 for performing gesture classification.
The histograms HLC, HG are forwarded to a second combining unit 8, where they are collected and combined with previous histograms derived from image data of previous images in the sequence of images, or virtual image 'volume', generated by the camera 2 of the user 1 performing a gesture. The output of the second combining unit 8 is then a set of volume histograms VHLC, VHG, VHLC', VHG', where the volume histograms VHLC, VHG correspond to specific areas of interest identified in the sequence of images generated by the camera 2, and the complementary volume histograms VHLC', VHG' correspond to complementary areas in the sequence of images.
In the second combining unit 8, the volume histograms VHLC, VHG, VHLC', VHG', corresponding to certain postures made by the user 1 while performing the gesture, are processed to obtain position-characteristic features Fp, such as the distances between the volume histograms VHLC, VHG, or the distances between a volume histogram VHLC, VHG and its complementary volume histogram VHLC', VHG'.
The volume histograms VHLC, VHG, VHLC', VHG' and the position-characteristic features Fp are forwarded to a classification unit 7, where they are analysed to classify the gesture made by the user. To this end, prototype position or posture information 13, including prototype posture histograms PHi, PH2, PH3, PH4, is retrieved from a database 9 of candidate gestures. A volume histogram VHLC, VHG, VHLC', VHG' can be compared in some suitable manner to a corresponding prototype histogram PHi, PH2, PH3, PH4 to determine a number of candidate gestures. By comparing successive volume histograms, obtained over time as the user moves through the postures of the gesture he is making, the number of candidate gestures from the database can be narrowed down, until the most likely candidate gesture is identified. The result of the classification can be forwarded as a suitable signal 10 to a further processing block 11, for example a gesture interpretation module. Although the second combining unit 8 is shown to be part of the system 4 for performing motion classification, this second combining unit 8 could also conceivably be integrated in a system 3 for determining motion-related features.
Fig. 6 shows a block diagram of the main steps in the method of motion classification according to the invention, for an embodiment based on the analysis of blur-related first-order features. In a first step 500, images 600 are obtained from an image source such as a camera. Edge-detection is performed on the images 600 in an edge-detection block 501, for example by performing a wavelet transform on the image data, to give blur-related information 601, such as a set of wavelet coefficients 601. These are processed in a feature extraction block 502 to give a set of first-order features 602 for each set of wavelet coefficients 601. In a histogram compilation block 503, statistical values for the first-order features 602 are computed, and a number of histograms or volume histograms 603 for the first-order features for an image, a specific area of an image, a sequence of images, or a sequence of specific areas of an image are compiled. The histograms 603 are input to a histogram analysis block 504, in which motion-related features or position-characteristic features 604 are derived from the histograms 603. These features 604 in turn are input to a motion classification block 505, where they are used to determine the motion to which the corresponding position belongs. Once the classification procedure is completed, an output signal 605 can indicate the identified motion, e.g. a gesture made by a user, or whether the motion classification has failed.
Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention. The invention can be used in any system in which a simple camera and a computer with sufficient computational resources are available. For example, the method and system for gesture recognition according to the invention could be used to recognise sign-language gestures and to convert these to text. Another implementation might be a surveillance system to classify and recognise different types of moving objects such as people, cars, etc. Another possible implementation might be in automated production processes, where the system and method according to the invention could be used to track items being moved and to control different stages of the production process.
For the sake of clarity, it is to be understood that the use of "a" or "an" throughout this application does not exclude a plurality, and "comprising" does not exclude other steps or elements. A "unit" or "module" can comprise a number of units or modules, unless otherwise stated.

Claims

CLAIMS:
1. A method of determining motion-related features (Fm) pertaining to the motion of an object (1), which method comprises obtaining a sequence of images (fi, f2, ..., fn) of the object (1); processing at least parts of the images (fi, f2, ..., fn) to extract a number of first-order features from the images (fi, f2, ..., fn); computing a number of statistical values pertaining to the first-order features; combining the statistical values to give a number of histograms (HLC, HG); and determining the motion-related features (Fm) for the object (1) based on the histograms (HLC, HG).
2. A method according to claim 1, wherein the first-order features comprise blur-related features.
3. A method according to claim 2, wherein extraction of blur-related features from an image (fi, f2, ..., fn) comprises performing a wavelet transform on image data to determine a number of wavelet coefficients, and processing the wavelet coefficients to give the blur-related features.
4. A method according to any of the preceding claims, wherein first-order features are extracted for a volume (V) given by the sequence of images (fi, f2, ..., fn) over time.
5. A method according to any of the preceding claims, wherein a sequence of specific areas (Ai, A2, ..., An) is identified in the sequence of images (fi, f2, ..., fn), and first-order features are extracted for a sub-volume (Vs) given by the sequence of specific areas (Ai, A2, ..., An) over time.
6. A method according to claim 4 or claim 5, wherein statistical values pertaining to a certain kind of first-order feature, extracted from a sequence of images (fi, f2, ..., fn) or from a sequence of specific areas (Ai, A2, ..., An) in a sequence of images (fi, f2, ..., fn), are combined in a volume histogram (VHLC, VHG) for this first- order feature.
7. A method according to claim 6, wherein a sequence of complementary areas (Ai', A2', ..., An') is identified in a sequence of images (fi, f2, ..., fn), where a complementary area (Ai', A2', ..., An') in an image (fi, f2, ..., fn) is complementary to the specific area (Ai, A2, ..., An) in that image (fi, f2, ..., fn), and statistical values pertaining to a certain kind of first-order feature extracted from the sequence of complementary areas (Ai', A2', ..., An') are combined in a complementary volume histogram (VHLC', VHG') for this first-order feature.
8. A method according to any of claims 5 to 7, wherein the specific area (Ai, A2, ..., An) and/or the complementary area (Ai', A2', ..., An') of an image (fi, f2, ..., fn) are identified by means of the first-order features determined for that image (fi, f2, ..., fn).
9. A method of performing motion classification for the motion of an object (1) captured in a sequence of images (fi, f2, ..., fn), which method comprises determining motion-related features (Fm) for the object (1) in the images (fi, f2, ..., fn) using the method of any of claims 1 to 8, and using the motion-related features (Fm) to classify the motion of the object (1).
10. A method according to claim 9, wherein volume histograms (VHLC, VHG, VHLC', VHG') are generated using the method according to claims 6 or 7 to obtain position-characteristic features (Fp), and these position-characteristic features (Fp) are analysed to classify the motion of the object (1).
11. A method according to claim 10, wherein a distance between a volume histogram (VHLC, VHG) and a complementary volume histogram (VHLC', VHG') is calculated to obtain position-characteristic features (Fp), and these position-characteristic features (Fp) are analysed to classify the motion of the object (1).
12. A method according to claims 10 or 11, wherein the position- characteristic features are used to identify a state (Si, S2, S3, S4) in a state-transition model for a motion, which state (Si, S2, S3, S4) corresponds to a particular position in that motion.
13. A system (3) for determining motion-related features (Fm) pertaining to an object (1), which system comprises an image source (2) for providing a sequence of images (fi, f2, ..., fn) of the object (1); a processing unit (5) for processing at least parts of the images (fi, f2, ..., fn) to extract a number of first-order features from the images (fi, f2, ..., fn); a computation unit (6) for computing a number of statistical values pertaining to the first-order features; a combining unit (15) for combining the statistical values into a number of histograms (HLC, HG) and for determining the motion-related features (Fm) based on the histograms (HLC, HG).
14. A system (4) for performing motion classification for the motion of an object (1), comprising a system (3) according to claim 13 for determining motion-related features (Fm) pertaining to the object, and a classification unit (7) for classifying the motion of the object (1) using the motion-related features (Fm).
15. A computer program product, directly loadable into the memory of a programmable device for use in a system for determining motion-related features pertaining to an object and/or a system for performing motion classification for the motion of an object, comprising software code portions for performing the steps of a method according to claims 1 to 12 when said computer program product is run on the device.
PCT/IB2008/051843 2007-05-15 2008-05-09 Method of determining motion-related features and method of performing motion classification WO2008139399A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07108267 2007-05-15
EP07108267.1 2007-05-15

Publications (2)

Publication Number Publication Date
WO2008139399A2 true WO2008139399A2 (en) 2008-11-20
WO2008139399A3 WO2008139399A3 (en) 2009-04-30

Family

ID=40002714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/051843 WO2008139399A2 (en) 2007-05-15 2008-05-09 Method of determining motion-related features and method of performing motion classification

Country Status (2)

Country Link
TW (1) TW200910221A (en)
WO (1) WO2008139399A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010086866A1 (en) * 2009-02-02 2010-08-05 Eyesight Mobile Technologies Ltd. System and method for object recognition and tracking in a video stream
EP2428870A1 (en) * 2010-09-13 2012-03-14 Samsung Electronics Co., Ltd. Device and method for controlling gesture for mobile device
JP2014174776A (en) * 2013-03-11 2014-09-22 Lenovo Singapore Pte Ltd Recognition method of action of moving object and portable type computer
US8890803B2 (en) 2010-09-13 2014-11-18 Samsung Electronics Co., Ltd. Gesture control system
JP2015231517A (en) * 2014-05-13 2015-12-24 オムロン株式会社 Posture estimation device, posture estimation system, posture estimation method, posture estimation program, and computer-readable recording medium recording posture estimation program
CN107735813A (en) * 2015-06-10 2018-02-23 柯尼卡美能达株式会社 Image processing system, image processing apparatus, image processing method and image processing program
EP3309747A4 (en) * 2015-06-11 2018-05-30 Konica Minolta, Inc. Motion detection system, motion detection device, motion detection method, and motion detection program
US10096118B2 (en) 2009-01-13 2018-10-09 Futurewei Technologies, Inc. Method and system for image processing to classify an object in an image

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI415032B (en) * 2009-10-30 2013-11-11 Univ Nat Chiao Tung Object tracking method
US9165187B2 (en) * 2012-01-12 2015-10-20 Kofax, Inc. Systems and methods for mobile image capture and processing
TWI779454B (en) * 2021-01-08 2022-10-01 財團法人資訊工業策進會 Motion recognition apparatus and method thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007004100A1 (en) * 2005-06-30 2007-01-11 Philips Intellectual Property & Standards Gmbh A method of recognizing a motion pattern of an obejct

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007004100A1 (en) * 2005-06-30 2007-01-11 Philips Intellectual Property & Standards Gmbh A method of recognizing a motion pattern of an obejct

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ROOMS F ET AL: "Estimating image blur in the wavelet domain" INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNALPROCESSING, XX, XX, 8 July 2002 (2002-07-08), pages 1-6, XP002384823 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10096118B2 (en) 2009-01-13 2018-10-09 Futurewei Technologies, Inc. Method and system for image processing to classify an object in an image
US20110291925A1 (en) * 2009-02-02 2011-12-01 Eyesight Mobile Technologies Ltd. System and method for object recognition and tracking in a video stream
KR20110138212A (en) * 2009-02-02 2011-12-26 아이사이트 모빌 테크놀로지 엘티디 System and method for object recognition and tracking in a video stream
CN102356398A (en) * 2009-02-02 2012-02-15 视力移动技术有限公司 System and method for object recognition and tracking in a video stream
JP2012517044A (en) * 2009-02-02 2012-07-26 アイサイト モバイル テクノロジーズ リミテッド System and method for object recognition and tracking in video streams
WO2010086866A1 (en) * 2009-02-02 2010-08-05 Eyesight Mobile Technologies Ltd. System and method for object recognition and tracking in a video stream
KR101632963B1 (en) 2009-02-02 2016-06-23 아이사이트 모빌 테크놀로지 엘티디 System and method for object recognition and tracking in a video stream
US9405970B2 (en) * 2009-02-02 2016-08-02 Eyesight Mobile Technologies Ltd. System and method for object recognition and tracking in a video stream
US20160343145A1 (en) * 2009-02-02 2016-11-24 Eyesight Mobile Technologies Ltd. System and method for object recognition and tracking in a video stream
EP2428870A1 (en) * 2010-09-13 2012-03-14 Samsung Electronics Co., Ltd. Device and method for controlling gesture for mobile device
US8890803B2 (en) 2010-09-13 2014-11-18 Samsung Electronics Co., Ltd. Gesture control system
JP2014174776A (en) * 2013-03-11 2014-09-22 Lenovo Singapore Pte Ltd Recognition method of action of moving object and portable type computer
JP2015231517A (en) * 2014-05-13 2015-12-24 オムロン株式会社 Posture estimation device, posture estimation system, posture estimation method, posture estimation program, and computer-readable recording medium recording posture estimation program
EP3309748A4 (en) * 2015-06-10 2018-06-06 Konica Minolta, Inc. Image processing system, image processing device, image processing method, and image processing program
CN107735813A (en) * 2015-06-10 2018-02-23 柯尼卡美能达株式会社 Image processing system, image processing apparatus, image processing method and image processing program
EP3309747A4 (en) * 2015-06-11 2018-05-30 Konica Minolta, Inc. Motion detection system, motion detection device, motion detection method, and motion detection program

Also Published As

Publication number Publication date
WO2008139399A3 (en) 2009-04-30
TW200910221A (en) 2009-03-01

Similar Documents

Publication Publication Date Title
WO2008139399A2 (en) Method of determining motion-related features and method of performing motion classification
Chen et al. Repetitive assembly action recognition based on object detection and pose estimation
Wu et al. Recent advances in video-based human action recognition using deep learning: A review
US11237637B2 (en) Gesture recognition systems
Hasan et al. RETRACTED ARTICLE: Static hand gesture recognition using neural networks
US6072494A (en) Method and apparatus for real-time gesture recognition
EP2344980B1 (en) Device, method and computer program for detecting a gesture in an image, and said device, method and computer program for controlling a device
US7308112B2 (en) Sign based human-machine interaction
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
KR100421740B1 (en) Object activity modeling method
CN110826389B (en) Gait recognition method based on attention 3D frequency convolution neural network
CN108416780B (en) Object detection and matching method based on twin-region-of-interest pooling model
CN107480585B (en) Target detection method based on DPM algorithm
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
Moore A real-world system for human motion detection and tracking
CN110717154A (en) Method and device for processing characteristics of motion trail and computer storage medium
CN112668492A (en) Behavior identification method for self-supervised learning and skeletal information
KR20120089948A (en) Real-time gesture recognition using mhi shape information
EP2790130A1 (en) Method for object recognition
Yu et al. Human motion recognition based on neural network
Świtoński et al. Human identification based on the reduced kinematic data of the gait
KR100679645B1 (en) Gesture recognition method using a virtual trellis plane
CN112926522A (en) Behavior identification method based on skeleton attitude and space-time diagram convolutional network
Vo et al. Dynamic gesture classification for Vietnamese sign language recognition
Hoque et al. Computer vision based gesture recognition for desktop object manipulation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08751191

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08751191

Country of ref document: EP

Kind code of ref document: A2