WO2008053433A2

WO2008053433A2 - Hand gesture recognition by scanning line-wise hand images and by extracting contour extreme points

Info

Publication number: WO2008053433A2
Application number: PCT/IB2007/054412
Authority: WO
Inventors: Alexander A. Danilin; Yannick Bihan
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2006-11-02
Filing date: 2007-10-31
Publication date: 2008-05-08
Also published as: WO2008053433A3; JP2010509651A; CN101536032A

Abstract

Apparatus and a method for image recognition are disclosed, in which an image is electronically scanned line-by-line. Each line of pixels is processed and pixels which qualify according to both tonal and positional criteria are selected as contour points. A set of contour points is compared with a stored reference to determine the nature of the image.

Description

IMAGE RECOGNITION APPARATUS AND METHOD

FIELD OF THE INVENTION

The present invention relates to apparatus and a method for image recognition and is concerned particularly, although not exclusively, with apparatus and a method for the recognition of a human hand shape.

BACKGROUND OF THE INVENTION

Apparatus for the recognition of objects, such as human hand shapes, are well known and several prior systems exist which aim to recognize different hand signals using cameras and electronic processing apparatus. Such prior systems typically require a relatively large amount of memory and/or involve relatively intensive computations in order to distinguish between different hand signs. Because of this they consume relatively large amounts of power.

One such prior system for recognizing an image of a human hand is described in Japanese patent number JP2003-346162. This describes a computer input system that aims to recognize which finger extends from a human hand by the electronic processing of an image of the hand. The technique requires the detection of a hand, based on a skin tone recognition procedure. A detailed polygonal shape, described by points and angles, is built up and is used to determine how many fingers are raised. The calculations necessary to process the image are numerous and thus the electronic processing capacity and memory of the apparatus used in this technique are both necessarily relatively large, as is its power consumption.

SUMMARY OF THE INVENTION

Embodiments of the present invention aim to provide a robust technique for recognizing the shape of an object, such as a human hand, which requires relatively little electronic processing power and memory, and involves low power consumption, and which may therefore be suitable for applications which use wireless "smart camera" apparatus.

Smart cameras, i.e. cameras with built-in processing capability, process the raw image data locally and send only keywords of information to a host system by wireless transmission. The inventors found that this consumes less power than broadcasting live video to an analyzing host computer.

In accordance with the invention an image is electronically scanned line-by-line. The lines of pixels are processed and pixels which qualify according to both tonal and positional criteria are selected as contour points. A set of contour points is compared with a stored reference to determine the nature of the image.

According to one aspect of the present invention there is provided image recognition apparatus comprising an image sensor, a first electronic processor, a second electronic processor, and a memory,

- wherein the first electronic processor is a parallel video processor comprising a plurality of line memories,

- wherein the first electronic processor sweeps horizontally line by line an image sensed by the image sensor,

- each line of pixels is processed and stored in one of the line memories,

- and wherein each pixel in a line stored in a line memory is first compared with a qualification criterion based upon its tone, and, for each line after the first line in which a tonally qualified pixel was detected, the first and last qualifying pixels on each line are compared with positional criteria, relative to tonally qualified pixels on a previous line,

- pixels which qualify according to both their tone and position are selected as contour points, and wherein the apparatus is arranged to store contour points, and a set of the stored contour points is processed by the second electronic processor which compares them with stored information to determine the nature of the image as described by the contour points.

The sensor registers the image comprising a plurality of pixels organized on a line-by-line basis. The image is communicated to the first electronic processor. The image is horizontally swept line by line by the first electronic processor in order to detect the presence of the hand in the image based on the tone of the pixels. For each of the lines, when a tonally qualified pixel is detected, the first and last qualifying pixels are validated as the contour points. The points of the contour are further provided to the second electronic processor, which compares the received contour points with the stored information in order to determine the nature of the image as described by the contour points.

The advantage of such image recognition apparatus is that it requires only the most significant points of the contour in order to determine the nature of the image. The determination of the contour points takes advantage of the line organization of the memory and of the parallel video processor by processing the received image in a line-by-line manner. Since the apparatus processes the image line by line in a parallel way, power consumption can be kept to a minimum. In a preferred arrangement the apparatus comprises a wireless camera, as the image sensor, with embedded first and second electronic processors.

Using a wireless camera as the image sensor is advantageous because the camera can be positioned independently of the availability of a power supply in its vicinity.

In a preferred arrangement the tonal qualification criterion for a pixel is that the pixel has a value within a predetermined range of values, which is indicative of a skin tone, in UV color space. Such a tonal qualification criterion is a simple and convenient way of determining the presence of the hand in the image.

In a preferred arrangement the apparatus is arranged to recognize images of a human hand.

In a preferred arrangement the camera comprises at least one filter. This is for the purpose of filtering skin color pixels.

The invention also provides a method of electronically identifying an image, the method comprising the steps of:

- sweeping the image horizontally as a number of lines;

-for each swept line of the image, detecting which pixels qualify according to a predetermined tonal criterion;

- for each line after the first in which a tonally qualified pixel was detected, determining which are the minimum and maximum qualifying pixels in terms of their position on the line;

- comparing qualified minimum and maximum pixels with counterparts on a preceding line to determine which pixels qualify according to a positional criterion;

- storing pixels, which are qualified according to both tonal and positional criteria, in a memory as contour points; and

- processing a set of contour points by comparing them with stored contours to identify the nature of the image.

In a preferred arrangement the first contour point is taken as the first detected pixel that qualifies according to the tonal criterion. Preferably, a pixel is considered to meet the positional criterion when a difference in its position, as compared with a corresponding maximum or minimum value pixel from a preceding line, falls within a predetermined range.

The method comprises a method of recognizing an image of a human hand, wherein a tonally qualified pixel may be one for which its value lies within a predetermined range of values of UV color space.

Preferably the method comprises comparing contours with a set of stored contours each of which corresponds to a different hand shape or sign.
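The comparison with stored contours can be illustrated with a minimal sketch. The description does not specify a matching metric, so everything here is an assumption made for illustration: the bounding-box normalisation, the sum-of-squared-distances score, and the names `normalise`, `distance`, `best_match` and the reference labels are all hypothetical.

```python
def normalise(contour):
    """Scale (x, y) points into the unit square of the contour's bounding box."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    w = (max(xs) - min(xs)) or 1  # avoid division by zero for degenerate contours
    h = (max(ys) - min(ys)) or 1
    return [((x - min(xs)) / w, (y - min(ys)) / h) for x, y in contour]


def distance(a, b):
    """Sum of squared point distances; contours of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))


def best_match(contour, references):
    """Return the name of the stored contour closest to the observed one."""
    probe = normalise(contour)
    return min(references, key=lambda name: distance(probe, normalise(references[name])))


# Hypothetical stored contours, one per hand sign (labels are illustrative only).
references = {
    "sign_a": [(2, 0), (0, 4), (4, 4)],
    "sign_b": [(0, 0), (4, 0), (4, 1)],
}
observed = [(20, 10), (10, 50), (30, 50)]  # same shape as sign_a at another scale
print(best_match(observed, references))  # sign_a
```

Normalising by the bounding box makes the comparison independent of how large the hand appears in the image, which is one plausible reading of a contour being described by ratios rather than absolute coordinates.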

The method comprises determining a sequence of hand shapes or signs in order to identify a hand gesture.
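As a hedged illustration of deriving a gesture from a sequence of recognized hand signs, the sketch below collapses repeated per-frame results and looks the remaining sequence up in a table. The sign names, the gesture names and the table itself are hypothetical; the description does not define any particular gestures.

```python
from itertools import groupby

# Hypothetical gesture table: sign and gesture names are illustrative only.
GESTURES = {
    ("fist", "open_hand"): "release",
    ("open_hand", "fist"): "grab",
}


def collapse_repeats(signs):
    """Merge runs of identical per-frame signs into a single entry."""
    return tuple(sign for sign, _ in groupby(signs))


def identify_gesture(signs, gestures=GESTURES):
    """Return the gesture whose sign sequence matches, or None if there is none."""
    return gestures.get(collapse_repeats(signs))


frames = ["open_hand", "open_hand", "fist", "fist", "fist"]
print(identify_gesture(frames))  # grab
```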

BRIEF DESCRIPTION OF DRAWINGS

A preferred embodiment of the present invention will now be described by way of example only with reference to the accompanying drawings in which:

Figure 1 is a schematic view of image recognition apparatus according to an embodiment of the present invention;

Figure 2 is a schematic flow diagram showing a method of acquiring contour points from a scanned image;

Figures 3a and 3b show schematically examples of a line-scan technique for different images;

Figure 4 shows an image contour derived from the image scan shown in Figure 3a; and

Figure 5 shows a plurality of images and their corresponding image contours, derived by a technique according to the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT

The embodiment described herein uses a wireless "smart camera" to detect images of a human hand. Power consumption must be kept to a minimum to prolong battery life. Accordingly, a parallel processing architecture is used since this keeps to a minimum the number of memory accesses, the clock speed and the decoding of instructions. Processing the image data using a parallel video processor in the wireless camera is more power efficient than transmitting raw captured data to a fixed device.

The smart camera basically consists of four components: one or two image sensors, an SIMD (Single Instruction Multiple Data) processor for low level image processing, a general purpose processor for intermediate and high level processing and control, and a communication module. Both processors are coupled using a dual ported RAM that enables them to work in a shared workspace at their own processing speed.

Figure 1 shows schematically the architecture of an embodiment of image recognition apparatus according to the invention. In this case the apparatus comprises a smart camera, shown generally at 10. The apparatus comprises a sensor 12, a video processor 14, a general processor 16, a dual ported RAM (DPRAM) 18, an EEPROM 20, a wireless communication subsystem 22, an inter-chip control device (I2C) 24, connecting the video processor 14 and the general processor 16, DPRAM buses 26 and 28, an EEPROM bus 30, and UART (or other serial alternative) bus 32.

The video processor 14 comprises a linear processor array (LPA) and a plurality of line memories (not shown).

The video processor is a parallel processor and may be realized by an IC3D, which is a member of the Philips Xetal family of SIMD processors. As the general processor an Atmel 8051 device may be used.

The heart of the video processor 14 is formed by the linear processor array (LPA) with 320 Reduced Instruction Set Computer (RISC) processors. Each of these processors has simultaneous read and write access within one clock cycle to memory positions in the LPA. Both the memory address and the instruction are shared among the processors in the SIMD sense. Each processor in the LPA can also read the memory data of its left and right neighbors directly. At the extremes of the linear array the inputs of these processors are optionally coupled or mirrored. The LPA processors can handle up to 64 instructions ranging from arithmetic and single cycle multiply-accumulate to compound instructions. In addition to these there are conditional guarding instructions enabling data dependent operations. Data paths are 10 bits wide. Each processor has two word registers and a flag register.

The line memories in the video processor store 64 lines of 3,200 bits. Pixels of the image lines are placed in an interlaced way on this memory.
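The figures quoted above can be checked with a short calculation. The processor count, word width and number of line memories come from the description; the 640-pixel VGA line length and the reading that each processor therefore handles two pixels per line are assumptions made for this sketch.

```python
PROCESSORS = 320   # LPA width, from the description
WORD_BITS = 10     # data path width, from the description
LINES = 64         # number of line memories, from the description
VGA_WIDTH = 640    # assumption: standard VGA line length in pixels

bits_per_line = PROCESSORS * WORD_BITS         # one 10-bit word per processor
total_bytes = LINES * bits_per_line // 8       # capacity of all line memories
pixels_per_processor = VGA_WIDTH // PROCESSORS

print(bits_per_line)         # 3200
print(total_bytes)           # 25600
print(pixels_per_processor)  # 2
```

Under these assumptions, one line memory holds exactly one 10-bit word per processor, which is consistent with the stated 3,200 bits per line, and a 640-pixel line would need two such words per processor, which may be what the interlaced placement refers to.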

The peak pixel performance of the video processor is around 50 GOPS (Giga Operations Per Second). Despite its high pixel performance the device is inherently a low power processor: not only is instruction decoding shared between all 320 processing elements, but memory access is also on ultra-wide memory words that contain complete image lines, instead of energy-consuming access to multiple-pixel-wide memory locations. For typical applications such as feature finding, the power consumption is measured to be below 100 mW in active processing modes.

Programs for the video processor 14 are stored in the EEPROM and can be uploaded from the general processor 16 via the I²C 24. The general processor 16 can load a program into the video processor for a specific task that has to be carried out for an image.

The software for the device 10 consists of three parts that are almost independently developed. Programs for the video (parallel) processor 14 are written in a C++ language with implicit parallel data types. All programs are written in a line-based manner where complete image lines are processed in single clock cycle instructions. By guarding constructions, data adaptive software structures can be implemented. Typical functions, which can run on this processor, are image improvement, motion analysis, object detection and tracking algorithms. The programs on the general processor 16 are dedicated to keep track of the object data over time. The general processor performs the host function (running the operating system) and can decide to transmit events to a host system via the communications subsystem 22.

The purpose of the video processor program is, in this embodiment, to detect "contour points" of a hand and to store them in the DPRAM.

The video processor receives information from one or two VGA sensors 12 (only one in this embodiment) on four channels in a YUYV format (depicted by the element 34 in the figure). Other formats are also possible for communicating the image from the sensor to the video processor. The first step consists in filtering skin color pixels. Low-pass or median filters with appropriate thresholds are employed in order to remove noise from the detected image, because the next step requires a very robust detection with minimal noise. In this embodiment the video processor is an SIMD processor, so it can process the image only a complete line at a time and not pixel by pixel.

After detecting a hand, a line sweep technique is used to build an object contour, which in this embodiment is a hand contour. The technique involves sweeping a horizontal line across the image, keeping track of certain data, and performing certain actions every time a certain event is encountered during the line sweep. Pixels which qualify as contour points are stored in the DPRAM 18. The contour thus derived is then analyzed by the general processor 16, by comparison with stored reference contours, in order to determine the nature of the object, or in this particular case to determine the nature of a hand sign.

When pixels are processed initially by the processor 14 a determination is made as to which pixels qualify, due to their tone, as image pixels (i.e. of the object in question). In the present embodiment the tone which is of interest is a skin tone, and accordingly pixels whose value falls within a predetermined range of values appropriate to skin color in UV color space are selected as pixels which qualify tonally.
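The skin color filtering step can be sketched as follows for a single image line. This is an illustration under stated assumptions, not the embodiment's program: the packed byte order Y0 U Y1 V, the numeric U and V bounds, and the 3-tap majority filter standing in for the median filter are all assumptions, and the function names are hypothetical. The description gives no threshold values.

```python
# Illustrative placeholder bounds in UV space; the description gives no numbers.
U_RANGE = (77, 127)
V_RANGE = (133, 173)


def skin_mask_from_yuyv(line, u_range=U_RANGE, v_range=V_RANGE):
    """Return one boolean per pixel for a packed YUYV (Y0 U Y1 V) line."""
    mask = []
    for i in range(0, len(line) - 3, 4):
        u, v = line[i + 1], line[i + 3]
        ok = u_range[0] <= u <= u_range[1] and v_range[0] <= v <= v_range[1]
        mask.extend([ok, ok])  # both pixels of a pair share the same U and V
    return mask


def median3(mask):
    """3-tap median of a boolean line: keep a value only if 2 of 3 neighbours agree."""
    padded = [False] + list(mask) + [False]
    return [sum(padded[i:i + 3]) >= 2 for i in range(len(mask))]


non_skin = bytes([90, 200, 90, 50])
skin = bytes([90, 100, 90, 150])
line = non_skin + skin + skin + non_skin

mask = skin_mask_from_yuyv(line)
print(mask)                                   # [False, False, True, True, True, True, False, False]
print(median3([False, True, False, False]))   # [False, False, False, False]
```

The luma bytes are ignored entirely, which reflects the stated use of UV space for the tonal criterion; the median pass removes isolated single-pixel detections while leaving wider runs untouched.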

Figure 2 shows schematically a method of building a set of contour points defining the image, according to this embodiment of the invention.

At a step 100, a line of pixels is read into the processor 14 and at step 110 a determination is made as to whether the line of pixels contains a tonally qualified pixel (in this case a skin tone pixel).

If no tonally qualified pixel is detected then the next line of pixels is read into the processor.

If a tonally qualified pixel is detected, a determination is made at step 120 as to whether the pixel is the first such pixel to be detected. If so the pixel with the corresponding coordinates is stored as a contour point at step 160.

If the pixel is not the first such pixel then at step 130 the left most (MinX) and right most (MaxX) tonally qualified pixels on the line are obtained.

At step 135 a noise reduction process is performed. At step 140 these pixels (MinX and MaxX) are compared with their counterparts from the previously considered line.

At step 150 a determination is made as to whether the pixels qualify positionally.

If neither of the pixels meets the positional qualification criteria then the next line of pixels is read into the processor 14 at step 100.

If either or both of the pixels meet the positional qualification criteria then it or they are stored as a contour point at step 160 and the next line of pixels is read in.
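The flow of Figure 2 can be sketched as a loop over per-line skin masks. This is a minimal sketch under assumptions: the positional test is passed in as a caller-supplied function `qualifies` (hypothetical name), the noise reduction of step 135 is omitted because the text does not specify it, and when the first qualifying line contains several pixels the leftmost one is taken as the first contour point.

```python
def sweep_contour(mask_lines, qualifies):
    """Collect contour points (x, y) from boolean line masks, top line first.

    qualifies(prev_x, cur_x) is a caller-supplied positional test.
    """
    contour = []
    prev = None  # (MinX, MaxX) of the previously considered line
    for y, mask in enumerate(mask_lines):             # step 100: read next line
        xs = [x for x, hit in enumerate(mask) if hit]
        if not xs:                                    # step 110: no skin pixel
            continue
        min_x, max_x = xs[0], xs[-1]                  # step 130: MinX and MaxX
        if prev is None:                              # step 120: first pixel found
            contour.append((min_x, y))                # step 160: store it
        else:
            # step 135 (noise reduction) is not specified and is omitted here
            if qualifies(prev[0], min_x):             # steps 140-150 for MinX
                contour.append((min_x, y))
            if max_x != min_x and qualifies(prev[1], max_x):  # and for MaxX
                contour.append((max_x, y))
        prev = (min_x, max_x)
    return contour


rows = [
    ".....",
    "..#..",
    "..#..",
    "#.#..",
    "#.#.#",
]
masks = [[c == "#" for c in row] for row in rows]
# Assumed positional test: the extreme pixel moved by at least two columns.
print(sweep_contour(masks, lambda prev_x, cur_x: abs(prev_x - cur_x) >= 2))
# [(2, 1), (0, 3), (4, 4)]
```

In this toy image a point is stored for the first skin pixel and then only where the left or right extreme jumps, so lines that merely continue an existing edge add nothing to the contour.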

Figure 3 shows an example of the line sweep technique and contour points. In Figure 3a the line begins sweeping from the top of the screen (Line = Line 0) and moves down keeping track of MinX and MaxX values where MinX and MaxX are the minimum and maximum X coordinates of tonally qualified pixels in the line (i.e. skin tone pixels in the line). For the first contour point E₁ (X₁, Y₁) Line = Y₁ and MinX == MaxX == X₁. For the second point E₂ (X₂, Y₂) Line = Y₂ and MinX == X₁ and MaxX == X₂ and so on. The table below shows an example of the minimum and maximum values of X for each of the contour points.

If ΔMinX = MinX - Current MinX and ΔMaxX = MaxX - Current MaxX, where Current MaxX and Current MinX are respectively the maximum and minimum values of X coordinates for skin tone pixels, then a contour point is generated only when ΔMinX is within a predetermined range (δ₁, δ₂) or ΔMaxX is within the predetermined range (δ₁, δ₂). The reason for imposing this positional qualification criterion is that pixels can be ignored when they are too close to contour points on a preceding line. Such pixels may, for example, indicate merely the curvature of the hand and are not needed to form the uniquely identifying contour of the hand. This is with the exception of the first contour point E₁. Contour point E₁ is generated for the very first skin tone pixel to be detected. Using this approach a contour C (E₁ ... Eₙ) can be built.
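A small worked example of the positional criterion follows. It assumes that MinX and MaxX are the values from the preceding line, that the range test applies to the magnitude of each difference, and that the bounds are inclusive; none of these details is stated in the text, and the numeric values and function names are hypothetical.

```python
def positional_deltas(prev_min_x, prev_max_x, cur_min_x, cur_max_x):
    """Return the two differences as previous-line value minus current-line value."""
    return prev_min_x - cur_min_x, prev_max_x - cur_max_x


def in_range(delta, d1, d2):
    """True when the magnitude of delta lies in [d1, d2] (inclusive bounds assumed)."""
    return d1 <= abs(delta) <= d2


# Previous line spans X = 40..60; current line spans X = 12..61.
d_min, d_max = positional_deltas(40, 60, 12, 61)
print(d_min, d_max)             # 28 -1
print(in_range(d_min, 5, 100))  # True: the left extreme jumped, so store a point
print(in_range(d_max, 5, 100))  # False: the right extreme barely moved, so ignore it
```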

Figure 4 shows the contour derived from the five finger hand sign depicted in Figure 3. Figure 5 shows a number of other hand signs and their contours, which may be derived in the above described manner.

The most important thing is that for every hand sign shown in Figure 5, its contour is distinguishable from all other contours.

The X and Y ratios describe every contour. For example in Figure 4, if Y₅ is greater than Y₄ then this is a right hand, otherwise it is a left hand. Different hand signs correspond to different X and Y ratios.
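The right hand versus left hand rule for the five finger contour of Figure 4 can be written out directly. The sketch assumes the contour is a list of (x, y) points in the order E₁ ... Eₙ; the function name and the sample coordinates are hypothetical.

```python
def handedness(contour):
    """Apply the rule 'Y5 greater than Y4 means right hand' to points E1..En."""
    if len(contour) < 5:
        raise ValueError("rule needs at least five contour points")
    y4 = contour[3][1]  # Y coordinate of E4
    y5 = contour[4][1]  # Y coordinate of E5
    return "right" if y5 > y4 else "left"


# Illustrative coordinates only; they are not taken from the figures.
sample = [(50, 10), (35, 14), (65, 16), (22, 25), (80, 48)]
print(handedness(sample))  # right
```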

A sequence of hand shapes may be used to determine a hand gesture.

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments.

Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.

Claims

CLAIMS:

1. Image recognition apparatus (10) comprising an image sensor (12), a first electronic processor (14), a second electronic processor (16), and a memory, wherein the first electronic processor (14) is a parallel video processor comprising a plurality of line memories;

- wherein the first electronic processor is arranged to sweep horizontally line by line an image sensed by the image sensor (12),

- each line of pixels is processed by the video processor and stored in one of the line memories,

- and wherein each pixel in a line stored in a line memory is first compared with a qualification criterion based upon its tone, and, for each line after the first line in which a tonally qualified pixel was detected, the first and last qualifying pixels on each line are compared with positional criteria, relative to tonally qualified pixels on a previous line,

- pixels which qualify according to both their tone and position are selected as contour points,

- and wherein the apparatus is arranged to store contour points, and a set of the stored contour points is processed by the second electronic processor which compares them with stored information to determine the nature of the image as described by the contour points.

2. Image recognition apparatus according to Claim 1 comprising a wireless camera as the image sensor (12), with embedded first (14) and second (16) electronic processors.

3. Image recognition apparatus according to Claim 1 or Claim 2, wherein the tonal criterion qualification of a pixel is that the pixel has a value within a predetermined range of values, which is indicative of a skin tone, in UV color space.

4. Image recognition apparatus according to Claim 3 wherein the apparatus is arranged to recognize images of a human hand.

5. Image recognition apparatus according to any of the preceding claims, wherein the sensor comprises at least one filter.

6. A method of electronically identifying an image, the method comprising the steps of:

- electronically sweeping the image horizontally as a number of lines;

- for each swept line of the image, detecting which pixels qualify according to predetermined tonal criteria;

- for each line after the first in which a tonally qualified pixel was detected, determining which are the minimum and maximum qualifying pixels in terms of their position on the line;

- comparing qualified maximum and minimum pixels with counterparts on a preceding line to determine which pixels qualify according to a positional criterion;

- storing pixels, which are qualified according to both tonal and positional criteria, in a memory as contour points; and

- processing a set of contour points by comparing them with stored contours to identify the nature of the image.

7. A method according to Claim 6, wherein the first contour point is taken as the first detected pixel that qualifies according to the tonal criterion.

8. A method according to Claim 6 or Claim 7 wherein a pixel is considered to meet the positional criterion when a difference in its position, as compared with a corresponding maximum or minimum value pixel from a preceding line, falls within a predetermined range.

9. A method according to any of the Claims 6 to 8 wherein a tonally qualified pixel is one for which its value lies within a predetermined range of values of UV color space.

10. A method according to Claim 9 wherein the method comprises a method of recognizing an image of a human hand.

11. A method according to any of Claims 6 to 10, wherein the method further comprises comparing contours with a set of stored contours each of which corresponds to a different hand shape or sign.

12. A computer program comprising instructions for enabling a computer to perform a method according to any of Claims 6 to 11.

13. A computer-readable storage medium having recorded thereon a computer program according to Claim 12.