WO2001084498A2 - Image processing - Google Patents

Image processing

Info

Publication number
WO2001084498A2
Authority
WO
WIPO (PCT)
Prior art keywords
image
detection
classes
logic function
classifying
Prior art date
Application number
PCT/GB2001/001962
Other languages
French (fr)
Other versions
WO2001084498A3 (en)
Inventor
Mark Pawleski
Charles Nightingale
Original Assignee
British Telecommunications Public Limited Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company filed Critical British Telecommunications Public Limited Company
Priority to AU2001256465A priority Critical patent/AU2001256465A1/en
Publication of WO2001084498A2 publication Critical patent/WO2001084498A2/en
Publication of WO2001084498A3 publication Critical patent/WO2001084498A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method
    • G06V30/248Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"


Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method of image recognition comprises a number of image recognition steps, each of which provides a classification result according to the characteristics of the image. Where an image would be identified as falling into more than one category, the invention addresses the problem of choosing a single category for it. Some classes may be defined as combinations of two or more detection determinations, e.g. a class may be for 'cartoons of buildings'.

Description

IMAGE PROCESSING
The present invention is concerned with image processing, and more particularly with - in a broad sense - image recognition. In this specification the term "recognition" means that the image is processed to produce some result which makes a statement about the image.
There are a number of different contexts in which producing such a result can be useful. For example, if the image is of a single object, it may be desired to identify the object as being a specific one of a number of similar objects: the recognition of a human face as being that of a particular person whose picture is stored in a reference database would fall into this category. Alternatively, it may be desired to identify an image as containing one or more pictures of objects and to classify it according to the nature of those objects. Thus the automation of the process of indexing or retrieval of images from a database could be facilitated by such recognition, particularly where a large database (or a large number of databases, as in the case of internet searches) is involved. Recognition may be applied not only to still pictures but also to moving pictures - indeed, the increasing availability of audio-visual material has identified a need to monitor material transmitted on television channels, or via video on demand systems, perhaps to verify that a movie film transmitted corresponds to that actually requested. Such image recognition can be done either by the service provider or by the service receiver.
In accordance with a first aspect of the present invention there is provided a method of classifying an image as one of a plurality of classes, the method comprising the steps of: analysing the image using, for each of a plurality of image types, a respective image analysis method to determine whether the image exhibits attributes characteristic of the corresponding image type; applying the respective determinations of the analysing step as inputs of a combinatorial logic function; and classifying the image as one of the plurality of classes in dependence upon the output of that combinatorial logic function of the respective determinations of the analysing step. In accordance with a second aspect of the present invention there is provided an apparatus for classifying an image as one of a plurality of classes, the apparatus comprising: means for analysing the image using, for each of a plurality of image types, a respective image analysis method to determine whether the image exhibits attributes characteristic of the corresponding image type; means for performing a combinatorial logic function upon the respective determinations of the analysing step; and means for classifying the image as one of the plurality of classes in dependence upon the output of the performing means.
Other, preferred, features of these aspects of the invention are set out in the sub-claims.
An embodiment of a method of and apparatus for image recognition incorporating image classification in accordance with the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 is a block diagram of an image recognition apparatus;
Figure 2 is a flowchart illustrating the operation of the image recognition apparatus of Figure 1; Figure 3 is an illustration of part of a database of images;
Figure 4 is an illustration of images identified as landscapes using the image recognition apparatus of Figure 1;
Figure 5 is an illustration of images identified as people using the image recognition apparatus of Figure 1; Figure 6 is an illustration of images identified as buildings using the image recognition apparatus of Figure 1;
Figure 7 is an illustration of images identified as cartoons using the image recognition apparatus of Figure 1;
Figure 8 is a flowchart showing the operation of a building detector; Figure 9 is a flowchart illustrating operation of a vertical edgelet detector;
Figures 10a, 10b and 10c show the results of application of a building detector to an image; Figure 11 is a flowchart showing the operation of a landscape detector used in the image recognition apparatus of Figure 1;
Figure 12 is a flowchart illustrating operation of a horizontal edgelet detector; Figures 13a, 13b, and 13c show the results of application of a landscape detector to an image;
Figures 14a, 14b, and 14c show the results of application of a landscape detector to a second image; and
Figures 15a, 15b, and 15c show the results of application of a landscape detector to a third image.
In Figure 1 there is shown an image recognition apparatus comprising an acquisition device 1 which is in the form of a scanner for scanning photographs or slides. In an alternative embodiment, the acquisition device is arranged to capture single frames from a video signal. Acquisition devices are well-known, as indeed is software for driving the devices and storing the resulting image in digital form. The apparatus also has an image store 2 for receiving the digital image, a processing unit 3 and a program store 4. In the present embodiment, these items are conveniently implemented in the form of a conventional desktop computer.
The program store 4 contains a program for implementing the process now to be described. The program performs the following method steps as illustrated in Figure 2.
At step 20 building recognition is performed. The process will be described in more detail later with reference to Figures 8, 9 and 10.
At step 22 landscape recognition is performed; this process will be described in more detail later with reference to Figures 11 to 15.
Cartoon recognition is then performed at step 24. Cartoon recognition may be performed, for example, using the methods described in our co-pending European applications numbers 99307971.4 and 00301687.0.
The final recognition step is performance of skin recognition at step 26. The choice of skin recognition technique used in the present invention is not significant, and can be any of such techniques known in the art.
The recognition steps 20 to 26 may be performed in any order, as each step is independent of the other steps. Finally, at step 28 logic is applied to decide upon a single classification of an image in case of conflict between the output from the recognition steps.
Figures 3 to 7 show how images in a database, a subset of which is shown in Figure 3, are classified as a building, a landscape, a person or a cartoon by the apparatus of Figure 1. It can be seen from visual inspection of this small sample that the images in Figure 4, which have been classified as landscapes, are indeed subjectively identifiable as such. Similarly all the images of Figure 5 have been correctly classified by this described embodiment of the invention as containing people within the image. The images in Figure 6 have been classified as buildings, although some of the images, for example the leftmost image in the middle row, contain people and a water feature. The images of Figure 7 have all been correctly identified as being of a cartoon nature.
It is well established that edges, as well as other primitive image features, are detected by the early human vision system; but moving from these primitive image features to recognisable boundary shapes cannot easily be emulated using computer algorithms. It is as if partial boundaries, together with textural and tonal variations, are sufficient to interpret images, with the ability to trace boundaries arising after recognition of an object rather than before. The ability thus to recognise groups of features without clearly defined boundaries as objects is known as perceptual grouping. Perceptual grouping is used in this embodiment to recognise features from images without using or requiring high level cognitive knowledge.
A method of vertical structure detection using perceptual grouping techniques is now described with reference to Figure 8. The method is used in this embodiment of the invention to recognise features of the image which have at least one prominent axial edge, such features being most commonly buildings, or man-made structures in which vertical lines are prominent.
Assume that the image is digitised as an MxN image and stored as pixel luminance values p(x,y) (x = 0 ... M-1, y = 0 ... N-1) where p(0,0) is bottom left, x is horizontal position and y vertical. If the image is stored as R, G, B values, luminance values may be calculated from these values in a conventional manner (step 80). Colour information is not used in this embodiment of the invention. Methods of edge detection are well known; however, the method described here for detection of horizontal and vertical edgelets (small edges) is specifically designed to detect edges which extend in a direction parallel to the direction of the axis of the pixels in a digital image - in other words, along an axial line, i.e. a row or a column of a rectangular image. In this specification the term axial edgelet, or edgelet point or edgelet pixel, is used to refer to a pixel which is deemed to be part of an edge, and the term linelet is used to refer to a run of axial edgelets extending along an axial line. In the specific embodiment described the linelets extend in the vertical and horizontal directions. However, if the axes of the pixels were other than the horizontal and the vertical, the technique could still be used to detect linelets extending in the relevant direction.
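By way of illustration only, the following is a minimal sketch of the conventional R, G, B to luminance conversion of step 80. The patent does not specify the coefficients; the ITU-R BT.601 weighting is assumed, and Python is used purely for exposition.

```python
import numpy as np

def rgb_to_luminance(rgb: np.ndarray) -> np.ndarray:
    """Step 80: derive pixel luminance values p(x, y) from stored R, G, B values.

    The text only says this is done 'in a conventional manner'; the
    ITU-R BT.601 weights used here are an assumption.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```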
At step 81 each pixel is analysed to decide whether the pixel is an axial edgelet forming part of a vertical linelet within the image.
Referring now to Figure 9, each pixel is analysed in turn with each vertical line of pixels being analysed in ascending order of horizontal index, and for each pixel point in each vertical line, in ascending order of vertical index.
The gradient of each point is measured as shown at steps 91 and 92. At step 91 the difference between the luminance value of the pixel to the left of the current pixel and the luminance value of the current pixel, and the difference between the luminance value of the pixel below the current pixel and the luminance value of the current pixel, are calculated. For a vertical edge it would be expected that the difference in luminance values between a pixel and the pixel below it would be small, and the difference in luminance values between a pixel and the pixel to the left of it would be large. Therefore the angle, and the gradient (dy/dx) calculated at step 92, would be small.
A decision process is then used at step 95 to decide whether or not each pixel gives evidence of a strong gradient in the horizontal direction. In this embodiment an edge function, which is normalised with respect to the maximum difference between adjacent pixels, F(x,y) = Dx * cos(Angle) / MaxDx, is calculated at step 94 and compared with a predetermined threshold at step 95.
If the function is greater than the predetermined threshold then the pixel is considered to be an axial edgelet, and part of a vertical linelet, at step 96, and E(x,y) is set to be equal to one; otherwise it is not part of a vertical linelet, as shown at step 97, and E(x,y) is set to be equal to zero.
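A compact sketch of steps 91 to 97 follows. The threshold value, the use of absolute differences, and the array orientation are not specified in the text and are assumptions.

```python
import numpy as np

def vertical_edgelet_map(lum: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Steps 91-97 of Figure 9: mark pixels showing a strong horizontal gradient.

    Dx is the luminance difference with the pixel to the left, Dy with the
    pixel below; Angle = arctan(Dy/Dx) and F = Dx*cos(Angle)/MaxDx is
    thresholded.  The threshold value, the absolute differences and the
    array orientation (row 0 at the bottom) are assumptions.
    """
    dx = np.zeros_like(lum, dtype=float)
    dy = np.zeros_like(lum, dtype=float)
    dx[:, 1:] = np.abs(lum[:, 1:] - lum[:, :-1])  # step 91: left-neighbour difference
    dy[1:, :] = np.abs(lum[1:, :] - lum[:-1, :])  # step 91: below-neighbour difference
    angle = np.arctan2(dy, dx)                    # step 92: small for a vertical edge
    max_dx = dx.max()
    f = dx * np.cos(angle) / max_dx if max_dx > 0 else np.zeros_like(dx)  # step 94
    return (f > threshold).astype(np.uint8)       # steps 95-97: E(x, y) in {0, 1}
```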
In other embodiments of the invention more sophisticated processes, such as a neural network, may be used to perform the decision process. The magnitude of the gradient vector may also be used as a parameter in the decision-making process.
Figure 10a shows an image, and Figure 10b shows the corresponding pixels which are determined to be part of a vertical linelet as determined by the method described above.
In order to decide whether or not a collection of axial edgelets as defined by E represents a building, some perceptual grouping ideas are used.
Ak = ∑i E(i,k) (the axial edgelet aggregate) is the number of points in column k which are determined, as above, to be axial edgelets. Using the continuity principle of perceptual organisation and considering the lengths of each interval of the column for which E(x,k) is continuously equal to 1, i.e. run lengths of axial edgelets, the variable ewk is defined as the number of linelets of length w in the column k. The maximum value for w is N. Wk is defined as the sum ∑ewk from w = Pmin to w = N, where Pmin is a predetermined minimum value of w which depends upon the size of the image in pixels. Therefore Wk is effectively a count of the linelets of run length greater than or equal to Pmin. Lk is defined as the sum ∑w·ewk from w = Rmin to w = N, where Rmin is a predetermined minimum value of w which again depends upon the size of the image in pixels. This means that a linelet is not counted in the sum for calculating Wk, or Lk, if it is too short.
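These three statistics can be gathered with a single run-length pass per column, as in the following sketch; the helper name and signature are illustrative only.

```python
def column_statistics(e_col, p_min, r_min):
    """Compute A_k, W_k and L_k for one image column (a 0/1 sequence E(., k)).

    A_k counts edgelet pixels, W_k counts linelets (runs of 1s) of length
    >= Pmin, and L_k sums the lengths of linelets of length >= Rmin,
    following the definitions in the text; a sketch, not the patent's code.
    """
    a_k = sum(e_col)
    runs, run = [], 0
    for e in e_col:            # collect run lengths of consecutive 1s
        if e:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    w_k = sum(1 for w in runs if w >= p_min)   # count of long-enough linelets
    l_k = sum(w for w in runs if w >= r_min)   # total length of counted linelets
    return a_k, w_k, l_k
```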
The presence of a vertical line can then be detected by a suitable analysis of these three parameters, Ak, Wk and Lk. One method is to apply the following thresholds, namely Amin, Wmin and Lmin, which, as defined below, depend upon the size of the image in pixels. Another method is to train a neural net using the three parameters as an input and with a set of classified images as a training set. Another method would be to make Wk an inverse function of the maximum length of a linelet, i.e. run of adjacent edgelets, in column k. In the apparatus of Figure 1, a vertical line is then deemed to be present in the image in the event that Ak >= Amin, Wk >= Wmin and Lk >= Lmin. This system will spot broken vertical lines, although some continuous pieces are required if either Pmin or Rmin is greater than one. It will be appreciated that if Rmin is equal to one then Ak and Lk are equivalent. If Pmin is equal to one then Wk is a count of the total number of linelets.
In a first variant, the presence of a vertical line is detected on the basis of a function of only Ak and Wk, i.e. that both Ak and Wk are greater than or equal to their respective thresholds; and in a second variant, the presence of a vertical line is detected on the basis of a function of only Lk, i.e. that Lk is greater than or equal to its threshold.
It is worth mentioning at this point that in this embodiment of the invention vertical edgelets are determined using two columns of adjacent pixels only. However, it would be possible in other embodiments to define a function F(x,y) which operates using the average gradient over more than two columns of adjacent pixels. It would also be possible to define the function W so that it operates over a plurality of columns. The number of columns over which each of these functions operates could be dependent upon the width of the image. Referring again to Figure 8, in this embodiment of the invention at step 87
Pmin, Amin, Wmin, Lmin, Rmin and Cmin are calculated based on absolute predetermined percentage values and on the size of the image between the y co-ordinate (ImageTop) of the top of the edgelet nearest to the top of the image and the y co-ordinate (ImageBottom) of the bottom of the edgelet nearest the bottom of the image, and correspondingly on the size of the image between the x co-ordinate (ImageRight) of the right of the edgelet nearest to the right of the image and the x co-ordinate (ImageLeft) of the left of the edgelet nearest the left of the image, as follows:
Pmin = Pminpercent * (ImageTop - ImageBottom)
Amin = Aminpercent * (ImageTop - ImageBottom)
Wmin = Wminpercent * (ImageTop - ImageBottom)
Lmin = Lminpercent * (ImageTop - ImageBottom)
Rmin = Rminpercent * (ImageTop - ImageBottom)
Cmin = Cminpercent * (ImageRight - ImageLeft)
Then at step 82 Wk is calculated for each column. At step 83 the number of columns for which Ak >= Amin, Wk >= Wmin, and Lk >= Lmin is calculated. If the total number is greater than Cmin, which similarly is dependent upon the width of the image, then the image is determined to be a building at step 86. If the number of columns meeting the criteria is less than Cmin, then the image is determined not to be a building at step 86. Figure 10c shows an image in which the pixels which meet the criteria of step 83 in Figure 8 are highlighted (in white).
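Putting the pieces together, here is a sketch of the building decision of steps 82 to 86, reusing the column_statistics helper above and assuming the thresholds have already been scaled at step 87; the parameter dictionary is illustrative.

```python
import numpy as np

def is_building(edgelets: np.ndarray, params: dict) -> bool:
    """Figure 8, steps 82-86: count columns whose (A_k, W_k, L_k) all reach
    their thresholds and compare the count with Cmin.

    `edgelets` is the 0/1 map E; `params` carries Pmin, Rmin, Amin, Wmin,
    Lmin and Cmin, assumed pre-scaled to the image size (step 87).  A sketch.
    """
    qualifying = 0
    for k in range(edgelets.shape[1]):             # one pass per column k
        a_k, w_k, l_k = column_statistics(edgelets[:, k],
                                          params["Pmin"], params["Rmin"])
        if a_k >= params["Amin"] and w_k >= params["Wmin"] and l_k >= params["Lmin"]:
            qualifying += 1
    return qualifying > params["Cmin"]             # step 86 decision
```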
In the first variant mentioned above, the corresponding step 83 counts columns for which Ak >= Amin and Wk >= Wmin; and in the second variant, the corresponding step 83 counts columns for which Lk >= Lmin.
Whereas in the apparatus of Figure 1 the step 87 is performed after the step 81 and before the step 82, in a variant, the step 87 is performed after the step 80 and before the step 81.
Referring now to Figure 11, a method of landscape recognition using perceptual grouping will now be described. The method is based on the observation that many landscapes have horizontal continua extending across a large proportion of the image. Sometimes one of the lines is due to the horizon, but the inventors have discovered the surprising result that many images of landscapes have such horizontal continua which do not correspond to the presence of the horizon.
At step 110 the image is converted to grey scale, if required. At step 111 each pixel is analysed to decide whether the pixel forms part of a linelet within the image. The previously described method of determining whether a pixel is a vertical edgelet is modified mutatis mutandis as follows to determine whether a pixel is a horizontal edgelet. The method of edgelet detection used in this embodiment may be used to detect edgelets which extend in a direction parallel to the orientation of pixels in the image. Referring now to Figure 12, each pixel is analysed in turn with each horizontal line of pixels being analysed in ascending order of vertical index, and for each point in each horizontal line in ascending order of horizontal index. The gradient of each point is measured as shown at steps 121 and 122. At step 121 the difference between the luminance value of the pixel to the left of the current pixel and the luminance value of the current pixel, and the difference between the luminance value of the pixel below the current pixel and the luminance value of the current pixel, are calculated. For a horizontal edge it would be expected that the difference in luminance values between a pixel and the pixel below it would be large, and the difference in luminance values between a pixel and the pixel to the left of it would be small. Therefore the angle (which is not the same angle as that described above for detecting vertical edgelets) would be small and the gradient (dy/dx), calculated at step 122, would be large. A decision process is then used at step 125 to decide whether or not each pixel gives evidence of a strong gradient in the vertical direction. In this embodiment an edge function, which is normalised with respect to the maximum difference between adjacent pixels, F(x,y) = Dy * cos(Angle) / MaxDy, is calculated at step 124 and compared with a predetermined threshold at step 125. If the function is greater than the predetermined threshold the pixel is considered to be a horizontal edgelet at step 126 and E(x,y) is set to be equal to 1; otherwise it is not a horizontal edgelet, as shown at step 127, and E(x,y) is set to be equal to 0.
As mentioned before with reference to vertical edge detection, in other embodiments of the invention more sophisticated processes, such as a neural network, may be used to perform the decision process. The magnitude of the gradient vector may also be used as a parameter in the decision-making process.
Figures 13a, 14a and 15a each show an image, and Figures 13b, 14b and 15b show the corresponding pixels which are determined to be part of a horizontal linelet as determined by the method described above.
In order to decide whether or not a collection of edgelets as defined by E represents a landscape, again, some perceptual grouping ideas are used.
Using the continuity principle of perceptual organisation, the lengths of each interval of each row for which E(r,y) is equal to 1 are considered over Hmin rows, where Hmin is dependent upon the height of the image. The variable er,l is defined as the number of intervals of length l in the rows r - Hmin to r + Hmin. What this means is that if, say, Hmin = 1 and E(r,1) = 1, E(r,2) = 0 and E(r,3) = 1, then if E(r+1,2) = 1 or E(r-1,2) = 1 the run from E(r,1) to E(r,3) is considered to be continuous over the three columns. To put it another way: if E(r+1,c) = 1 or E(r,c) = 1 or E(r-1,c) = 1 then E(r,c) is taken to be 1, where r refers to a particular row and c refers to a particular column.
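A sketch of this continuity rule as reconstructed above; the handling of the window at the image borders is an assumption.

```python
import numpy as np

def smooth_rows(e: np.ndarray, h_min: int) -> np.ndarray:
    """Continuity rule: a pixel counts as set if any of the rows
    r-Hmin .. r+Hmin has an edgelet in the same column, so short gaps in a
    row can be bridged from neighbouring rows.  A sketch only.
    """
    out = np.zeros_like(e)
    n_rows = e.shape[0]
    for r in range(n_rows):
        lo, hi = max(0, r - h_min), min(n_rows, r + h_min + 1)
        out[r] = e[lo:hi].max(axis=0)  # logical OR over the 2*Hmin+1 row window
    return out
```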
The maximum possible value for l is M. Lr is given by the sum ∑l·er,l from l = Rmin to l = M, where Rmin is a predetermined minimum value of l which depends upon the size of the image in pixels. For horizontal line recognition, in this embodiment of the invention, one threshold Lmin is used, which also depends upon the size of the image in pixels. The presence of a horizontal line is then detected by the system in the event Lr >= Lmin.
Considering E(x,y) over Hmin rows effectively defines the function W to operate over Hmin rows.
Referring again to Figure 11, at step 117 Hmin and Lmin are calculated based on absolute predetermined percentage values and on the size of the image between the rightmost x co-ordinate (ImageRight) of the edgelet nearest to the right of the image and the leftmost x co-ordinate (ImageLeft) of the edgelet nearest the left of the image as follows:
Hmin = Hminpercent * (ImageRight - ImageLeft)
Lmin = Lminpercent * (ImageRight - ImageLeft)
Then at step 112 Lr is calculated for each row. At step 113 the maximum value of Lr, Lmax, is calculated. If Lmax is greater than Lmin then the image is determined to be a landscape at step 116. Otherwise the image is determined not to be a landscape at step 116. Figures 13c, 14c and 15c show images in which the horizontal lines responsible for Lmax are highlighted at 130, 140 and 150 respectively. It is worth noting that these horizontal lines do not necessarily coincide with the horizon. In the method just described, it would be possible to terminate the algorithm once an Lr is calculated which is greater than Lmin. However, it is desirable to calculate the greatest Lr, i.e. Lmax, as then the 'best' horizontal line can be detected in the image.
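The landscape decision can then be sketched as follows, reusing the smooth_rows and column_statistics helpers above; the thresholds are assumed to be pre-scaled as at step 117, and the parameter names are illustrative.

```python
import numpy as np

def is_landscape(e_horiz: np.ndarray, r_min: int, l_min: float, h_min: int) -> bool:
    """Figure 11, steps 112-116: compute L_r for every row of the smoothed
    horizontal-edgelet map and test whether Lmax exceeds Lmin.  A sketch.
    """
    e = smooth_rows(e_horiz, h_min)
    l_max = 0
    for r in range(e.shape[0]):
        # column_statistics works on any 0/1 sequence; only L is needed here
        _, _, l_r = column_statistics(e[r], p_min=1, r_min=r_min)
        l_max = max(l_max, l_r)       # keep the 'best' horizontal line
    return l_max > l_min
```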
The following table enumerates all possibilities for each of the four recognition steps and shows the final classification which results. A '1' in a particular column indicates that the corresponding recognition step gave a positive identification.
[Table omitted from this text: it lists the sixteen combinations of the Cartoon, Building, Landscape and Skin detector outputs together with the resulting classification.]
Thus, it can be seen from the above table that the four detectors, namely Cartoon, Building, Landscape and Skin, are effectively ranked in that order, and that the classification provided by the image recognition apparatus is that of the highest ranking detector giving a positive determination. For example, if both cartoon and building detectors give a positive determination, then for this embodiment, the classification is "cartoon", i.e. the classifications are mutually exclusive.
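A sketch of this ranking expressed as a combinatorial logic function; the mapping of a positive skin detection to the "person" class and the fallback for the all-negative case are assumptions, since the enumerating table is not reproduced in this text.

```python
def classify(cartoon: bool, building: bool, landscape: bool, skin: bool) -> str:
    """Step 28: the four detector outputs feed a combinatorial logic function
    ranking them Cartoon > Building > Landscape > Skin; the image takes the
    class of the highest-ranking positive detector.  'person' for skin and
    'unclassified' for the all-negative row are assumptions.
    """
    if cartoon:
        return "cartoon"
    if building:
        return "building"
    if landscape:
        return "landscape"
    if skin:
        return "person"
    return "unclassified"
```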
In an alternative embodiment, other classifications are permitted, e.g. if both cartoon and building detectors give a positive determination, then for that alternative embodiment, the classification is "cartoon + building", i.e. that classification is for cartoons of buildings. The skilled person will be able to find other such "combination" classifications.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising" and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".

Claims

1. A method of classifying an image as one of a plurality of classes, the method comprising the steps of: analysing the image using, for each of a plurality of image types, a respective image analysis method to determine whether the image exhibits attributes characteristic of the corresponding image type; applying the respective determinations of the analysing step as inputs of a combinatorial logic function; and classifying the image as one of the plurality of classes in dependence upon the output of that combinatorial logic function of the respective determinations of the analysing step.
2. A method according to claim 1 , in which the combinatorial logic function defines at least one of said classes as a combination of two or more of said image types.
3. A method according to claim 1 , in which the combinatorial logic function ranks the image analysis methods whereby the output corresponds to the highest ranking positive determination.
4. A method according to claim 3, in which the highest ranked image analysis method is cartoon detection.
5. A method according to claim 3, in which the highest ranked image analysis method is building detection, other than when the image analysis methods include cartoon detection, in which case building detection is outranked only by cartoon detection.
6. A method according to claim 3, in which the highest ranked image analysis method is landscape detection, other than when the image analysis methods include cartoon detection and building detection, in which case landscape detection is outranked only by cartoon detection and building detection.
7. An apparatus for classifying an image as one of a plurality of classes, the apparatus comprising: means for analysing the image using, for each of a plurality of image types, a respective image analysis method to determine whether the image exhibits attributes characteristic of the corresponding image type; means for performing a combinatorial logic function upon the respective determinations of the analysing step; and means for classifying the image as one of the plurality of classes in dependence upon the output of the performing means.
8. An apparatus according to claim 7, in which the combinatorial logic function of the performing means defines at least one of said classes as a combination of two or more of said image types.
9. An apparatus according to claim 7, in which the combinatorial logic function of the performing means ranks the image analysis methods whereby the output corresponds to the highest ranking positive determination.
10. A method of classifying an image as one of a plurality of classes according to claim 1 , and substantially as herein described with reference to the drawings.
11. An apparatus for classifying an image as one of a plurality of classes according to claim 7, and substantially as herein described with reference to the drawings.
PCT/GB2001/001962 2000-05-04 2001-05-04 Image processing WO2001084498A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001256465A AU2001256465A1 (en) 2000-05-04 2001-05-04 Image processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP00303768 2000-05-04
EP00303768.6 2000-05-04

Publications (2)

Publication Number Publication Date
WO2001084498A2 true WO2001084498A2 (en) 2001-11-08
WO2001084498A3 WO2001084498A3 (en) 2003-03-06

Family

ID=8172964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/001962 WO2001084498A2 (en) 2000-05-04 2001-05-04 Image processing

Country Status (2)

Country Link
AU (1) AU2001256465A1 (en)
WO (1) WO2001084498A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0621556A2 (en) * 1993-04-21 1994-10-26 Eastman Kodak Company A process for combining the results of several classifiers

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0621556A2 (en) * 1993-04-21 1994-10-26 Eastman Kodak Company A process for combining the results of several classifiers

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BLANZ W E ET AL: "Design and implementation of a low-level image segmentation architecture-LISA" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION. ATLANTIC CITY, JUNE 16 - 21, 1990. CONFERENCE A: COMPUTER VISION AND CONFERENCE B: PATTERN RECOGNITION SYSTEMS AND APPLICATIONS, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, vol. 1 CONF. 10, 16 June 1990 (1990-06-16), pages 413-431, XP010020426 ISBN: 0-8186-2062-5 *
GORKANI M M ET AL: "Texture orientation for sorting photos at a glance" PATTERN RECOGNITION, 1994. VOL. 1 - CONFERENCE A: COMPUTER VISION & IMAGE PROCESSING., PROCEEDINGS OF THE 12TH IAPR INTERNATIONAL CONFERENCE ON JERUSALEM, ISRAEL 9-13 OCT. 1994, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, 9 October 1994 (1994-10-09), pages 459-464, XP010216044 ISBN: 0-8186-6265-4 *
SMITH J R ET AL: "Content-based transcoding of images in the Internet" PROCEEDINGS OF THE 1998 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING. ICIP '98. CHICAGO, IL, OCT. 4 - 7, 1998, INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, LOS ALAMITOS, CA: IEEE COMPUTER SOC, US, vol. 3 CONF. 5, 4 October 1998 (1998-10-04), pages 7-11, XP002154787 ISBN: 0-8186-8822-X *

Also Published As

Publication number Publication date
WO2001084498A3 (en) 2003-03-06
AU2001256465A1 (en) 2001-11-12

Similar Documents

Publication Publication Date Title
US7657090B2 (en) Region detecting method and region detecting apparatus
US7103215B2 (en) Automated detection of pornographic images
JP4976608B2 (en) How to automatically classify images into events
US20060221181A1 (en) Video ghost detection by outline
KR20010033552A (en) Detection of transitions in video sequences
CN105184823B (en) The evaluation method for the moving object detection algorithm performance that view-based access control model perceives
JP2011053959A (en) Image processing apparatus, subject discrimination method, program, and storage medium
CN109741325B (en) Intelligent detection method for wiring verticality
JPH09319870A (en) Texture video classification device
CN110473216A (en) The detection method and device of object in a kind of image
JP5264457B2 (en) Object detection device
JP2000090239A (en) Image retrieving device
CN111199538B (en) Privacy protection degree evaluation method for multilayer compressed sensing image
US7231086B2 (en) Knowledge-based hierarchical method for detecting regions of interest
US20040161152A1 (en) Automatic natural content detection in video information
JP3483912B2 (en) Color discriminating apparatus and color discriminating method
WO2001084498A2 (en) Image processing
CN1279491C (en) Image analysis
WO2001084497A2 (en) Image processing
CN115620259A (en) Lane line detection method based on traffic off-site law enforcement scene
JP5253194B2 (en) Object detection device
Haque et al. Robust background subtraction based on perceptual mixture-of-Gaussians with dynamic adaptation speed
CN109670495A (en) A kind of method and system of the length text detection based on deep neural network
JP2015106408A (en) Image processing apparatus and image processing method
Gupta et al. A comparative performance evaluation of segmented image with obstacle for textural coarseness

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP