WO2009069071A1 - Method and system for three-dimensional object recognition - Google Patents

Method and system for three-dimensional object recognition

Info

Publication number
WO2009069071A1
WO2009069071A1 (PCT/IB2008/054935)
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
corner
feature
dimensional feature
camera
Prior art date
Application number
PCT/IB2008/054935
Other languages
French (fr)
Inventor
Richard P. Kleihorst
Anthony Martiniere
Serafim Efstratiadis
Original Assignee
Nxp B.V.
Priority date
Filing date
Publication date
Application filed by Nxp B.V. filed Critical Nxp B.V.
Publication of WO2009069071A1 publication Critical patent/WO2009069071A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features

Abstract

Three-dimensional features are obtained from multiple views, i.e. by using multiple cameras that view a scene from different sides, and by having the cameras collaborate to establish a correct description of such a three-dimensional feature. In this way real three-dimensional features are obtained, i.e. not merely two-dimensional projections of three-dimensional features. A 3D feature descriptor is computed using 2D feature descriptors and the camera positions in space (known a priori).

Description

Method and system for three-dimensional object recognition
FIELD OF THE INVENTION
The invention relates to a video camera system, in particular to a method and a system for three-dimensional object recognition in images of scenes observed by said video camera system.
BACKGROUND OF THE INVENTION
Object recognition is a procedure to determine which of a set of objects is present in an image of a scene observed by a video camera system. Generally speaking, the first step in object recognition is to build a database of known objects. The database is populated with data that may be obtained in several ways, for example by controlled observation of known objects. The second step in object recognition is to match a new observation of a previously viewed object with its representation in the database.
Prior work in object recognition can be divided into two basic approaches: geometry-based approaches and appearance-based approaches. Broadly speaking, geometry-based approaches rely on matching the geometric structure of an object. Appearance-based approaches rely on using the intensity values of one or more spectral bands in the camera image; this may be grey-scale, color, or other image values.
The field of three-dimensional (3D) object recognition has been investigated extensively in recent years. While human beings can easily recognize arbitrary 3D objects in arbitrary situations, computer vision algorithms can only solve the object recognition problem in constrained conditions.
Most model-based 3D object recognition systems use information from a single view.
For example, see the object recognition systems described in: "Model-based recognition of 3D objects from single images", I. Weiss and M. Ray, IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(2), pages 116-128, 2001; "Recognition and Reconstruction of 3D Objects Using Model Based Perceptual Grouping", I.K. Park, K.M. Lee and S.U. Lee, Proc. 15th International Conference on Pattern Recognition, pages 720-724, 2000; "Towards True 3D Object Recognition", J. Ponce, S. Lazebnik, F. Rothganger and C. Schmid, Proc. CVPR, Vol. II, pages 272-277, 2003. Unfortunately, a single view may not contain sufficient information to recognize the object, because the detected features depend on the camera viewpoint and the viewing geometry. Features are defined as key points in images, for example corners, centers of areas, edges etc. These features are used to create an abstract description of parts of the image. This abstract description can be used for depth estimation and object recognition, for example. Object recognition is performed by comparing features detected in the image with a set of stored features from a model object in a database. A single-view approach may not be suitable for 3D object recognition.
To overcome this problem there has been some research on combining data from several views, i.e. from several cameras, in order to recognize the object of interest. For example, see: "Multi-view Technique For 3D Polyhedral Object Recognition Using Surface Representation", M.F.S. Farias and J.M. de Carvalho, Revista Controle & Automacao, 10(2), pages 107-117, 1999; "Integration of Multiple Feature Groups and Multiple Views into a 3D Object Recognition System", J. Mao, P.J. Flynn and A.K. Jain, Computer Vision and Image Understanding, 62(3), pages 309-325, 1995; "3D object recognition system using multiple views and cascaded multilayered perceptron network", M.K. Osman, M.Y. Mashor and M.R. Arshad, Cybernetics and Intelligent Systems, IEEE Conference, Vol. 2, pages 1011-1015, 2004.
However, these 3D object recognition systems are based on the combination of two-dimensional (2D) features detected from different views. A 3D object recognition based on such a combination of 2D features, detected from completely different angles of view, requires the use of reliable 2D features. Unfortunately, this is difficult to achieve. As a result of the 2D projection of the real 3D space, some key features are not found in the images because they become deformed in the projection process. This hinders object recognition and leads to unreliable results. In order to properly recognize 3D objects from 2D images, a large number of features must be used to describe such an object, which requires a lot of processing resources. Also, the model object is stored as multiple sets of features in the database, i.e. a set of features for each viewing direction, in order to find a match with the object to be recognized in the image. Hence, the database needs to have a relatively large size.
SUMMARY OF THE INVENTION
It is an object of the invention to perform three-dimensional (3D) object recognition in an accurate and efficient way. This object is achieved by the method according to claim 1 and by the system according to claim 6. According to the invention, three-dimensional features are obtained from multiple views, i.e. by using multiple cameras that view a scene from different sides, and by having the cameras collaborate to establish a correct description of such a three-dimensional feature. In this way real three-dimensional features are obtained, i.e. not merely two-dimensional projections of three-dimensional features. A 3D feature descriptor is computed using 2D feature descriptors and the camera positions in space (known a priori).
Up to now, object recognition or scene description has been done on the basis of 2D feature finding; these 2D features are mere projections of 3D features onto the image plane. As a result of this mismatch and feature deformation, object recognition has to be done by detecting an excessive number of features and by storing multiple views of the objects, captured under different angles, in the database. The invention introduces multiple cameras that see the object and scene from different sides. The network of calibrated cameras is able to see the scene in 3D. By collaborating, the cameras can establish the real 3D features and their location in space. This makes the description of the scene both simpler and more expressive, and it reduces the number of features needed for recognizing 3D objects. An example of such a 3D feature is a 3D corner. With collaborative cameras, 3D corners can be detected as follows. For example, one camera finds corners in its captured image and then compares its findings with corners found by the other cameras. If one or some of the other corners fall on the epipolar lines of the camera set-up, a 3D construction is found. By high-level reasoning it is checked whether the feature is a real 3D corner: each of the cameras should then see the corner at specific orientations. The result is that this 3D corner and its position in space can be used for an effective description of structures in space and for genuine 3D object recognition. If "orientation" is taken as the 2D feature descriptor, a certain combination of the orientations, seen from different views, results in a specific 3D feature descriptor. The combination depends on the 2D feature descriptors and is deduced by a high-level reasoning algorithm.
Advantageous embodiments of the invention are defined in the dependent claims. BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is described in more detail with reference to the drawings, in which:
Fig. 1 illustrates an example of corner orientation deduced after background subtraction;
Fig. 2 illustrates a general example of 3D corner detection with a camera set-up comprising four cameras;
Fig. 3 illustrates an example of an algorithm according to the invention;
Fig. 4 illustrates an example of a database containing for each object information about its 3D corners.
DESCRIPTION OF PREFERRED EMBODIMENTS
The invention is based on the idea to recognize objects in 3D space using a collaborative multi-view camera system. The system used for the experiments is the Wireless Camera (WiCa) platform, developed by NXP Research. Each camera is equipped with a Xetal 3D processor, dedicated to video processing, and a communication module using a ZigBee protocol. The Xetal 3D processor is combined with a 30 frames per second color VGA-format image sensor. The processor is fully programmable and therefore able to run a variety of computer vision algorithms. Xetal 3D is able to achieve high computational performance (up to 50 GOPS) with modest power consumption.
The aim of the method according to the invention is to detect 3D objects, using their 3D features and their center of mass. In order to achieve this, the method combines several 2D features, i.e. 2D feature descriptors, obtained from different views by different cameras, and then defines a new type of 3D feature. The Intersecting Line Technique, which has been described in "Embedded Object Recognition Using Smart Cameras and the Relative Position of Feature Points", D. Rankin, Master Thesis Report, University of Glasgow, pages 54-59, 2007, can be used to recognize the 3D object, if it is extended to 3D. The camera network is assumed to be calibrated in space.
According to a preferred embodiment of the invention, the particular 3D feature is a 3D corner. Each smart camera uses corner detection as its feature detector and corner orientation as its feature descriptor. In order to find the 3D features, the 2D corner descriptions from the collaborating cameras need to be combined into a 3D corner description; thus, the corner orientations computed from different views are compared. The 3D corners detected are used to find the center of mass of the object in 3D space. The new 3D corner is therefore defined by comparing the corner orientations computed from different views. It is assumed that a spatial calibration of the cameras has been performed beforehand. The following basic steps are performed:
1) Corner Detection
The aim is to define a 3D corner based on the real-time detection of 2D corners from each camera. A common corner detector can be used for corner detection, for example the Harris-Stephens corner operator ("A combined corner and edge detector", C. Harris and M.J. Stephens, Alvey Vision Conference, pages 147-152, 1988). The Harris-Stephens algorithm is not only sensitive to corners, but also to local image regions which have a high degree of variation in all directions. Therefore, all interest points are detected in the image, containing all corner orientations. The detected corners can be limited to those of the object itself by background subtraction.
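It is noted that the following minimal sketch of a Harris-Stephens-style corner response is given for illustration only; the function name, parameter values and the NumPy/SciPy-based implementation are assumptions of this sketch and not part of the application.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_corners(gray, k=0.04, sigma=1.5, threshold=1e-4):
    """Minimal Harris-Stephens corner detector (illustrative sketch).

    gray: 2D float array in [0, 1]. Returns a boolean mask of corner candidates.
    """
    # Image gradients
    ix = sobel(gray, axis=1)
    iy = sobel(gray, axis=0)

    # Elements of the local structure tensor, smoothed by a Gaussian window
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)

    # Harris response: det(M) - k * trace(M)^2
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    response = det - k * trace * trace

    # Keep strong responses; the operator also fires on highly textured regions,
    # so a subsequent background subtraction step restricts corners to the object.
    return response > threshold * response.max()
```

In practice the response map would also be non-maximum-suppressed before the background subtraction step described above.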
2) Corner Description
As soon as a 2D corner is detected, a descriptor needs to be assigned to this corner. For this purpose the corner orientation, defined as the direction from the corner towards the object, can be used. A relevant corner is detected on the object boundary. The orientation is computed by looking at the angle of the edges around the interest point and by performing a background subtraction algorithm on the current scene. Comparing the edge orientations alone only gives the direction of the corner, not its orientation, which depends on the position of the object.
Fig. 1 illustrates an example of corner orientation deduced after background subtraction. Fig. 1 shows a corner (C) and the corresponding edges (E1, E2). For each corner point, it needs to be defined on which side of the edges the object of interest is positioned. This is done by background subtraction, which defines the relative position of the object with respect to the position of the edges. The dotted vector represents the other orientation that would be detected if no background subtraction is applied. Alternatively, any of the object's grey-level values could be used instead of background subtraction, but this would be more sensitive to permanent luminosity changes.
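The orientation step can be sketched as follows, assuming a boolean foreground mask produced by background subtraction; the local-window averaging heuristic and all names are assumptions made for illustration, not the exact procedure of the application.

```python
import numpy as np

def corner_orientation(corner, fg_mask, radius=5):
    """Orientation of a 2D corner as the unit vector pointing from the corner
    into the object (illustrative sketch).

    corner:  (row, col) pixel position of a detected corner.
    fg_mask: boolean foreground mask obtained by background subtraction.
    """
    r, c = corner
    h, w = fg_mask.shape
    rows, cols = np.mgrid[max(r - radius, 0):min(r + radius + 1, h),
                          max(c - radius, 0):min(c + radius + 1, w)]
    fg = fg_mask[rows, cols]
    if not fg.any():
        return None  # no object pixels nearby: not a corner of the object
    # The mean offset of foreground pixels around the corner points into the object;
    # without the mask, the opposite direction (the dotted vector in Fig. 1)
    # would be equally plausible.
    dr = (rows[fg] - r).mean()
    dc = (cols[fg] - c).mean()
    norm = np.hypot(dr, dc)
    return (dr / norm, dc / norm) if norm > 0 else None
```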
Fig. 2 illustrates a general example of 3D corner detection with a camera set-up comprising four cameras. Fig. 2 shows how 2D corner orientations are combined in order to find a 3D feature. Assuming a system calibrated in space, the 2D vectors can be interpreted and the presence of a 3D corner can be deduced. In practice, for each corner detected in one camera, the pixels on the other cameras corresponding to the same position in space are considered. If a corner exists at this position, its orientation is used to establish the shape of the 3D corner. By applying the same process in each camera, the system can deal with occlusions.
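The cross-camera check can be sketched as follows: for a corner detected in camera i, corners in camera j that are consistent with the same position in space are selected by means of the epipolar line derived from the spatial calibration. The use of a fundamental matrix F_ij and the pixel tolerance are assumptions of this sketch.

```python
import numpy as np

def corners_on_epipolar_line(corner_i, corners_j, F_ij, tol=2.0):
    """Corners in camera j consistent with a corner seen in camera i (illustrative sketch).

    corner_i:  (x, y) pixel in camera i.
    corners_j: iterable of (x, y) pixels detected in camera j.
    F_ij:      3x3 fundamental matrix from camera i to camera j (known by calibration).
    """
    p_i = np.array([corner_i[0], corner_i[1], 1.0])
    line = F_ij @ p_i                           # epipolar line a*x + b*y + c = 0 in camera j
    a, b, c = line
    scale = np.hypot(a, b)
    matches = []
    for (x, y) in corners_j:
        dist = abs(a * x + b * y + c) / scale   # point-to-line distance in pixels
        if dist < tol:
            matches.append((x, y))
    return matches
```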
Fig. 3 illustrates an example of an algorithm according to the invention. Prior to executing this algorithm, a 3D background subtraction method (see for example "Nonstationary Background Removal via Multiple Camera Collaboration", H. Lee, C. Wu, and H. Aghajan, Proc. of 1st International Conference on Distributed Smart Cameras, Vienna, Austria, Sept 2007) may remove any detected corners which do not belong to the object.
In Fig. 3, 'Pi' and 'Pj' are pixels located on cameras i and j, respectively. The variable 'tab' is a table containing the corner orientations from each camera; its length is equal to the number of cameras used. The function 'Correspondence()' computes a correspondence rate between the 2D corner orientations. The goal is not to minimize the distance between the feature vectors, but to compare their difference relative to the positions of the cameras in space. The correspondence rate gets higher if the relation between the different orientations is verified. This rate depends on the number of distributed cameras and their layout in space. The positions of the cameras in space are assumed to be known by calibration.
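The application does not spell out the scoring used by 'Correspondence()'. The sketch below shows one plausible correspondence rate under stated assumptions (camera rotations known by calibration, a hypothesized 3D direction from the corner into the object); it is an illustration, not the claimed method.

```python
import numpy as np

def correspondence(orientations, rotations, hypothesis):
    """Correspondence rate between 2D corner orientations from several cameras
    (illustrative sketch; the scoring formula is an assumption).

    orientations: list of 2D unit vectors, the corner orientation in each camera image.
    rotations:    list of 3x3 world-to-camera rotation matrices (known by calibration).
    hypothesis:   3D unit vector, a hypothesized direction from the corner into the object.
    """
    scores = []
    for o2d, R in zip(orientations, rotations):
        cam_dir = R @ hypothesis                  # hypothesized direction in the camera frame
        proj = cam_dir[:2]                        # its projection onto the image plane
        n = np.linalg.norm(proj)
        if n < 1e-9:
            continue                              # direction points along the optical axis
        agreement = float(np.dot(proj / n, np.asarray(o2d)))  # cosine of predicted vs observed
        scores.append(max(agreement, 0.0))
    # A real 3D corner should yield a high rate for some hypothesis; unrelated
    # 2D corners seen from different cameras should not.
    return float(np.mean(scores)) if scores else 0.0
```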
The proposed 3D recognition process is based on the Intersecting Line Technique.
This method has the advantage of being simple, fast, scale-invariant and robust to occlusions. It uses a database of objects which contains, for each feature point, the type of the feature and its 3D line gradient to the object center of mass.
It is noted that the Intersecting Line Technique has been explained in European patent application EP07104583, titled "Object recognition method and device", filed by the applicant on 21 March 2007. Fig. 4 illustrates an example of a database containing for each object information about its 3D corners. Generally speaking, the database contains for each object the number of 3D features, the type of 3D features, and the line gradients. Figure 4 shows an example of how to create such a database for a simple cubic object. There are 8 different types of 3D corners and the corresponding 3D line gradients. The centre of mass is calculated by taking the average x, y and z values of the object feature point coordinates. So, for an object consisting of n 3D feature points, the centre of mass is expressed as:
CoMx = (x1 + ... + xn)/n
CoMy = (y1 + ... + yn)/n
CoMz = (z1 + ... + zn)/n
where CoMx, CoMy and CoMz are its coordinates.
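A short sketch of the centre-of-mass computation and of a possible database entry for the cubic object of Fig. 4 is given below; the dictionary layout and the unit-cube coordinates are assumptions made for illustration.

```python
import numpy as np

def centre_of_mass(points_3d):
    """Centre of mass of an object's n 3D feature points, as in the formulas above."""
    pts = np.asarray(points_3d, dtype=float)    # shape (n, 3)
    return pts.mean(axis=0)                     # (CoMx, CoMy, CoMz)

# Illustrative database entry for a unit cube: for each of the 8 corner types,
# the 3D line gradient from the corner towards the centre of mass is stored.
cube_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
               (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
com = centre_of_mass(cube_points)
cube_entry = {
    "num_features": len(cube_points),
    "features": [
        {"type": f"corner_{i}",                 # one of the 8 corner types
         "gradient": tuple(com - np.array(p))}  # 3D line gradient towards the CoM
        for i, p in enumerate(cube_points)
    ],
}
```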
Once the shape is defined in the database and the feature points have been extracted from the image, the two must be combined to recognize the object. Each feature point detected in the image is considered separately. When a feature from the image is processed, then all occurrences of the same feature type in the database are retrieved. From these retrieved database entries the associated line is drawn on the image, starting at where the feature point has been detected. More than one line can emerge from a specific image feature point if this type of feature is used multiple times in the object shape stored in the database. The newly drawn line emanates from the image feature point in the direction of the expected location of the centre of mass.
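The combination step can be sketched as a line-intersection vote, here carried out directly in 3D; the data layout, the pairwise closest-point test and the final averaging are assumptions of this sketch, since the technique is only described qualitatively above.

```python
import numpy as np
from itertools import combinations

def estimate_com_by_intersection(detected, database, tol=3.0):
    """Estimate the object centre of mass by intersecting the drawn lines (illustrative sketch).

    detected: list of (feature_type, position) pairs, position a 3D point.
    database: dict mapping feature_type -> list of 3D line gradients to the CoM.
    """
    # One line per (detected feature, matching database gradient); a feature type stored
    # several times in the object shape contributes several lines, as described above.
    lines = [(np.asarray(pos, float), np.asarray(g, float))
             for ftype, pos in detected
             for g in database.get(ftype, [])]

    votes = []
    for (p1, d1), (p2, d2) in combinations(lines, 2):
        # Closest points of the two 3D lines p1 + t*d1 and p2 + s*d2
        a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
        w0 = p1 - p2
        d, e = d1 @ w0, d2 @ w0
        denom = a * c - b * b
        if abs(denom) < 1e-9:
            continue                       # parallel lines give no useful intersection
        t = (b * e - c * d) / denom
        s = (a * e - b * d) / denom
        q1, q2 = p1 + t * d1, p2 + s * d2
        if np.linalg.norm(q1 - q2) < tol:
            votes.append((q1 + q2) / 2)    # near-intersection supports a CoM candidate

    # Lines drawn from the correct object shape converge on the centre of mass;
    # a real system would cluster the votes rather than simply averaging them.
    return np.mean(votes, axis=0) if votes else None
```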
It is noted that the computation tasks necessary to perform the method according to the invention may be performed by one or more processors which form part of the video camera system. The skilled person will be able to select appropriate processing means for these computation tasks in accordance with the amount of processing involved and the required functionality. For example, some tasks may be performed by a general-purpose processor, whereas other tasks may be performed by a dedicated microcontroller.
Furthermore, it is proposed to use color to obtain more information than grey-scale under certain conditions. This leads to a more accurate detection in certain cases. In the case of the algorithm described above, the goal is to decrease the number of lines in order to decrease the number of false detections. Therefore, if the feature color is known, the image can be segmented and only the interest region can be kept. In this way, a smaller number of corners is detected and fewer line gradients will be drawn on the screen. It is noted that 3D features do not have to be corners; they can be any 3D feature whose shape can be deduced from collaborating 2D views. It is remarked that the scope of protection of the invention is not restricted to the embodiments described herein. Neither is the scope of protection of the invention restricted by the reference symbols in the claims. The word 'comprising' does not exclude other parts than those mentioned in a claim. The word 'a(n)' preceding an element does not exclude a plurality of those elements. Means forming part of the invention may be implemented either in the form of dedicated hardware or in the form of a programmed general-purpose processor. The invention resides in each new feature or combination of features.

Claims

CLAIMS:
1. A method for three-dimensional (3D) object recognition using a collaborative camera network comprising at least two cameras,
(a) wherein each camera captures an image, detects at least one two-dimensional (2D) feature in the image, and assigns a two-dimensional feature descriptor to the two-dimensional feature;
(b) wherein a three-dimensional feature descriptor is derived from the two-dimensional feature descriptors and from the camera positions in space;
(c) wherein the three-dimensional feature descriptor is compared to three-dimensional feature information stored in a database, in order to recognize a three-dimensional object.
2. A method as claimed in claim 1, wherein the two-dimensional feature is a 2D corner, the 2D feature descriptor is the orientation of the 2D corner, the three-dimensional feature is a 3D corner, and the three-dimensional feature descriptor is the orientation, relative to a camera in the collaborative camera network, of the 3D corner.
3. A method as claimed in claim 2, wherein the 2D corner is detected using the Harris-Stephens corner operator.
4. A method as claimed in claim 2, wherein the orientation of the 2D corner is computed by performing a background subtraction algorithm on the current scene.
5. A method as claimed in claim 1, wherein the Intersecting Line Technique is used to recognize the three-dimensional object.
6. A system for three-dimensional (3D) object recognition in a collaborative camera network comprising at least two cameras,
(a) wherein each camera is arranged to capture an image, to detect at least one two-dimensional (2D) feature in the image, and to assign a two-dimensional feature descriptor to the two-dimensional feature;
(b) further comprising means for deriving a three-dimensional feature descriptor from the two-dimensional feature descriptors and from the camera positions in space;
(c) and means for comparing the three-dimensional feature descriptor to three-dimensional feature information stored in a database, in order to recognize a three-dimensional object.
PCT/IB2008/054935 2007-11-28 2008-11-25 Method and system for three-dimensional object recognition WO2009069071A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07121746.7 2007-11-28
EP07121746 2007-11-28

Publications (1)

Publication Number Publication Date
WO2009069071A1 (en) 2009-06-04

Family

ID=40527503

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/054935 WO2009069071A1 (en) 2007-11-28 2008-11-25 Method and system for three-dimensional object recognition

Country Status (1)

Country Link
WO (1) WO2009069071A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006002320A2 (en) * 2004-06-23 2006-01-05 Strider Labs, Inc. System and method for 3d object recognition using range and intensity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SEHGAL A ET AL: "3D object recognition using Bayesian geometric hashing and pose clustering", PATTERN RECOGNITION, ELSEVIER, GB, vol. 36, no. 3, 1 March 2003 (2003-03-01), pages 765 - 780, XP004393107, ISSN: 0031-3203 *
SEUNGDO JEONG ET AL: "Design of a Simultaneous Mobile Robot Localization and Spatial Context Recognition System", KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS; [LECTURE NOTES IN COMPUTER SCIENCE;LECTURE NOTES IN ARTIFICIAL INTELLIGENCE;LNCS], SPRINGER-VERLAG, BERLIN/HEIDELBERG, vol. 3683, 17 August 2005 (2005-08-17), pages 945 - 952, XP019015615, ISBN: 978-3-540-28896-1 *
STEIN F ET AL: "Structural hashing: efficient three dimensional object recognition", PROCEEDINGS OF THE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. LAHAINA, MAUI, HAWAII, JUNE 3 - 6, 1991; [PROCEEDINGS OF THE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION], LOS ALAMITOS, IEEE. COMP., vol. -, 3 June 1991 (1991-06-03), pages 244 - 250, XP010023215, ISBN: 978-0-8186-2148-2 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9508009B2 (en) 2013-07-19 2016-11-29 Nant Holdings Ip, Llc Fast recognition algorithm processing, systems and methods
US9690991B2 (en) 2013-07-19 2017-06-27 Nant Holdings Ip, Llc Fast recognition algorithm processing, systems and methods
US9904850B2 (en) 2013-07-19 2018-02-27 Nant Holdings Ip, Llc Fast recognition algorithm processing, systems and methods
US10628673B2 (en) 2013-07-19 2020-04-21 Nant Holdings Ip, Llc Fast recognition algorithm processing, systems and methods
US9646384B2 (en) 2013-09-11 2017-05-09 Google Technology Holdings LLC 3D feature descriptors with camera pose information
US9501498B2 (en) 2014-02-14 2016-11-22 Nant Holdings Ip, Llc Object ingestion through canonical shapes, systems and methods
US10095945B2 (en) 2014-02-14 2018-10-09 Nant Holdings Ip, Llc Object ingestion through canonical shapes, systems and methods
US10832075B2 (en) 2014-02-14 2020-11-10 Nant Holdings Ip, Llc Object ingestion through canonical shapes, systems and methods
US11380080B2 (en) 2014-02-14 2022-07-05 Nant Holdings Ip, Llc Object ingestion through canonical shapes, systems and methods
US11748990B2 (en) 2014-02-14 2023-09-05 Nant Holdings Ip, Llc Object ingestion and recognition systems and methods

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08853500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08853500

Country of ref document: EP

Kind code of ref document: A1