GB2523776A - Methods for 3D object recognition and registration - Google Patents

Methods for 3D object recognition and registration Download PDF

Info

Publication number
GB2523776A
GB2523776A GB1403826.9A GB201403826A GB2523776A GB 2523776 A GB2523776 A GB 2523776A GB 201403826 A GB201403826 A GB 201403826A GB 2523776 A GB2523776 A GB 2523776A
Authority
GB
United Kingdom
Prior art keywords
feature
comparing
scale
ball
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1403826.9A
Other versions
GB201403826D0 (en)
GB2523776B (en)
Inventor
Minh-Tri Pham
Frank Perbert
Bjorn Stenger
Riccardo Gherardi
Oliver Woodford
Sam Johnson
Roberto Cipolla
Stephan Liwicki
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1403826.9A priority Critical patent/GB2523776B/en
Publication of GB201403826D0 publication Critical patent/GB201403826D0/en
Priority to US14/468,733 priority patent/US20150254527A1/en
Priority to JP2015041618A priority patent/JP2015170363A/en
Publication of GB2523776A publication Critical patent/GB2523776A/en
Application granted granted Critical
Publication of GB2523776B publication Critical patent/GB2523776B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/653 Three-dimensional objects by matching three-dimensional models, e.g. conformal mapping of Riemann surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/97 Determining parameters from multiple pictures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

A method for comparing a plurality of objects obtained from point cloud data. The method comprises representing at least one feature of each object as a 3D ball representation, the radius of each ball representing the scale of the feature with respect to the frame of the object and the position of each ball representing the translation of the feature in the frame of the object. The objects may then be compared by comparing the scale and translation as represented by the 3D balls to determine similarity between objects and their poses. The feature locations may be aligned using a hash table and the comparison may be carried out using a search tree. The comparison process may be used for object recognition, wherein votes are assigned to predicted object poses and positions and the comparison process is used to determine the vote that provides the best match to an object.

Description

Methods for 3D Object Recognition and Registration
FIELD
Embodiments of the present invention as described herein are generally concerned with
the field of object registration and recognition.
BACKGROUND
Many computer vision and image processing applications require the ability to recognise and register objects from a 3D image.
Such applications often recognise key features in the image and express these features in a mathematical form. Predictions of the object and its pose, termed votes, can then be generated and a selection between different votes is made.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic of an apparatus used for capturing a 3-D image;
Figure 2 is an image demonstrating a feature;
Figure 3(a) is a point cloud generated from a captured 3-D image of an object and figure 3(b) shows the image of figure 3(a) with the extracted features;
Figure 4 is a flow chart showing how votes are generated;
Figure 5 is a flow chart showing the construction of a hash table from training data;
Figure 6 is a flow chart showing the steps for selecting a vote using the hash table;
Figure 7 is a flow chart showing a variation on the flow chart of figure 6 where rotation of the poses is also considered;
Figure 8 is a plot showing a 2D method for comparing distances between points;
Figure 9 is a plot showing the results of a 3D method for comparing distances between points;
Figures 10(a) to 10(d) are plots showing the performance of different measures for comparing arrays of rotations for different distributions of the rotations;
Figure 11 is a flow chart showing the construction of a vantage point search tree from training data;
Figure 12 is a flow chart showing the steps for selecting a vote using the search tree of figure 11; and
Figure 13 is a schematic of a search tree of the type used in figures 11 and 12.
DETAILED DESCRIPTION OF THE DRAWINGS
According to one embodiment, a method for comparing a plurality of image data relating to objects is provided, the method comprising representing at least one feature of each object as a 3D ball representation, the radius of each ball representing the scale of the feature with respect to the frame of the object, the position of each ball representing the translation of the feature in the frame of the object, the method further comprising comparing the objects by comparing the scale and translation as represented by the 3D balls to determine similarity between objects and their poses.
The frame of the object is defined as a local coordinate system of the object. In an example, the origin of the local coordinate system is at the center of the object, the three axes are aligned to a pre-defined 3D orientation of the object, and one unit length of an axis corresponds to the size of the object.
In a further embodiment, the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object and wherein comparing the object comprises comparing the scale, translation and rotation as defined by the 3D ball representations. The 3D orientation is assigned to a 3D ball which will be referred to as a 3D ball with 3D orientation, or a 3D oriented ball. Technically, a 3D ball is represented by a direct dilatation and a 3D oriented ball is represented by a direct similarity.
In an embodiment, comparing the scale and translation comprises comparing a feature of a first object with a feature of a second object to be compared with the first object using a hash table, said hash table comprising entries relating to the scale and translation of the features of the second object hashed using a hash function relating to the scale and translation components, the method further comprising searching the hash table to obtain a match of a feature from the first object with that of the second object.
In the above embodiment, the hash function may be described by:

$h(X) := \eta(\phi(X_D))$

where $h(X)$ is the hash function of direct similarity $X$, $X_D := \begin{bmatrix} X_s I_3 & X_t \\ 0^T & 1 \end{bmatrix}$ is the dilatation part of the direct similarity $X$, where $X_s$ is the scale part of direct similarity $X$ and $X_t$ is the translation part of direct similarity $X$, $\phi(X_D) := (\ln X_s,\ X_t^T/X_s)^T$, and $\eta$ is a quantizer.
In this embodiment, the hash table may comprise entries for all rotations for each scale and translation component.
The hash table may be used both with 3D ball representations that do not contain rotation information and with 3D ball representations that further comprise information about the rotation of the feature with respect to the frame of the object. In the latter case, comparing the objects comprises comparing the scale, translation and rotation as defined by the 3D ball representations, the method further comprising comparing the rotations stored in each hash table entry, once a match has been achieved for the scale and translation components, to compare the rotation of the feature of the first object with that of the second object.
Many different measures can be used for comparing the rotations in 3D. In an embodiment, the rotations are compared using a cosine based distance in 3D. For example, the cosine based distance may be expressed as:

$d(r_a, r_b)^2 = 1 - \frac{1}{N}\sum_{j=1}^{N}\left[\frac{1 + v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}-\alpha_{b,j}) + \frac{1 - v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}+\alpha_{b,j})\right]$

where $r_a = (v_a, \alpha_a)$ and $r_b = (v_b, \alpha_b)$ are arrays of 3D rotations represented in the axis-angle representation, $v_{a,j}$ and $\alpha_{a,j}$ respectively denote the rotation axis and the rotation angle of the j-th component of the array $r_a$, and $v_{b,j}$ and $\alpha_{b,j}$ respectively denote the rotation axis and the rotation angle of the j-th component of the array $r_b$.

The above embodiment has suggested the use of a hash table to search for the nearest features between two objects to be compared. However, in an embodiment, this may instead be achieved by comparing a feature of a first object with a feature of a second object using a search tree, said search tree comprising entries representing the scale and translation components of features in the second object, the scale and translation components being compared using a closed-form formula.
Here, the search tree is used to locate nearest neighbours between the features of the first object and the second object. The scale and translation components may be compared by measuring the Poincare distance between the two features. For example, the distance measure may be expressed as:

$d_1(x, y) = \cosh^{-1}\!\left(1 + \frac{\|c_x - c_y\|^2 + (r_x - r_y)^2}{2\, r_x r_y}\right)$

where $d_1(x,y)$ represents the distance between two balls $x$ and $y$ that are represented by $x = (r_x; c_x)$ and $y = (r_y; c_y)$, where $r_x, r_y > 0$ denote the radii, $c_x, c_y \in \mathbb{R}^3$ denote the ball centres in 3D, and $\cosh(\cdot)$ is the hyperbolic cosine function.
The search tree may also be used when the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object, and wherein comparing the objects comprises comparing the scale, translation and rotation as defined by the 3D ball representations using the formula:

$d_2(x, y) := \sqrt{a_1\, d_1(x,y)^2 + a_2\, \|R_x - R_y\|_F^2}$

where $d_1(x,y)$ represents the distance between the two balls $x$ and $y$ as defined above, the two balls $x$ and $y$ are associated with two 3D orientations represented as two 3-by-3 rotation matrices $R_x, R_y \in SO(3)$, the term $\|R_x - R_y\|_F^2$ represents a distance function between two 3D orientations via the Frobenius norm, and the coefficients $a_1, a_2 > 0$. In a further embodiment, the distance function between two 3D orientations is the cosine based distance $d(r_a, r_b)$ above.

In an embodiment, a method for object recognition is provided, the method comprising: receiving a plurality of votes, wherein each vote corresponds to a prediction of an object's pose and position; for each vote, assigning 3D ball representations to features of the object, wherein the radius of each ball represents the scale of the feature with respect to the frame of the object and the position of each ball represents the translation of the feature in the frame of the object; determining the vote that provides the best match by comparing the features as represented by the 3D ball representations for each vote with a database of 3D representations of features for a plurality of objects and poses, wherein comparing the features comprises comparing the scale and translation as represented by the 3D balls; and selecting the vote with the greatest number of features that match an object and pose in said database.
In the above embodiment, the 3D ball representations assigned to the votes and to the objects and poses in the database further comprise information about the rotation of the feature with respect to the frame of the object, and determining the vote comprises comparing the scale, translation and rotation as defined by the 3D ball representations.
In the above method, receiving a plurality of votes may comprise: obtaining 3D image data of an object; identifying features of said object and assigning a description to each feature, wherein each description comprises an indication of the characteristics of the feature to which it relates; comparing said features with a database of objects, wherein said database of objects comprises descriptions of features of known objects; and generating votes by selecting objects whose features match at least one feature identified from the 3D image data.
In a further embodiment, a method of registering an object in a scene may be provided, the method comprising: obtaining 3D data of the object to be registered; obtaining 3D data of the scene; extracting features from the object to be registered and extracting features from the scene to determine a plurality of votes, wherein each vote corresponds to a prediction of an object's pose and position in the scene, and comparing the object to be registered with the votes using a method as described above to identify the presence and pose of the object to be registered.
In a yet further embodiment, an apparatus for comparing a plurality of objects is provided, the apparatus comprising a memory configured to store 3D data of the objects comprising at least one feature of each object as a 3D ball representation, the radius of each ball representing the scale of the feature with respect to the frame of the object, the position of each ball representing the translation of the feature in the frame of the object, the apparatus further comprising a processor configured to compare the objects by comparing the scale and translation as represented by the 3D balls to determine similarity between objects and their poses.
Since the embodiments of the present invention can be implemented by software, embodiments of the present invention encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal.
A system and method in accordance with a first embodiment will now be described.
Figure 1 shows a possible system which can be used to capture the 3-D data. The system basically comprises a camera 35, an analysis unit 21 and a display (not shown).
In an embodiment, the camera 35 is a standard video camera and can be moved by a user. In operation, the camera 35 is freely moved around an object which is to be imaged. The camera may be simply handheld. However, in further embodiments, the camera is mounted on a tripod or other mechanical support device. A 3D point cloud may then be constructed using the 2D images collected at various camera poses. In other embodiments a 3D camera or other depth sensor may be used, for example a stereo camera comprising a plurality of fixed-apart apertures, a camera which is capable of projecting a pattern onto said object, LIDAR sensors and time-of-flight sensors. Medical scanners such as CAT scanners and MRI scanners may also be used to provide the data. Methods for generating a 3D point cloud from these types of cameras and scanners are known and will not be discussed further here.
The analysis unit 21 comprises a section for receiving camera data from camera 35. The analysis unit 21 comprises a processor 23 which executes a program 25. Analysis unit 21 further comprises storage 27. The storage 27 stores data which is used by program 25 to analyse the data received from the camera 35. The analysis unit 21 further comprises an input module 31 and an output module 33. The input module 31 is connected to camera 35. The input module 31 may simply receive data directly from the camera 35 or, alternatively, the input module 31 may receive camera data from an external storage medium or a network. In use, the analysis unit 21 receives camera data through input module 31. The program executed on processor 23 analyses the camera data using data stored in the storage 27 to produce 3D data and recognise the objects and their poses. The data is output via the output module 33, which may be connected to a display (not shown) or other output device, either local or networked. In figure 4, the 3D point cloud of the scene is obtained in step S101. From the 3D point cloud, local features in the form of 3D balls together with their descriptions are extracted from the point cloud of the input scene in step S103. This may be achieved using a known multi-scale keypoint detector such as SURF-3D or ISS. Figure 2 shows an example of such an extracted feature. The feature corresponds to a corner of the object and can be described using a descriptor vector or the like, for example a spin-image descriptor or a descriptor that samples a set number of points close to the origin of the feature.
Figure 3(a) shows a point cloud of an object 61 and figure 3(b) shows the point cloud of the object 61 after feature extraction, the features being shown as circles 63.
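By way of illustration, the following is a minimal sketch of this feature-extraction step. It assumes the Open3D library's ISS keypoint detector is available; the radius values and the simple local-patch descriptor are hypothetical choices for illustration and are not part of the method described above.

```python
import numpy as np
import open3d as o3d  # assumed third-party library providing an ISS keypoint detector

def extract_ball_features(pcd_path, salient_radius=0.02, non_max_radius=0.02):
    """Detect keypoints and return them as 3D balls (centre, radius) with descriptors."""
    pcd = o3d.io.read_point_cloud(pcd_path)
    # ISS keypoint detection; the radii here are illustrative and data-dependent.
    keypoints = o3d.geometry.keypoint.compute_iss_keypoints(
        pcd, salient_radius=salient_radius, non_max_radius=non_max_radius)
    points = np.asarray(pcd.points)
    features = []
    for c in np.asarray(keypoints.points):
        # Use the detector's support radius as the ball radius (an assumption);
        # a multi-scale detector would report a per-keypoint scale instead.
        r = salient_radius
        # A toy descriptor: sorted distances of the 32 nearest points to the centre.
        d = np.sort(np.linalg.norm(points - c, axis=1))[:32]
        features.append({"centre": c, "radius": r, "descriptor": d})
    return features
```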
At test time, features extracted from the scene are matched with previously extracted features from training data by comparing their descriptions, generating an initial set of votes in step S105. The votes are hypotheses predicting the object identity along with its pose, consisting of a position and an orientation, and additionally a scale if scales are unknown. The feature locations are then aligned in step S107, and the best vote is selected and returned as the final prediction in step S109.
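A minimal sketch of this vote-generation step is given below. Descriptor matching is implemented here with a k-d tree, which is one possible way of comparing descriptions; the dictionary keys and the pose-estimation helper are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def generate_votes(scene_features, train_features):
    """Match scene descriptors to training descriptors; each match yields a vote.

    Each training feature is assumed to carry the identity of its object and
    enough information to derive a candidate pose, so a matched pair predicts
    an object identity together with a pose hypothesis.
    """
    train_desc = np.stack([f["descriptor"] for f in train_features])
    tree = cKDTree(train_desc)
    votes = []
    for sf in scene_features:
        dist, idx = tree.query(sf["descriptor"], k=1)
        tf = train_features[idx]
        votes.append({
            "object_id": tf["object_id"],
            # How a pose is computed from a single correspondence is
            # detector-specific; this helper is hypothetical.
            "pose": estimate_pose_from_match(sf, tf),
        })
    return votes
```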
In an embodiment, step S107 of aligning the feature locations is executed using a hash
table.
Figure 5 is a flow diagram showing the steps for constructing the hash table from the training data.
In this embodiment, the more general case of 3D recognition in which object scale varies will be considered, and object poses and feature locations are treated as direct similarities. For notational convenience, $X_s$, $X_R$ and $X_t$ will denote the scale, rotation and translation parts respectively of a direct similarity $X$. The steps of the flow diagram of figure 5 will generally be performed off-line.
In the offline phase, training data is collected for each object type to be recognized. In step S151, all feature locations that occur in the training data are collected. The features extracted from the training data are processed for each object (i) and each training instance (j) of that object. In step S153 the object count (i) is set to 1 and processing of the i-th object starts in step S155. Next, the training instance count (j) for that object is set to 1 and processing of the j-th training instance begins in step S159. Next, the selected features are normalized via left-multiplication with their corresponding object pose's inverse. This brings the features to be normalised into the object space in step S161.
Next, a hash table is created such that all normalised locations of object i are stored in a single hash table $H_i$, in which hash keys are computed based on the scale and translation components. The design of the hash function $h(\cdot)$ is detailed below. The value of a hash entry is the set of rotations of all normalized locations hashed to it.
The scale and translation parts of a direct similarity form a transformation called a (direct) dilatation, in the space:

$DT(3) := \left\{ \begin{bmatrix} X_s I_3 & X_t \\ 0^T & 1 \end{bmatrix} : X_s > 0,\ X_t \in \mathbb{R}^3 \right\}$   (1)

where $X_D := \begin{bmatrix} X_s I_3 & X_t \\ 0^T & 1 \end{bmatrix}$ denotes the dilatation part of a direct similarity $X$. Given a query direct similarity $X$, $X_D$ is converted into a 4D point via a map $\phi: DT(3) \to \mathbb{R}^4$,

$\phi(X_D) := (\ln X_s,\ X_t^T / X_s)^T.$   (2)

Then, the 4D point is quantized into a 4D integer vector, i.e. a hash key, via a quantizer $\eta: \mathbb{R}^4 \to \mathbb{Z}^4$,

$\eta(x) := \left( \left\lfloor \tfrac{x_1}{\sigma_s} \right\rfloor, \left\lfloor \tfrac{x_2}{\sigma_t} \right\rfloor, \left\lfloor \tfrac{x_3}{\sigma_t} \right\rfloor, \left\lfloor \tfrac{x_4}{\sigma_t} \right\rfloor \right)^T$   (3)

where $\sigma_s$ and $\sigma_t$ are parameters that enable making trade-offs between scale and translation, and the operator $\lfloor \cdot \rfloor$ takes the integer part of a real number. Thus, the hash function $h(\cdot)$ is defined as $h(X) := \eta(\phi(X_D))$.
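By way of illustration, the hash key $h(X) = \eta(\phi(X_D))$ may be computed as in the following minimal sketch, assuming poses and feature locations are stored as 4x4 direct-similarity matrices and using illustrative values for $\sigma_s$ and $\sigma_t$:

```python
import numpy as np

def hash_key(X, sigma_s=0.1, sigma_t=0.05):
    """Hash key of a direct similarity X given as a 4x4 matrix [[X_s*R, X_t], [0, 1]].

    Only the dilatation part (scale and translation) is hashed; the rotation part
    is stored in the hash entry itself. sigma_s and sigma_t are illustrative
    quantisation step sizes trading off scale against translation.
    """
    X_s = np.linalg.det(X[:3, :3]) ** (1.0 / 3.0)  # scale part of the similarity
    X_t = X[:3, 3]                                  # translation part
    # phi: map the dilatation part to the 4D point (ln X_s, X_t / X_s).
    p = np.concatenate(([np.log(X_s)], X_t / X_s))
    # eta: quantise the 4D point into a 4D integer key.
    steps = np.array([sigma_s, sigma_t, sigma_t, sigma_t])
    return tuple(np.floor(p / steps).astype(int))

# Example: the identity pose hashes to the origin cell, hash_key(np.eye(4)) == (0, 0, 0, 0).
```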
An efficient hash table should ensure that every hash entry is accessed with roughly the same probability, so that collisions are minimized. To achieve this, $\phi(\cdot)$ is designed so that the following lemma holds.

Lemma 1. The Euclidean volume element of $\mathbb{R}^4$ is pulled back via $\phi$ to a left-invariant 4-form on $DT(3)$.

Proof. Denote by $D(x) := dx_1\,dx_2\,dx_3\,dx_4$ the Euclidean volume element at $X = \phi^{-1}(x)$. To prove the lemma, it is sufficient to show that for all $Y \in DT(3)$ and $x \in \mathbb{R}^4$:

$D(x) = D(\phi(Y\,\phi^{-1}(x))).$   (4)

Let $y = \phi(Y)$. Substituting (2) into (4) yields:

$\phi(Y\,\phi^{-1}(x))$   (5)

$= \phi\!\left( \begin{bmatrix} Y_s e^{x_1} I_3 & Y_s e^{x_1} x_{2:4} + Y_t \\ 0^T & 1 \end{bmatrix} \right)$   (6)

$= \left( \ln Y_s + x_1,\ x_{2:4}^T + e^{-x_1} Y_t^T / Y_s \right)^T.$   (7)

It can be seen from (7) that the Jacobian determinant of (5), viewed as a function of $x$, is equal to 1. Therefore, $D(\phi(Y\phi^{-1}(x))) = |1|\,dx_1\,dx_2\,dx_3\,dx_4 = D(x)$.
Lemma 1 implies that if the dilatations are uniformly distributed in $DT(3)$, i.e. distributed according to a (left-) Haar measure, their coordinates via $\phi(\cdot)$ are uniformly distributed in $\mathbb{R}^4$, and vice versa. Combining this with the fact that the quantizer $\eta$ partitions $\mathbb{R}^4$ into cells of equal volume, it can be deduced that if the dilatations are uniformly distributed, their hash keys are uniformly distributed.
Algorithm 1 below shows the off-line training phase as described above with reference to figure 5.

Algorithm 1: Offline phase: creating hash tables.
Input: training feature locations F and poses C.
1: for all objects i:
2:     Create hash table H_i.
3:     for all training instances j of the object:
4:         for all features k of the training instance:
5:             X := C_{i,j}^{-1} F_{i,j,k}
6:             Find/insert hash entry V := H_i(h(X)).
7:             V := V ∪ {X_R}.
8: Return the hash tables H_i.

Here, F and C are multi-index lists such that F_{i,j,k} denotes the i-th object's j-th training instance's k-th feature location, and C_{i,j} denotes the i-th object's j-th training instance's pose.
Figure 6 is a flow diagram showing the steps of matching features from a scene using the hash table as described with reference to figure 5. The same feature detector should be used in the off-line training phase and the on-line phase.
In step S201, the search space is restricted to the 3D ball features selected from the scene. Each ball feature is assigned to a vote, which is a prediction of the object's identity and pose. In step S203, the vote counter v is set to 1. In step S205, the features from vote v are selected.
In step S207, the scene feature locations, denoted by S for that vote, are left-multiplied with the inverse of the vote's predicted pose to normalise the features from the vote with respect to the object.
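A minimal sketch of this normalisation step, assuming poses and feature locations are stored as 4x4 direct-similarity matrices, is:

```python
import numpy as np

def normalise_features(scene_features, predicted_pose):
    """Left-multiply each scene feature location by the inverse of the vote's pose.

    scene_features: list of 4x4 direct-similarity matrices (the locations S).
    predicted_pose: 4x4 direct-similarity matrix Y predicted by the vote.
    Returns the feature locations expressed in the object's own frame.
    """
    Y_inv = np.linalg.inv(predicted_pose)
    return [Y_inv @ S_j for S_j in scene_features]
```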
Next, in step S209, each feature is compared with the training data using the hash table $H_i$ constructed as explained with reference to figure 5. The number of matching features for a particular vote is calculated. The process then determines whether there are any further votes available in step S211. If further votes are available, the next vote is selected in step S213 and the process is repeated from step S205. Once all votes have been analysed, the vote with the highest number of matching features is selected in step S215 as the predicted pose and object.
In the methods of the above embodiments, the votes are selected by comparing the feature locations and not the feature descriptions; this exploits the geometry of the object as a whole.
The above methods have only used the feature locations. However, in a further embodiment, the rotations of the features are also considered. Returning to the collection of training data as described with reference to figure 5, the hash table is created in step S163. Each hash entry is the set of rotations of all normalised locations hashed to it. When rotation is compared, the hash table is operated in the same manner as described before, but each hash entry will contain a set of rotations.
When rotations are compared as described above, the on-line phase is similar to the on-line phase described with reference to figure 6. To avoid unnecessary repetition, like reference numerals will be used to denote like features. The process proceeds in the same manner as described with reference to figure 6 up to step S209. However, in figure 7, there is a further step S210 in which the rotation of the feature from the scene is compared with the set of rotations located at the hash entry. If the hash entry matches the selected feature for scale and translation, the match will be discounted if there is no match on rotation.
The process then progresses to step S211, where the process checks to see whether the last vote has been reached. If the last vote has not been reached, the process selects the next vote and loops back to step S205.
Once all votes have been processed, the vote with the largest number of matching features is selected.
The above process can be achieved with the following algorithm:

Algorithm 2: Online phase: vote evaluation.
Parameters: hash tables H_i and scene feature locations S.
Input: vote = (object identity i, pose Y).
1: u := 0.
2: for all scene features j:
3:     X := Y^{-1} S_j
4:     Find hash entry V := H_i(h(X)).
5:     if found:
6:         u := u + 1 - min_{R ∈ V} d(R, X_R)².
7: Return u.
Thus, the array of scene features, and in particular their rotations, is compared to the training data. Note that, as explained above, the method does not involve any feature descriptions, as only pose is required. Therefore, the geometry of an object as a whole is exploited and not the geometry of local features.
The rotations can be compared using a number of different methods. In an embodiment a 3D generalisation of the 2D cosine distance is used.
A robust cosine-based distance between gradient orientations can be used for matching arrays of rotation features. Given an image $I_i$, the direction of the intensity gradient at each pixel is recorded as a rotation angle $r_{i,j}$, $j = 1, \ldots, N$, i.e. the j-th angle value of the i-th image. The squared distance between two images, $I_a$ and $I_b$, is provided by:

$d(r_a, r_b)^2 = 1 - \frac{1}{N}\sum_{j=1}^{N}\cos(r_{a,j} - r_{b,j})$   (8)

The distance function and its robust properties can be visualized as shown in figure 8.
The advantages of this type of distance function stem from the sum of cosines. In particular, for an uncorrelated area with random angle directions, the angle differences are almost uniformly distributed, such that $\mathrm{E}[\cos(r_{a,j} - r_{b,j})] \approx 0$ and the distance tends to 1. However, for highly correlated arrays of rotations, the distance is near 0.
Thus, while inliers have more effect and pull the distance towards 0, outliers have less effect and shift the distance towards 1, not 2.
In 2D, rotation was solely provided by an angle $\alpha_{i,j}$. In 3D, it can be assumed that the rotations are described as an angle-axis pair $r_{i,j} = (\alpha_{i,j}, v_{i,j}) \in SO(3)$. In an embodiment, the following distance function can be used for comparing arrays of 3D rotations:

$d(r_a, r_b)^2 = 1 - \frac{1}{N}\sum_{j=1}^{N}\left[\frac{1 + v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}-\alpha_{b,j}) + \frac{1 - v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}+\alpha_{b,j})\right]$   (9)

It should be noted that $\frac{1 + v_{a,j}\cdot v_{b,j}}{2} + \frac{1 - v_{a,j}\cdot v_{b,j}}{2} = 1$, i.e. both terms act as a weighting. The weight is carefully chosen to depend on the angle between the rotations' unit axes.
The special properties of the weight are shown in figure 9. Consider two rotations, $r_{a,j}$ and $r_{b,j}$. If both share the same axis, $v_{a,j} = v_{b,j}$, the dot product $v_{a,j}\cdot v_{b,j} = 1$ and the distance turns into its 2D counterpart in (8). In the case of opposing axes, $v_{a,j} = -v_{b,j}$, the dot product $v_{a,j}\cdot v_{b,j} = -1$ and the sign of $\alpha_{b,j}$ is flipped. Notice that $(\alpha_{b,j}, -v_{b,j})$ and $(-\alpha_{b,j}, v_{b,j})$ describe the same rotation; hence, again the problem is reduced to (8). A combination of both parts is employed when $-1 < v_{a,j}\cdot v_{b,j} < 1$. The proposed cosine-based distance in 3D can be thought of as comparing the strength of rotations. If rotations are considered "large" and "small" according to their angles, it seems sensible to favour similar angles. The robust properties of the above 3D distance function stem from the fairly evenly distributed distance values of random rotations.
The mean of outliers is near the centre of the distance values, while similar rotations are close to 0. This corresponds to the robust properties of the cosine distance in 2D.
The above described 3D distance induces a new representation for 3D rotations, which allows for efficient and robust comparison. This will hereinafter be termed the full-angle quaternion (FAQ) representation.
The squared distance can be rewritten as follows:

$d(r_a, r_b)^2 = 1 - \frac{1}{N}\sum_{j=1}^{N}\left[\cos\alpha_{a,j}\cos\alpha_{b,j} + (v_{a,j}\cdot v_{b,j})\sin\alpha_{a,j}\sin\alpha_{b,j}\right]$   (10)

$= \frac{1}{2N}\sum_{j=1}^{N}(\cos\alpha_{a,j} - \cos\alpha_{b,j})^2 + \frac{1}{2N}\sum_{j=1}^{N}\left\|v_{a,j}\sin\alpha_{a,j} - v_{b,j}\sin\alpha_{b,j}\right\|^2$   (11)

$= \frac{1}{2N}\sum_{j=1}^{N}\left\|q_{a,j} - q_{b,j}\right\|^2$   (12)

where $q_{i,j}$ is a unit quaternion given by:

$q_{i,j} := \cos\alpha_{i,j} + (i\, v_{i,j,1} + j\, v_{i,j,2} + k\, v_{i,j,3})\sin\alpha_{i,j}.$   (13)

The above equation defines the FAQ representation. Here, the trigonometric functions $\cos(\cdot)$ and $\sin(\cdot)$ are applied to the full angle $\alpha_{i,j}$, instead of the half angle $\alpha_{i,j}/2$. Thus, each 3D rotation corresponds to exactly one unit quaternion under FAQ. In addition, the above equation shows that the new distance proposed above has the form of the Euclidean distance under the new FAQ representation.
The mean of 3D rotations under FAQ is global and easy to compute. Given a set of unit quaternions, the mean is computed simply by summing up the quaternions and dividing the result by its quaternion norm. The FAQ representation comes with a degenerate case, as every 3D rotation by 180° maps to the same unit quaternion: q = (-1; 0; 0; 0).
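A minimal sketch of the FAQ representation and of the distance in equations (9) to (13), assuming rotations are supplied in axis-angle form, is:

```python
import numpy as np

def faq(axis, angle):
    """Full-angle quaternion of a rotation given in axis-angle form.

    Unlike the usual (half-angle) quaternion, cos and sin are applied to the
    full angle, so each 3D rotation maps to exactly one unit quaternion.
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle)], np.sin(angle) * axis))

def faq_distance_sq(rots_a, rots_b):
    """Squared 3D cosine-based distance between two arrays of rotations.

    rots_a, rots_b: lists of (axis, angle) pairs of equal length N.
    Equals (1 / 2N) * sum_j ||q_{a,j} - q_{b,j}||^2, as in equation (12).
    """
    qa = np.stack([faq(v, a) for v, a in rots_a])
    qb = np.stack([faq(v, a) for v, a in rots_b])
    return np.sum((qa - qb) ** 2) / (2.0 * len(rots_a))

# Identical arrays give distance 0; random rotation pairs give values near 1.
```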
The above new FAQ representation can be used to compare the rotation of the scene feature with the set of rotations at each hash entry. Unlike the general case of robust matching of 3D rotations, where both inputs can be corrupted, it can be assumed that the rotation of a training feature is usually an inlier, since the training data is often clean. Thus, the method mostly compares a rotation from the scene with an inlier. To utilize this fact, apart from using equation (9), a left-invariant version of it is used:

$d_{LI}(R, X_R)^2 := d(I, R^{-1} X_R)^2$   (14)

where $I$ is the 3-by-3 identity matrix, $R$ is the rotation of a training feature, and $X_R$ is a rotation from the scene.

$d_{LI}(R, X_R)^2 = \tfrac{1}{2}\left[(1 - \cos\alpha)^2 + \sin^2\alpha\,\|v\|^2\right]$   (15)

$= 1 - \cos\alpha$   (16)

$= \tfrac{1}{2}\left\|\mathrm{faq}(I) - \mathrm{faq}(R^{-1}X_R)\right\|^2$   (17)

where $\alpha$ and $v$ are respectively the angle and axis of $R^{-1}X_R$, and $\mathrm{faq}(\cdot)$ denotes the FAQ representation of a rotation matrix.
The above embodiment has compared rotations using the new FAQ representation described above. However, other embodiments can use alternative methods for comparing rotation. Most of these are Euclidean distances (and variants) under different representations of 3D rotations. The Euler angles distance is the Euclidean distance between Euler angles. L2-norms of differences of unit quaternions under the half-angle quaternion (HAQ) representation lead to the vectorial/extrinsic quaternion distance and the inverse cosine quaternion distance. Analysis of geodesics on $SO(3)$ leads to intrinsic distances such as the L2-norm of rotation vectors (RV), i.e. the axis-angle representation. The Euclidean distance in the embedding space $\mathbb{R}^{3\times 3}$ of $SO(3)$ induces the chordal/extrinsic distance between rotation matrices (RM). In an embodiment, an extrinsic distance measure is used, e.g. the Euclidean distance of embedding spaces, based on the HAQ and RM representations, due to their efficient closed forms and their connections to efficient rotation means.
Figure 10 compares the new 3D distance measure described above with the HAQ, RM and RV distances. When similar rotations are compared (fig. 10(a)), the RV representation is sensitive to rotations with angles close to 180°; here the normalized distance may jump from near 0 to near 1. All other methods are able to identify close rotations successfully. When comparing random rotations (fig. 10(b)), RM and RV strongly bias the results either towards small or large distances. The distance under HAQ and the 3D cosine-based distance, on the other hand, are more evenly distributed.
The 3D cosine-based distance shows similar properties to the distance under RM when utilized for rotations with similar rotation axes (fig. 10(c)). Here HAQ produces overall smaller distances. The distance under RV is quite unstable for this setup, as no real trend can be seen. However, when exposed to similar rotation angles (fig. 10(d)), it behaves similarly to the 3D cosine-based distance; RM shows a bias towards large distances, while HAQ has an even distribution of distances.
The new cosine-based distance in 3D can be thought of as comparing the strength of rotations. If rotations are considered "large" and "small" according to their angles, it seems sensible to favour similar angles. The robust properties of the 3D cosine-based distance function stem from the fairly evenly distributed distance values of random rotations. In an embodiment, for the 3D cosine-based distance, there is a maximum distribution of 20% in a single bin.
The mean of outliers is near the centre of the distance values, while similar rotations are close to 0. This corresponds to the robust properties of the cosine distance in 2D.
The above embodiments have used a hash table to match features between the scene and the training data. However, in a further embodiment, a different method is used.
Here, a vantage point search tree is used, as shown in figure 11. In the offline phase, training data is collected for each object type to be recognized. In step S351, all feature locations that occur in the training data are collected. The features extracted from the training data are processed for each object (i) and each training instance (j) of that object. In step S353 the object count (i) is set to 1 and processing of the i-th object starts in step S355. Next, the training instance count (j) for that object is set to 1 and processing of the j-th training instance begins in step S359.
Next, the selected features are normalized via left-multiplication with their corresponding object pose's inverse. This brings the features to be normalised into the object space in step S361.
In step S363, the process checks to see whether all instances of an object have been processed.
If not, the training instance count is incremented in step S365 and the features from the next training instance are processed. Once all of the training instances are processed, a search tree is constructed. In an embodiment, the search tree is a vantage point search tree of the type which will be described with reference to figure 13.
In step S367, a vantage point and a threshold C are selected. The tree for an object is then constructed with respect to this vantage point. In an embodiment, the vantage point and threshold are chosen so as to divide the set of features from the training data roughly into two groups. However, in other embodiments the vantage point is selected at random. The vantage point has a threshold C. The distance of each training feature from the vantage point is determined.
In an embodiment, a closed-form solution is used for computing the distance of a feature from the vantage point, the vantage point being expressed in the same terms as a feature. In one embodiment, the features are expressed as 3D balls which represent the scale and translation of the features. Two balls x and y are given by $x = (r_x; c_x)$ and $y = (r_y; c_y)$, where $r_x, r_y > 0$ denote the radii and $c_x, c_y \in \mathbb{R}^3$ denote the ball centres in 3D. The formula below compares x and y as a distance function:

$d_1(x, y) = \cosh^{-1}\!\left(1 + \frac{\|c_x - c_y\|^2 + (r_x - r_y)^2}{2\, r_x r_y}\right)$   (18)

where $\cosh(\cdot)$ is the hyperbolic cosine function. The distance is known in the literature as the Poincare distance. In a further embodiment, the features are also expressed and compared in terms of rotation. If two balls x and y are associated with two 3D orientations, represented as two 3-by-3 rotation matrices $R_x, R_y \in SO(3)$, they can be compared using the following distance function:

$d_2(x, y) = \sqrt{a_1\, d_1(x, y)^2 + a_2\, \|R_x - R_y\|_F^2}$   (19)

where the second term $\|R_x - R_y\|_F^2$ represents a distance function between two 3D orientations via the Frobenius norm, and the coefficients $a_1, a_2 > 0$ are pre-defined by the user, which enables making trade-offs between the two distance functions. In practice, $a_1 = a_2 = 1$ can be set to obtain good performance, but other values are also possible. Different distance measures can be used in equation (19); for example, the distance function between two 3D orientations via the Frobenius norm can be substituted by the distance of equation (9).
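A minimal sketch of the two distance functions of equations (18) and (19) is:

```python
import numpy as np

def ball_distance(rx, cx, ry, cy):
    """Poincare distance d1 between two 3D balls (radius, centre), equation (18)."""
    num = np.sum((np.asarray(cx) - np.asarray(cy)) ** 2) + (rx - ry) ** 2
    return np.arccosh(1.0 + num / (2.0 * rx * ry))

def oriented_ball_distance(rx, cx, Rx, ry, cy, Ry, a1=1.0, a2=1.0):
    """Distance d2 between two oriented balls, equation (19).

    Rx, Ry are 3x3 rotation matrices; a1 = a2 = 1 is the default trade-off
    suggested in the text.
    """
    d1 = ball_distance(rx, cx, ry, cy)
    d_rot = np.linalg.norm(Rx - Ry, ord="fro")
    return np.sqrt(a1 * d1 ** 2 + a2 * d_rot ** 2)

# Identical balls give distance 0, e.g. ball_distance(0.1, [0, 0, 0], 0.1, [0, 0, 0]) == 0.0.
```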
Depending on whether the features are to be compared using scale and translation, or scale, translation and rotation, equation (18) or equation (19) respectively is used to calculate the distance. The tree is constructed from the training data as a binary search tree. Once the training data has been divided into two groups by selection of the vantage point and threshold, each of the two groups is then subdivided into a further two groups by selection of a suitable point and threshold for each group. The search tree is constructed until the training data cannot be divided further. Once a search tree has been established for one object, the process moves to step S371, where a check is performed to see if there is training data available for further objects.
If further training data is available, the process selects the next object at step S373 and then repeats the process from step S359 until search trees have been constructed for each object in the training data.
Figure 12 is a flow diagram showing the on-line phase. In the same manner as described with reference to figure 6, in step S501, the search space is restricted to the 3D ball features selected from the scene. Each ball feature is assigned to a vote, which is a prediction of the object's identity and pose. In step S503, the vote counter v is set to 1. In step S505, the features from vote v are selected.
In step S507, the scene feature locations denoted by S for that vote are left multiplied with the inverse of the vote's predicted pose to normalise the features from the vote with respect to the object.
In step S509, the search tree is used to find the nearest neighbour for each of the scene features within a vote. The search is performed as shown in figure 13. Here, the scene feature is represented by "A". Each internal tree node i has a feature B_i and a threshold C_i. Each leaf node l has an item D_l. A nearest neighbour for a given feature A is found by comparing the distance between A and B_i, computed using either of equations (18) or (19) above, with the threshold C_i at each internal node and descending accordingly. Eventually, a leaf node will be selected as the nearest neighbour.
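A minimal sketch of this descent, assuming each internal node stores a vantage feature, a threshold and two children (the field names below are illustrative only), is:

```python
def descend_vp_tree(node, feature, distance):
    """Walk a vantage point tree to a leaf, returned as the nearest-neighbour candidate.

    Each internal node is assumed to hold a vantage feature 'B', a threshold 'C'
    and 'inside'/'outside' children; each leaf holds an 'item'.
    """
    while "item" not in node:
        d = distance(feature, node["B"])
        # Features closer to the vantage point than the threshold go to the
        # 'inside' subtree, the rest to the 'outside' subtree.
        node = node["inside"] if d <= node["C"] else node["outside"]
    return node["item"]

def is_match(feature, neighbour, distance, threshold):
    """Step S511: accept the nearest neighbour only if it is close enough."""
    return distance(feature, neighbour) <= threshold
```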
In step S511, the distance between the scene feature and the selected nearest neighbour is compared with a threshold. If the distance is greater than the threshold, the nearest neighbour is not considered to be a match. If the distance is less than the threshold, a match is determined. The number of matches for each vote with an object is determined, and the vote with the largest number of matches is determined to be the correct vote.
The above methods can be used in object recognition and registration.
In a first example, a plurality of training objects are provided. These may be objects represented as 3D CAD models or scanned from a 3D reconstruction method. The goal is to detect these objects in a scene where the scene is obtained by 3D reconstruction or by a laser scanner (or any other 3D sensors).
In this example, the test objects are a bearing, a block, bracket, car, cog, flange, knob, pipe and two types of piston. Here, training data in the form of point clouds of the objects were provided. If the objects were provided in the form of 3D CAD models, then the point cloud is simply the set of vertices in the CAD model.
Then point clouds were provided to the system in the form of a dataset consisting of 1000 test sets of votes, each computed from a point cloud containing a single rigid object, one of the 10 test objects.
The process explained with reference to figures 5 and 7 was used. The method of figure 7 and five variants on this method were used. These methods differ in line 6 of Algorithm 2, where different weighting strategies corresponding to different distances are adopted, as shown in Table 1. Hashing-CNT was used as the baseline method for finding $\sigma_s$ and $\sigma_t$; Hashing-CNT is the name given to the method described with reference to figure 6, where the comparison is purely based on matching dilatations without matching rotations. Table 1 shows the weighting strategies for the different methods. The functions HAQ(·), RV(·) and FAQ(·) are representations of a 3D rotation matrix.

Table 1

[Table 1 lists, for each method, the weight used in line 6 of Algorithm 2: Hashing-CNT uses a constant weight of 1, while Hashing-HAQ, Hashing-RV, Hashing-LI-RV, Hashing-FAQ and Hashing-LI-FAQ use weights derived from the HAQ, RV, left-invariant RV, FAQ and left-invariant FAQ rotation distances, respectively, between the stored rotation R and the scene rotation X_R.]

To find the best values for $\sigma_s$ and $\sigma_t$, a grid search methodology was adopted using leave-one-out cross validation. The recognition rate was maximised, followed by the registration rate. The best result for Hashing-CNT was found at $(\sigma_s; \sigma_t) = (0.111; 0.92)$, where the recognition rate is 100% and the registration rate is 86.7% (Table 2, row 2). Cross validation over the other five variants was run using the same values for $(\sigma_s; \sigma_t)$, so that their results can be compared (see Table 2). In all cases, 100% recognition rates were obtained. Hashing-LI-FAQ gave the best registration rate, followed by Hashing-HAQ, Hashing-LI-RV and Hashing-FAQ, and then by Hashing-RV. The left-invariant distances of RV and FAQ outperformed their non-invariant counterparts. The results are shown in Table 2.
[Table 2 reports the recognition and registration rates for Hashing-CNT and the five variants on the test dataset; all methods achieve a 100% recognition rate, and Hashing-LI-FAQ achieves the highest registration rate.]

In a further example, the above processes are used for point cloud registration. Here, there is a point cloud representing the scene (e.g. a room) and another point cloud representing an object of interest (e.g. a chair). Both point clouds can be obtained from a laser scanner or other 3D sensors.
The task is to register the object point cloud to the scene point cloud (e.g. finding where the chair is in the room). The solution to this task is to apply the feature detector to both point clouds, and then the above-described recognition and registration is used to find the pose of the object (the chair).
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (20)

  1. CLAIMS: 1. A method for comparing a plurality of objects, the method comprising representing at least one feature of each object as a 3D ball representation, the radius of each ball representing the scale of the feature with respect to the frame of the object, the position of each ball representing the translation of the feature in the frame of the object, the method further comprising comparing the objects by comparing the scale and translation as represented by the 3D balls to determine similarity between objects and their poses.
  2. 2. A method according to claim 1, wherein the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object and wherein comparing the object comprises comparing the scale, translation and rotation as defined by the 3D ball representations.
  3. 3. A method according to claim 1, wherein comparing the scale and translation comprises comparing a feature of a first object with a feature of a second object to be compared with the first object using a hash table, said hash table comprising entries relating to the scale and translation of the features of the second object hashed using a hash function relating to the scale and translation components, the method further comprising searching the hash table to obtain a match of a feature from the first object with that of the second object.
  4. 4. A method according to claim 3, wherein the hash function is described by: $h(X) := \eta(\phi(X_D))$, where $h(X)$ is the hash function of direct similarity $X$, $X_D := \begin{bmatrix} X_s I_3 & X_t \\ 0^T & 1 \end{bmatrix}$ is the dilatation part of the direct similarity $X$, where $X_s$ is the scale part of direct similarity $X$ and $X_t$ is the translation part of direct similarity $X$, $\phi(X_D) := (\ln X_s,\ X_t^T/X_s)^T$, and $\eta$ is a quantizer.
  5. 5. A method according to claim 3, wherein the hash table comprises entries for all rotations for each scale and translation component.
  6. 6. A method according to claim 5, wherein the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object and wherein comparing the object comprises comparing the scale, translation and rotation as defined by the 3D ball representations, the method further comprising comparing the rotations stored in each hash table entry when a match has been achieved for scale and translation components, to compare the rotations of the feature of the first object with that of the second object.
  7. 7. A method according to claim 6, wherein the rotations are compared using a cosine based distance in 3D.
  8. 8. A method according to claim 7, wherein the cosine based distance is expressed as: $d(r_a, r_b)^2 = 1 - \frac{1}{N}\sum_{j=1}^{N}\left[\frac{1 + v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}-\alpha_{b,j}) + \frac{1 - v_{a,j}\cdot v_{b,j}}{2}\cos(\alpha_{a,j}+\alpha_{b,j})\right]$, where $r_a = (v_a, \alpha_a)$ and $r_b = (v_b, \alpha_b)$ are arrays of 3D rotations represented in the axis-angle representation, $v_{a,j}$ and $\alpha_{a,j}$ respectively denote the rotation axis and the rotation angle of the j-th component of the array $r_a$, and $v_{b,j}$ and $\alpha_{b,j}$ respectively denote the rotation axis and the rotation angle of the j-th component of the array $r_b$.
  9. 9. A method according to claim 1, wherein comparing the scale and translation comprises comparing a feature of a first object with a feature of a second object to be compared with the first object using a search tree, said search tree comprising entries representing the scale and translation components of features in the second object, the scale and translation components being compared using a closed-form formula.
  10. 10. A method according to claim 9, wherein the search tree is used to locate nearest neighbours between the features of the first object and the second object.
  11. 11. A method according to claim 9, wherein the scale and translation components are compared by measuring the Poincare distance between the two features.
  12. 12. A method according to claim 11, wherein the distance measure is expressed as: $d_1(x, y) = \cosh^{-1}\!\left(1 + \frac{\|c_x - c_y\|^2 + (r_x - r_y)^2}{2\, r_x r_y}\right)$, where $d_1(x,y)$ represents the distance between two balls $x$ and $y$ that are represented by $x = (r_x; c_x)$ and $y = (r_y; c_y)$, where $r_x, r_y > 0$ denote the radii, $c_x, c_y \in \mathbb{R}^3$ denote the ball centres in 3D, and $\cosh(\cdot)$ is the hyperbolic cosine function.
  13. 13. A method according to claim 9, wherein the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object and wherein comparing the object comprises comparing the scale, translation and rotation as defined by the 3D ball representations using the formula: $d_2(x, y) = \sqrt{a_1\, d_1(x,y)^2 + a_2\, \|R_x - R_y\|_F^2}$, where $d_1(x, y) = \cosh^{-1}\!\left(1 + \frac{\|c_x - c_y\|^2 + (r_x - r_y)^2}{2\, r_x r_y}\right)$ represents the distance between two balls $x$ and $y$ that are represented by $x = (r_x; c_x)$ and $y = (r_y; c_y)$, where $r_x, r_y > 0$ denote the radii, $c_x, c_y \in \mathbb{R}^3$ denote the ball centres in 3D and $\cosh(\cdot)$ is the hyperbolic cosine function, and the two balls $x$ and $y$ are associated with two 3D orientations, represented as two 3-by-3 rotation matrices $R_x, R_y \in SO(3)$, the term $\|R_x - R_y\|_F^2$ represents a distance function between two 3D orientations via the Frobenius norm, and the coefficients $a_1, a_2 > 0$.
  14. 14. A method according to claim 9, wherein the 3D ball representations further comprise information about the rotation of the feature with respect to the frame of the object and wherein comparing the object comprises comparing the scale, translation and rotation as defined by the 3D ball representations using the formula: $d_3(x, y) = \sqrt{a_1\, d_1(x,y)^2 + a_2\, d(x,y)^2}$, where $d_1(x, y) = \cosh^{-1}\!\left(1 + \frac{\|c_x - c_y\|^2 + (r_x - r_y)^2}{2\, r_x r_y}\right)$ represents the distance between two balls $x$ and $y$ that are represented by $x = (r_x; c_x)$ and $y = (r_y; c_y)$, where $r_x, r_y > 0$ denote the radii, $c_x, c_y \in \mathbb{R}^3$ denote the ball centres in 3D and $\cosh(\cdot)$ is the hyperbolic cosine function, and the two balls $x$ and $y$ are associated with two 3D orientations, represented as two 3-by-3 rotation matrices $R_x, R_y \in SO(3)$, the term $d(x,y)^2$ represents a distance function between the two 3D orientations via a cosine based distance, and the coefficients $a_1, a_2 > 0$.
  15. 15. A method for object recognition, the method comprising: receiving a plurality of votes, wherein each vote corresponds to a prediction of an object's pose and position; for each vote, assigning 3D ball representations to features of the object, wherein the radius of each ball represents the scale of the feature with respect to the frame of the object and the position of each ball represents the translation of the feature in the frame of the object; determining the vote that provides the best match by comparing the features as represented by the 3D ball representations for each vote with a database of 3D representations of features for a plurality of objects and poses, wherein comparing the features comprises comparing the scale and translation as represented by the 3D balls; and selecting the vote with the greatest number of features that match an object and pose in said database.
  16. 16. A method according to claim 15, wherein the 3D ball representations assigned to the votes and the objects and poses in the database further comprise information about the rotation of the feature with respect to the frame of the object and wherein determining the vote comprises comparing the scale, translation and rotation as defined by the 3D ball representations.
  17. 17. A method according to claim 15, wherein receiving a plurality of votes comprises: obtaining 3D image data of an object; identifying features of said object and assigning a description to each feature, wherein each description comprises an indication of the characteristics of the feature to which it relates; comparing said features with a database of objects, wherein said database of objects comprises descriptions of features of known objects; and generating votes by selecting objects whose features match at least one feature identified from the 3D image data.
  18. 18. A method of registering an object in a scene, the method comprising: obtaining 3D data of the object to be registered; obtaining 3D data of the scene; extracting features from the object to be registered and extracting features from the scene to determine a plurality of votes, wherein each vote corresponds to a prediction of an object's pose and position in the scene, and comparing the object to be registered with the votes using a method in accordance with claim 1 to identify the presence and pose of the object to be registered.
  19. 19. A computer readable medium carrying processor executable instructions which when executed on a processor cause the processor to carry out a method according to claim 1.
  20. 20. An apparatus for comparing a plurality of objects, the apparatus comprising a memory configured to store 3D data of the objects comprising at least one feature of each object as a 3D ball representation, the radius of each ball representing the scale of the feature with respect to the frame of the object, the position of each ball representing the translation of the feature in the frame of the object, the apparatus further comprising a processor configured to compare the objects by comparing the scale and translation as represented by the 3D balls to determine similarity between objects and their poses.
GB1403826.9A 2014-03-04 2014-03-04 Methods for 3D object recognition and pose determination Active GB2523776B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1403826.9A GB2523776B (en) 2014-03-04 2014-03-04 Methods for 3D object recognition and pose determination
US14/468,733 US20150254527A1 (en) 2014-03-04 2014-08-26 Methods for 3d object recognition and registration
JP2015041618A JP2015170363A (en) 2014-03-04 2015-03-03 Method of recognition and positioning (registration) of 3d object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1403826.9A GB2523776B (en) 2014-03-04 2014-03-04 Methods for 3D object recognition and pose determination

Publications (3)

Publication Number Publication Date
GB201403826D0 GB201403826D0 (en) 2014-04-16
GB2523776A true GB2523776A (en) 2015-09-09
GB2523776B GB2523776B (en) 2018-08-01

Family

ID=50490790

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1403826.9A Active GB2523776B (en) 2014-03-04 2014-03-04 Methods for 3D object recognition and pose determination

Country Status (3)

Country Link
US (1) US20150254527A1 (en)
JP (1) JP2015170363A (en)
GB (1) GB2523776B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9969514B2 (en) * 2015-06-11 2018-05-15 Empire Technology Development Llc Orientation-based hashing for fast item orientation sensing
CN110569387B (en) * 2019-08-20 2020-12-11 清华大学 Radar-image cross-modal retrieval method based on depth hash algorithm
CN113021333A (en) * 2019-12-25 2021-06-25 沈阳新松机器人自动化股份有限公司 Object grabbing method and system and terminal equipment
CN111639623B (en) * 2020-06-09 2022-04-26 中国地质大学(武汉) Multi-scale ship-enterprise scene recognition and extraction method combined with space distance constraint

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590250A (en) * 1994-09-14 1996-12-31 Xerox Corporation Layout of node-link structures in space with negative curvature
US7583857B2 (en) * 2005-08-24 2009-09-01 Siemens Medical Solutions Usa, Inc. System and method for salient region feature based 3D multi modality registration of medical images
US7831090B1 (en) * 2006-06-30 2010-11-09 AT&T Intellecutal Property II, L.P. Global registration of multiple 3D point sets via optimization on a manifold
GB2492779B (en) * 2011-07-11 2016-03-16 Toshiba Res Europ Ltd An image processing method and system
GB2496834B (en) * 2011-08-23 2015-07-22 Toshiba Res Europ Ltd Object location method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Point cloud matching based on 3D self-similarity", Jing Huang; Suya You, Computer Soc Conf on Computer Vision and Pattern Recognition Workshops (CVPRW), 16/6/2012, pages 41-48 *
"Pose Estimation of 3D Free-Form Contours", Bodo Rosenhahn; Christian Perwass; Gerald Sommer, Int Journal of Computer Vision,10/12/2012, Vol 62, pages 267-289 *

Also Published As

Publication number Publication date
GB201403826D0 (en) 2014-04-16
US20150254527A1 (en) 2015-09-10
GB2523776B (en) 2018-08-01
JP2015170363A (en) 2015-09-28

Similar Documents

Publication Publication Date Title
Walch et al. Image-based localization using lstms for structured feature correlation
Guo et al. An accurate and robust range image registration algorithm for 3D object modeling
CN111199564B (en) Indoor positioning method and device of intelligent mobile terminal and electronic equipment
Chen et al. City-scale landmark identification on mobile devices
US10210656B2 (en) Method and apparatus for searching a database of 3D items using descriptors
CN111780764B (en) Visual positioning method and device based on visual map
Huang et al. A coarse-to-fine algorithm for matching and registration in 3D cross-source point clouds
KR101531618B1 (en) Method and system for comparing images
US10043097B2 (en) Image abstraction system
US9418313B2 (en) Method for searching for a similar image in an image database based on a reference image
CN113362382A (en) Three-dimensional reconstruction method and three-dimensional reconstruction device
CN101950351A (en) Method of identifying target image using image recognition algorithm
Peng et al. CrowdGIS: Updating digital maps via mobile crowdsensing
Son et al. A multi-vision sensor-based fast localization system with image matching for challenging outdoor environments
CN110674711A (en) Method and system for calibrating dynamic target of urban monitoring video
CN109902681B (en) User group relation determining method, device, equipment and storage medium
GB2523776A (en) Methods for 3D object recognition and registration
Huang et al. A coarse-to-fine algorithm for registration in 3D street-view cross-source point clouds
Gupta et al. Augmented reality system using lidar point cloud data for displaying dimensional information of objects on mobile phones
Bae et al. Fast and scalable 3D cyber-physical modeling for high-precision mobile augmented reality systems
Li et al. Road-network-based fast geolocalization
Sinha et al. Image retrieval using landmark indexing for indoor navigation
CN116503474A (en) Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product
CN111414802B (en) Protein data characteristic extraction method
Guan et al. GPS-aided recognition-based user tracking system with augmented reality in extreme large-scale areas