GB2566443A - Cross-source point cloud registration - Google Patents

Cross-source point cloud registration

Info

Publication number
GB2566443A
GB2566443A GB1714179.7A GB201714179A
Authority
GB
United Kingdom
Prior art keywords
coordinates
scene
points
ordinates
structures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1714179.7A
Other versions
GB201714179D0 (en)
Inventor
Huang Xiaoshui
Fan Lixin
Zhang Jian
Wu Quiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1714179.7A priority Critical patent/GB2566443A/en
Publication of GB201714179D0 publication Critical patent/GB201714179D0/en
Publication of GB2566443A publication Critical patent/GB2566443A/en
Withdrawn legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/50 Depth or shape recovery
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/56 Particle system, point based geometry or rendering
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2004 Aligning objects, relative positioning of parts
    • G06T2219/2016 Rotation, translation, scaling
    • G06T2219/2021 Shape modification

Abstract

Method comprising: identifying first and second structures in a scene from first and second sets of 3D coordinates (point clouds) produced by first and second 3D scene capture devices; deriving 3D structure coordinates for each of the first structures and each of the second structures; registering the first 3D structure coordinates with the second 3D structure coordinates by: determining a correspondence between the first and second 3D structure coordinates; and determining and applying a geometric transformation between the first 3D structure coordinates and the second 3D structure coordinates. The point cloud topology may be segmented into voxels whose centres or corner points (vertices) may form the structure coordinates. Registration may be iterative until an energy function, optimised using tensors and comprising a structural similarity score and a pixel-wise refinement score, converges to an optimal value. Triplets of co-ordinates may be selected using a wide baseline strategy and similarity values determined between them. The transformation may be an affine transformation, with rotation based on matched point pairs, scale estimation, and translation based on mean points. Scene scanning may use LIDAR (Light Detection and Ranging), structure from motion (SFM) or simultaneous localisation and mapping (SLAM). A 3D model of the scene may be constructed.

Description

Cross-Source Point Cloud Registration
Field
The present application relates to the field of computer vision. More particularly, the present application relates to creating three-dimensional (3D) models of scenes using point clouds obtained from multiple sensor sources.
Background
The reconstruction and analysis of 3D scenes from 3D point clouds captured by sensors is an important part of many 3D imaging systems.
A 3D scene capture device, such as a 3D scanner, is often used to generate a 3D point cloud. A 3D point cloud comprises a set of 3D coordinates for points that represent the scene or object captured by the 3D scene capture device. Typically, the points may be representative of an external surface of an object. The 3D point cloud can be used to create a 3D model of the scene or object captured by the device, for example to be used in automated navigation, mapping and reconstruction tasks.
Multiple 3D point clouds representative of a scene or object can be captured in order to improve the 3D modelling, such as 3D point clouds viewing the scene from different angles. Aligning the points in the different point clouds to form a globally consistent data set can allow for the captured 3D point cloud data to be fully utilised in creating a 3D model of the scene. This alignment is known as point cloud registration. The registration of point clouds with each other is a long standing and difficult challenge in computer vision, computer graphics, robotics, and medical applications.
There currently exists a wide diversity of techniques for obtaining 3D point clouds. For example, in Structure-From-Motion (SFM), three-dimensional structures are estimated from two-dimensional image sequences, where the observer and/or the objects to be observed move in relation to each other. In Light Detection And Ranging (LiDAR) methods, distances are measured by illuminating an object with a laser beam and analysing the reflected light.
When the 3D point clouds are captured using different devices, the registration problem becomes much more difficult. Each sensor may have different properties, such as having a varying density of points when compared with other sensors, a different noise model, produce different outliers, have missing data, and/or have a scale variation. Additionally, structure consistence between cross-source point clouds can be weak, increasing the challenge of cross-source point registration. This can result in 3D models based on the sensor data taking too long to produce and/or being too inaccurate for the purpose that they are required for.
Summary
According to a first aspect, the specification describes a method comprising: identifying a first plurality of structures from a first set of 3D coordinates representative of a scene produced by a first 3D scene capture device; deriving one or more first structure 3D coordinates for the first plurality of structures; identifying a second plurality of structures from a second set of 3D coordinates produced by a second 3D scene capture device representative of the same scene; deriving one or more second structure 3D coordinates for the second plurality of structures; registering one or more of the first structure 3D coordinates with one or more of the second structure 3D coordinates, wherein the registering comprises: determining a correspondence between one or more of the first structure 3D coordinates and one or more of the second structure 3D coordinates; determining a geometric transformation between the first structure 3D coordinates and the second structure 3D coordinates based on the correspondence; and applying the geometric transformation to the second structure 3D coordinates to produce transformed second structure 3D coordinates such as to more closely match the first structure 3D coordinates to the second structure 3D coordinates.
Registering the first structure 3D coordinates with the second structure 3D coordinates may further comprise iterating the registering operations until a predetermined condition is met.
The iterations may be performed until an energy function converges to an optimal value, wherein the energy function comprises a structural similarity score and a pixelwise refinement score.
The energy function may be optimised using a tensor optimisation.
The structural similarity score may be determined by: selecting a first plurality of triplets from the first structure 3D coordinates; selecting a second plurality of triplets from the second structure 3D coordinates; determining one or more similarity values between one or more of the first plurality of triplets and one or more of the second plurality of triplets; and calculating the structural similarity score, based on the one or more similarity values and the correspondence between one or more of the first structure 3D coordinates and one or more of the second structure 3D coordinates.
The first plurality of triplets and/or the second plurality of triplets may be selected using a wide baseline strategy.
The pixel-wise refinement score may be determined by: calculating a pixel-wise similarity between one or more points in the first set of structural 3D coordinates and one or more points in the second set of structure 3D coordinates; and calculating, based on the pixel-wise similarity and the correspondence between one or more points in the first set of structural 3D coordinates and one or more points in the second set of structural 3D coordinates, the pixel-wise refinement score.
Identifying the first plurality of structures may comprise segmenting the first set of 3D coordinates into a plurality of voxels.
Deriving one or more first structure 3D coordinates may comprise assigning a 3D coordinate to a centre of each of the plurality of voxels.
Identifying the second plurality of structures may comprise segmenting the second set of 3D coordinates into a plurality of voxels.
Deriving one or more second structure 3D coordinates may comprise assigning a 3D coordinate to a centre of each of the plurality of voxels.
The geometric transformation may comprise an affine transformation.
Determining the geometric transformation may comprise determining a rotation based on matched pairs of points from the first structure 3D coordinates and the second 3D structure coordinates.
Determining the geometric transformation may comprise performing a scale estimation.
Determining the geometric transformation may comprise determining a translation based on a mean point of the matched points in the first structure 3D coordinates and a mean point of the corresponding matched points in the second structure 3D coordinates.
At least one of the first and/or second 3D scene capture device may capture the first set of 3D coordinates and/or the second set of 3D coordinates using at least one of: LiDAR;
Structure-from-Motion (SFM); Simultaneous Localisation and Mapping (SLAM); and/or KinectFusion.
The first 3D scene capture device may be of a different type to the second 3D scene capture device.
The method may further comprise constructing a 3D model of a scene based on the first structure 3D coordinates and the transformed second structure 3D coordinates.
The method may further comprise applying the geometric transformation to the second set of 3D coordinates to align the second set of 3D coordinates with the first set of 3D coordinates, thereby to produce an aligned set of 3D coordinates.
The method may further comprise determining a refined geometric transformation using the aligned set of 3D coordinates.
According to a second aspect, the specification describes a system comprising: a first scene capture device for capturing a first set of 3D co-ordinates representative of a scene; a second scene capture device for capturing a second set of 3D co-ordinates representative of the scene; and a data processing system, wherein the system is configured to perform any of the methods according to the first aspect.
According to a third aspect, the specification describes apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, cause the apparatus to perform any of the methods according to the first aspect.
According to a fourth aspect, the specification describes apparatus configured to perform any of the methods according to the first aspect.
According to a fifth aspect, the specification describes computer readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any of the methods according to the first aspect.
According to a sixth aspect, the specification describes a computer readable medium having computer readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of any of the methods according to the first aspect.
List of Figures
The invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 shows a computer graphics system suitable to be used in a 3D point cloud analysis process according to an embodiment;
Figure 2 shows a schematic overview of a 3D scanning and modelling system;
Figure 3 shows a flow diagram of a method for cross-source point cloud registration;
Figure 4 shows an example of extracting structural representations for cross-source point clouds; and
Figure 5 shows a flow diagram of an embodiment of a method of registration of the point clouds in further detail.
Detailed description
System overview
Figure 1 shows a computer graphics system suitable to be used in image processing, for example in a 3D point cloud analysis process according to an embodiment. The generalized structure of the computer graphics system will be explained in accordance with the functional blocks of the system. For a skilled man, it will be obvious that several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor, if desired. A data processing system of an apparatus according to an example of Figure 1 includes a main processing unit 100, a memory 102, a storage device 104, an input device 106, an
output device 108, and a graphics subsystem 110, which all are connected to each other via a data bus 112.
The main processing unit 100 is a conventional processing unit arranged to process data within the data processing system. The memory 102, the storage device 104, the input device 106, and the output device 108 are conventional components as recognized by those skilled in the art. The memory 102 and storage device 104 store data within the data processing system 100. Computer program code resides in the memory 102 for implementing, for example, a 3D point cloud analysis process. The input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display. The data bus 112 is a conventional data bus and, while shown as a single line, it may be a combination of a processor bus, a PCI bus, a graphical bus, and an ISA bus. Accordingly, a skilled man readily recognizes that the apparatus may be any conventional data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer. The input data of the 3D point cloud analysis process according to an embodiment and means for obtaining the input data are described further below.
It should be understood that different embodiments allow different parts to be carried out in different elements. For example, various processes of the 3D point cloud analysis may be carried out in one or more processing devices; for example, entirely in one computer device, or in one server device or across multiple user devices. The elements of the 3D point cloud analysis process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.
3D point clouds are used in various image processing and computer vision applications. 3D point clouds are sets of data points in a 3D coordinate system, typically representing an external surface of an object. 3D point clouds may be obtained by a 3D capturing device, such as a 3D scanner. A large number of points are measured on the surface of an object, and the obtained point cloud may be stored in a file.
Various sensing methods for obtaining 3D point clouds have been developed. In
Structure-From-Motion (SFM), three-dimensional structures are estimated from two-dimensional image sequences, where the observer and/or the objects to be observed
move in relation to each other. The obtained geometric models are stored as 3D point clouds. In real applications, SFM uses images captured by RGB cameras to create point clouds for urban scenes and heritage objects.
In Light Detection And Ranging (LiDAR) methods, distances are measured by illuminating an object with a laser beam (e.g. ultraviolet, visible, or near-infrared light) and analysing the reflected light. The resulting data is stored as point clouds. The LiDAR point clouds may be considered a set of vertices in a three-dimensional coordinate system, wherein a vertex may be represented by a planar patch defined by a 3D vector.
Also the Microsoft Kinect® sensor can be used to obtain standard point cloud datasets.
For computer vision applications, the 3D points are typically mapped to a recognized feature of a 2D image of the object. Simultaneous localization and mapping (SLAM) of 3D point clouds refers to a problem of updating a map of an unknown environment while simultaneously localizing an observer within it.
In applications of multiple views, combining several point clouds into a globally consistent data set is typically required. The problem of matching a given set of 3D point clouds with another is a long standing open question in computer vision. This problem becomes even more challenging when two sets of points are yielded by different sensing techniques, e.g. one obtained from LiDAR while the other is obtained from Kinect, to give but two examples.
Existing 3D point cloud detection and registration methods are typically developed to work with point clouds that are yielded by the same sensing technology. Cross-sourced point cloud matching, however, imposes a number of challenges that often obstruct existing methods from working:
1. Scale difference: Most existing methods assume there is no significant scale change between two point clouds. This assumption, however, is not fulfilled for cross sourced point clouds. Even though a registration method is supposed to recover scale and rotation angles, exceedingly large variations in scales and angles are often out of the capture zones of many existing methods.
2. Density difference: Most existing methods assume that the two point clouds in question are of similar densities. This assumption, again, is not fulfilled for cross-sourced point clouds. Usually, the LiDAR point cloud is much denser than, for example, the SFM point cloud. Large variations in densities of cross-sourced point clouds often lead to the failure of existing registration methods.
3. Missing data: due to the different nature of sensing techniques, cross-sourced point clouds of the same object may suffer from missing data corresponding to different parts of the object. For instance, this problem is pronounced for point clouds created by SFM as it is unable to generate points in uniform image regions.
Figure 2 shows a schematic overview of a 3D scanning and modelling system. A first 3D scene capture device 114 and a second 3D scene capture device 116 are used to capture a 3D image of one or more real objects 118. In this example, the real object 118 comprises a scene inside a building. The first and second 3D scene capture devices 114, 116 are operable to capture an image of the real object 118 in the form of a first set of 3D co-ordinates 120 and a second set of 3D co-ordinates 122 (herein also referred to as the first and second point clouds respectively). In general, further 3D scene capture devices can be used to image the real object in order to generate additional point clouds representing the real object 118. The sets of 3D co-ordinates 120, 122 are combined using a point cloud registration process to produce an aligned set of 3D co-ordinates 124 that better represents the scene or object being captured. The aligned set of 3D co-ordinates can then be used to produce a 3D model of the real object 118.
In general, the first and second 3D capture devices 114, 116 will not be of the same type. The sets of 3D co-ordinates captured by each device 114, 116 may then have different properties which make their alignment difficult. The problem of aligning and matching points within sets of 3D co-ordinates captured from different 3D capture devices (herein also referred to as cross-source point clouds) is known as cross-source point cloud registration.
Due to variations in different types of 3D capture devices, for example the density of 3D co-ordinates captured, the presence of noise, various outliers, missing data and scale variation, the structure consistency between cross-source point clouds can be very weak making the problem of cross-source point cloud registration challenging.
Overview of a cross-source point cloud registration method
Figure 3 shows a flow diagram of a method for cross-source point cloud registration.
The method may comprise processing operations performed under program control.
An aim is to find a geometric transformation, T, that aligns the first set of 3D coordinates captured by a first scene capture device with a second set of 3D co-ordinates captured by a second scene capture device. In the embodiment described in relation to Figure 2, two point clouds, each captured by a different device, are used, but in general any number of two or more devices can be used to capture respective point clouds representing a scene. In embodiments where more than two point clouds are captured, a single point cloud is chosen to which the additional point clouds can be aligned individually.
It will be appreciated from the following that certain operations may be omitted and/or re-ordered. Certain operations may be performed in parallel.
Initially, in an operation 126, the first 3D scene capture device captures a first set of 3D co-ordinates representative of points of a real object being captured. In a further operation 128, a second 3D scene capture device also captures a second set of 3D coordinates representative of points of the real object being captured. In general, each set of 3D co-ordinates will be captured by their respective device from a different perspective, e.g. from a different spatial location and/or angle. In this embodiment, each set of 3D co-ordinates represents a point cloud captured by a different type of 3D scene capture device. However, the sets of 3D co-ordinates can, in some embodiments, be captured by the same device or type of device, but under different conditions.
The first and second sets of 3D co-ordinates provide a first point cloud C1, containing N1 3D co-ordinates, and a second point cloud C2, containing N2 3D co-ordinates, respectively.
Representative points of salient structures within each point cloud are then extracted from the two cross-source point clouds. The salient structure extraction allows the density variation in each of the captured sets of 3D co-ordinates to be accounted for.
In a further operation 130, relating to preferred embodiments, the salient structure extraction begins with identifying a first plurality of structures (herein also referred to as salient structures) in the first set of 3D co-ordinates. To identify the salient structures, a segmentation method is used to segment the first set of 3D co-ordinates into a plurality of voxels based on their geometrical topology, as described below in relation to Figure 4.
In a further operation 132, a first set of structure 3D co-ordinates is then derived from the first plurality of structures. The first set of structure 3D co-ordinates is derived by assigning each identified structure one or more 3D co-ordinates based on its shape and position. For example, the central point of each voxel can be used to represent the structure. In other examples, the corner points of each voxel can be used to represent the structure.
In a further operation 134, a second plurality of structures (herein also referred to as salient structures) is identified from the second set of 3D co-ordinates. To identify the salient structures, a segmentation method is used to segment the second set of 3D co-ordinates into a plurality of voxels based on their geometrical topology as described below in relation to Figure 4. In some embodiments, the same segmentation method used in relation to the first set of 3D co-ordinates is used.
In a further operation 136, a second set of structure 3D co-ordinates is then derived from the second plurality of structures. The second set of structure 3D co-ordinates is derived by assigning each identified structure a 3D co-ordinate based on its central point. For example, the central point of each voxel can be used to represent the structure. Identification of the second plurality of structures can be performed in parallel with or in series with the identification of the first plurality of structures.
The first and second sets of structure 3D co-ordinates provide a first structure point cloud C1′, containing M1 3D co-ordinates, and a second structure point cloud C2′, containing M2 3D co-ordinates, respectively. The first and second structure point clouds are representative of the macro-structure of the captured scene or image.
The sets of structure 3D co-ordinates are used to iteratively find a correspondence (also herein referred to as an assignment), X, between points in each of the sets of 3D structure co-ordinates, and the geometric transformation, T, that aligns one set of structure 3D co-ordinates with another. In the example described here, iterations of the method continue until an objective function converges to an optimal value. However, in other embodiments, such as RANSAC, the iterations are repeated until a maximum iteration number is reached. In yet further embodiments, such as KNN, the iterations continue until all points in the cloud are covered. In other embodiments, the
correspondence can be found by non-iterative methods such as the Hungarian algorithm.
In a further operation 138, the first operation of each iteration is to identify any correspondence between points in the first set of structure 3D co-ordinates and points in the second set of structure 3D co-ordinates. The correspondence is determined by iteratively optimising an objective function with respect to the correspondence, X, as described in more detail below in relation to Figure 5. The correspondence, X, is an M1 x M2 matrix with elements X_ij = 1 if point P_i in point cloud C1′ is matched with point P_j in point cloud C2′ and X_ij = 0 otherwise.
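As a minimal illustration (not part of the patent disclosure), the correspondence can be held as a binary M1 x M2 matrix and flattened column-wise into vec(X) for the tensor formulation described later; the sizes and matched index pairs below are hypothetical:

```python
# Minimal sketch: building the correspondence matrix X and vec(X) for two small
# structure point clouds. Sizes and matches are illustrative only.
import numpy as np

M1, M2 = 4, 5                       # number of structure points in C1' and C2'
X = np.zeros((M1, M2))              # X_ij = 1 iff point i in C1' matches point j in C2'
matches = [(0, 2), (1, 0), (3, 4)]  # hypothetical matched index pairs
for i, j in matches:
    X[i, j] = 1.0

x = X.flatten(order="F")            # vec(X): concatenate the columns, length M1*M2
print(X.shape, x.shape)             # (4, 5) (20,)
```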
Following the identification of a correspondence between the points in the first set of structure 3D co-ordinates and points in the second set of structure 3D co-ordinates (operation 138), in a further operation 140 the identified points are used to determine a geometric transformation that more closely aligns the points in the second set of structure 3D co-ordinates and points in the first set of structure 3D co-ordinates. In a further operation 142, the determined geometric transformation is then applied to the second set of structure 3D co-ordinates to produce an updated set of structure 3D co-ordinates. In some embodiments, to account for scale and origin variations between point clouds, the geometric transformation is an affine transformation.
In a further operation 144, it is determined whether the updated set of second structure 3D co-ordinates is within a threshold of an optimal value of the objective function. If so, then in a further operation 146, the updated set of second 3D structure co-ordinates is output, along with the geometric transformation that produces the updated set of second 3D structure co-ordinates from the original set of 3D structure co-ordinates. The determined geometric transformation can be applied to the original second set of 3D co-ordinates to produce aligned point clouds.
Otherwise, further iterations of the registration process are performed, for example by returning to operation 138, each iteration using the updated set of second 3D structure co-ordinates obtained from the previous iteration.
In some embodiments, a further refinement of the alignment of the first and second sets of 3D co-ordinates is performed after using the determined geometric
transformation to align them. A refined geometric transformation is determined using the aligned first and second sets of 3D co-ordinates determined by the above described method. The Iterative Closest Point (ICP) method is an example of a method that can be applied to the aligned first and second set of 3D co-ordinates to determine the refined geometric transformation. The refined geometric transformation can be used to more closely align the first and second sets of 3D co-ordinates.
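For context, a minimal point-to-point ICP refinement sketch follows. It assumes one common ICP formulation (nearest-neighbour matching plus an SVD-based rigid update) rather than the exact refinement procedure of this disclosure, and the function name is illustrative:

```python
# Minimal point-to-point ICP sketch (an assumption, not the patent's procedure):
# alternate nearest-neighbour matching with a Kabsch/SVD rigid update.
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(src, dst, iters=30):
    """Refine the alignment of src (N,3) onto dst (M,3); returns rotation R and translation t."""
    tree = cKDTree(dst)                      # dst is fixed, so build its KD-tree once
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = src @ R.T + t                # apply the current estimate to the source points
        _, idx = tree.query(moved)           # nearest neighbour in dst for every moved point
        A, B = moved, dst[idx]
        ca, cb = A.mean(axis=0), B.mean(axis=0)
        U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
        R_step = Vt.T @ U.T                  # Kabsch rotation mapping A onto B
        if np.linalg.det(R_step) < 0:        # guard against a reflection
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = cb - R_step @ ca
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```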
Salient structure extraction
The salient structure extraction will now be described in more detail. Figure 4 shows an example of extracting structural representations for cross-source point clouds. Any two given point-cloud sets of the same scene or object captured by different devices will hold an intrinsic structural similarity. However, the intrinsic structural similarity may be weak and degraded by the noise characteristics, inconsistent point cloud density and/or outliers. Using a structural representation can overcome these issues.
The first set of 3D co-ordinates 120 representing a scene that have been captured by a first device will in general have different properties to the second set of 3D co-ordinates 122 representing the same scene captured by a second device. In the example shown here, the scene is a building. In this example, the first set of 3D co-ordinates 120 has a lower, but more uniform, density than the second set of co-ordinates 122, as well as being less noisy.
Structures can be extracted from the sets of 3D co-ordinates using a variety of methods. Examples include graph-based approaches, such as Markov Random Fields and
Conditional Random Fields, and clustering algorithms, such as K-means clustering. In some embodiments, the use of Voxel Cloud Connectivity Segmentation (as described in “Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds” by Papon et al. [DOI: 10.1109/CVPR.2013.264]) provides a robust method of structure extraction.
The VCCS provides an over-segmentation algorithm which uses voxel relationships to produce over-segmentations which are fully consistent with the spatial geometry of the scene in 3D, rather than projective, space. Enforcing the constraint that segmented regions must have spatial connectivity prevents label flow across semantic object boundaries which might otherwise occur. A segmented radius is defined as the radius of the minimum sphere containing all of the points of a segmented point cloud.
An interval is in some embodiments defined as 1% of the diameter of the sphere containing the point cloud.
Whichever method is used, the two point clouds 120, 122 are segmented into a plurality of voxels. In some embodiments, the centres of each of the voxels are used to represent the identified structures. The result is a first set of structural 3D co-ordinates 148 representing the salient features in the first set of 3D co-ordinates and a second set of structural 3D co-ordinates 150 representing the salient features in the second set of 3D co-ordinates. The first and second sets of structure 3D co-ordinates provide a first structure point cloud C1′, containing M1 3D co-ordinates, and a second structure point cloud C2′, containing M2 3D co-ordinates, respectively.
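A much simplified stand-in for this extraction step is sketched below: instead of VCCS, a regular voxel grid is used and the centroid of the points falling in each occupied voxel is kept as that voxel's structure co-ordinate. The function name and the use of the centroid rather than the exact voxel centre are assumptions:

```python
# Simplified stand-in for salient-structure extraction (not VCCS itself):
# bin the points into a regular voxel grid and keep one representative
# co-ordinate per occupied voxel.
import numpy as np

def voxel_structure_points(points, voxel_size):
    """points: (N, 3) array; returns (M, 3) structure co-ordinates with M <= N."""
    keys = np.floor(points / voxel_size).astype(np.int64)       # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)   # map points to occupied voxels
    counts = np.bincount(inverse).astype(float)
    centres = np.zeros((inverse.max() + 1, 3))
    for d in range(3):                                           # per-voxel centroid, axis by axis
        centres[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return centres
```

The voxel size could, for example, be tied to the interval heuristic mentioned above (1% of the diameter of the sphere containing the point cloud).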
Registration
General method
To conduct the optimization, weak regional affinity and pixel-wise refinement can be assembled into a unified high-order tensor based graph. The weak regional affinities could be pair-wise or triplet constraints that are stored in second or third order tensors. The pixel-wise refinement is a point-point or point-plane residual error that is stored in a first-order tensor. To obtain a robust geometric transformation, the geometric transformation is integrated into the tensor optimization. An objective function (herein also referred to as an energy function) is used to obtain an optimal geometric transformation.
In general, the registration method needs to optimize a geometric transformation, T, and correspondence, X, by considering both similarity and constraints. The similarity and the constraints can be formulated into tensors so that the optimization for registration becomes a tensor optimization problem. A unified framework can be constructed using an objective function S(X, T):
S(X, T) = λ3 H3(T) ⊗3 X ⊗2 X ⊗1 X + λ2 H2(T) ⊗2 X ⊗1 X + λ1 H1(T) ⊗1 X - λ0 φ(T),

where the tensor H3 embeds similarity between points in the two point clouds into a triplet constraint, the tensor H2 embeds similarity between points in the two point clouds into a pair-wise constraint, and the tensor H1 embeds similarity between points in the two point clouds into a pixel-pixel distance. Further constraints on T unrelated to the point cloud correspondence can be introduced using the additional constraint φ(T).
The variables {λ_i} provide different weights to the terms and can be adjusted depending on the required application.
The objective function, S(X, T), can be iteratively optimised to find a suitable correspondence and geometric transformation. As used herein, the term optimising is preferably used to connote the approximate local or global maximising or minimising of a function. For example, the objective function can be considered to be optimised when its value is within a threshold of a maximum of the energy function.
Firstly, from an initial geometric transformation (which is in some embodiments no transformation of the initially captured point clouds), an optimal correspondence, X, is obtained from the objective function. The optimal correspondence can be obtained using, for example, the tensor power iteration method. From this correspondence, X, an optimal geometric transformation is then obtained. The tensors H_i (i = 1, 2, 3) are then updated using the optimal geometric transformation. That is, the tensor space is updated.
Following the tensor update, a new optimal correspondence, X, in the new tensor space is obtained. This new correspondence is obtained using the same method as determining the initial correspondence. In some embodiments, the power iteration method is used.
In summary, the correspondence is determined from a current tensor space and is used to optimize a geometrical transformation, T. The geometrical transformation is then used to find a more optimal tensor space. These operations are carried out iteratively and the objective function will be optimised until convergence to within a threshold.
A number of registration methods can be derived from the general objective function described in Table 1. For example, if λ2 = λ3 = 0, the Iterative Closest Point (ICP) method is obtained. Setting only λ3 = 0 gives ICP that additionally considers a pair-wise constraint.
Specific embodiment
Figure 5 shows a flow diagram of an embodiment of a method of registration of the point clouds in further detail. This method illustrates a possible implementation of operations 138 and 140 shown in, and described with reference to, Figure 3.
It will be appreciated from the following that certain operations may be omitted and/or re-ordered. Certain operations may be performed in parallel.
In the embodiment described below, λ2 = 0, obtaining ICP that also accounts for triplet constraints. This is referred to herein as the Geometric Constraint Tensor-based Registration (GCTR) method. The GCTR considers local scale-invariant geometrical properties by using a third-order tensor and global properties by using a first-order geometrical constraint. It achieves both efficiency and high accuracy in the registration problem.
The input to the method is two point clouds, C1′, containing M1 3D co-ordinates, and C2′, containing M2 3D co-ordinates. In the embodiment described here, these point clouds are the first and second structure 3D co-ordinates determined by the above described salient structure extraction method. However, in general, the method is applicable to any two point clouds. In some embodiments, the method is applied directly to the sets of 3D co-ordinates obtained from the 3D scene capture devices, with no salient structure extraction being performed.
At an operation 152, triplets of points {P_i, P_j, P_k} forming a triangle are selected from the first structural 3D co-ordinates. The points are selected randomly, but can alternatively be selected using a defined strategy. Triplets of points {P_i′, P_j′, P_k′} forming a triangle are also selected from the second structural 3D co-ordinates using the same method.
An example of a defined strategy used in some embodiments is as follows. The triangles are constructed from an ordered list of points in the point cloud. The first 3 points are formed into a triangle, then the second 3 points into a triangle, and so on. The algorithm is then used to find the potential matched triangles. In some embodiments using this method, the number of triangles N is the same as the number of points in the point cloud with fewer points.
In some embodiments, the number of points selected depends on the number of salient structures determined to be present in each of the point clouds. If the first set of 3D structure co-ordinates contains N1 salient structures and the second set of 3D structure
co-ordinates contains N2 salient structures, then in some embodiments between ½N1N2 and 5N1N2 triplets are selected. In some embodiments, N1N2 triplets are selected.
The triplets are, in some embodiments, selected using a wide baseline strategy. This can lead to an improvement in global alignment of the point clouds. The triangles are randomly selected, but only triangles with one or more edges larger than 50% of the diameter of the point cloud are kept. The wide baseline strategy can be summarised as follows.
The input is a 3D point cloud of points, P. The output will be N triangles (or triplets) formed from points in P. At a first step, 3 points in P are randomly selected. The points are then converted into a triangle by ordering them to form edges (e.g. an order of 1-2-3 defines edges 1-2, 2-3 and 3-1).
The edge lengths of the three edges of the triangle are then computed. If the edge lengths are greater than 50% of the diameter of the 3D containing voxels, then the triangle is kept and will form one of the triangles used in the registration method.
The method then returns to the first step and selects a further random 3 points from P.
These points may include points already assigned to triplets by the method. Once a number of triangles N have been selected, the triplet selection ends.
If a selected triplet does not have one or more edges that are greater than 50% of the diameter of the 3D containing voxels, then the selected triplet will not be used in the registration method. The triplet selection then returns to the first step.
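A sketch of this selection loop follows, with assumed names; the bounding-box diagonal is used here as a stand-in for the point cloud diameter, and a real implementation would bound the number of attempts:

```python
# Sketch of the wide-baseline triplet selection: random triplets are kept only if
# at least one edge exceeds half the (approximate) point-cloud diameter.
import numpy as np

def select_wide_baseline_triplets(P, n_triplets, rng=np.random.default_rng(0)):
    """P: (N, 3) structure points; returns an (n_triplets, 3) array of point indices."""
    diameter = np.linalg.norm(P.max(axis=0) - P.min(axis=0))   # bounding-box diagonal as a proxy
    triplets = []
    while len(triplets) < n_triplets:                          # a real implementation would cap attempts
        i, j, k = rng.choice(len(P), size=3, replace=False)
        edges = [np.linalg.norm(P[i] - P[j]),
                 np.linalg.norm(P[j] - P[k]),
                 np.linalg.norm(P[k] - P[i])]
        if max(edges) > 0.5 * diameter:                        # wide-baseline test
            triplets.append((i, j, k))
    return np.array(triplets)
```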
In a further operation 154, from the two sets of triplets selected in this way, a triplet constraint tensor, H^(3), and a pixel-pixel distance tensor, H^(1), are constructed. In this embodiment, these are defined as:
H^(3)_ii′,jj′,kk′(T) = exp(-α ||f_ijk - T(f_i′j′k′)||^2),

H^(1)_ii′(T) = exp(-||P_i - T(P_i′)||^2),

where f_ijk is a feature vector consisting of the cosines of the three inner angles of the triangle formed by the triplet of 3D points. For the computation of the cosines of the inner angles, dot vector multiplication can be used. T(f_i′j′k′) is the feature vector for a triangle selected from the second structural 3D points that has had the geometric transformation, T, applied to it. α is a constant. P_i is a point in the first set of structural 3D co-ordinates and P_i′ is a point in the second set of structural 3D co-ordinates.
The skilled person will recognise that other definitions of the triplet constraint tensor and pixel-pixel distance tensor are possible. For example, the triplet constraint tensor can be the three coordinate differences of the corresponding nodes. In some embodiments, the pixel-pixel distance is the 3D feature distance of these two pixels.
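The triplet feature described above can be sketched as follows (function names and the value of α are assumptions); each triangle is summarised by the cosines of its three inner angles, and a pair of triangles is scored with the exponential of their squared feature distance:

```python
# Sketch of the triplet feature f_ijk (cosines of the three inner angles) and the
# corresponding tensor entry exp(-alpha * ||f1 - f2||^2). Names and alpha are assumed.
import numpy as np

def triangle_feature(p_i, p_j, p_k):
    def cos_angle(a, b, c):                       # cosine of the inner angle at vertex a
        u, v = b - a, c - a
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return np.array([cos_angle(p_i, p_j, p_k),
                     cos_angle(p_j, p_k, p_i),
                     cos_angle(p_k, p_i, p_j)])

def triplet_similarity(f1, f2, alpha=1.0):
    return float(np.exp(-alpha * np.sum((f1 - f2) ** 2)))
```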
Following the construction of the triplet constraint tensor and pixel-pixel distance tensor, a correspondence between the first structure 3D co-ordinates and the second structure 3D co-ordinates is iteratively determined. Operations 156 to 162 of the method are iterated until convergence of the objective function S(X, T) is achieved.
To determine the correspondence, the tensors constructed at operation 154 are used to define the objective (or energy) function, S(X, T):
S(X, T) = Σ_{ii′,jj′,kk′} H3_ii′,jj′,kk′(T) X_ii′ X_jj′ X_kk′ + Σ_{ii′} H1_ii′ X_ii′
        = H3(T) ⊗3 X ⊗2 X ⊗1 X + H1 ⊗1 X,

where X_ii′, X_jj′, X_kk′ are defined to be the affinity matrices of the three node pairs between two triplets {P_i, P_j, P_k} selected from the first set of structure 3D co-ordinates and {P_i′, P_j′, P_k′} selected from the second set of 3D structure co-ordinates. X = vec(X) is an M1M2-dimensional vector form of X obtained by concatenating the columns of X_ij. H3 = vec(H^(3)) is a three-dimensional tensor with (M1M2)^3 components, where each element represents the similarity of two triplets. H1 = vec(H^(1)) is an M1M2-dimensional vector representing the pixel-wise similarity. H^(1)_ii′ is a node-node similarity of the two point clouds.
The first term in equation (8) is referred to as the salient structure similarity score. For the m-th iteration of operations 156 to 162 of the method, this is denoted by β^(m). The second term in equation (8) is referred to as the pixel-wise refinement score. For the m-th iteration of operations 156 to 162 of the method, this is denoted by γ^(m).
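For illustration only, the two terms can be evaluated as below when H3 is held sparsely as a list of triplet entries over vec(X) and H1 as a dense vector; this storage scheme is an assumption, not the patent's implementation:

```python
# Sketch: evaluate the structural similarity term and the pixel-wise refinement
# term of the objective, with H3 stored sparsely over vec(X).
import numpy as np

def objective(x, H3_entries, H1):
    """x: vec(X), length M1*M2; H3_entries: list of ((a, b, c), value) index triples into x."""
    structural = sum(v * x[a] * x[b] * x[c] for (a, b, c), v in H3_entries)  # first term
    pixelwise = float(H1 @ x)                                                # second term
    return structural + pixelwise
```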
A structural similarity score will in general be a measure of how closely aligned the structural content of the first point cloud and second point cloud are. In some embodiments, the form of the structural similarity score is chosen such that it is maximised when structures in the first point cloud are aligned with structures in the second point cloud. Alternatively, the form of the structural similarity score is chosen such that it is minimised when structures in the first point cloud are aligned with structures in the second point cloud. In the embodiment described above, the structures chosen are triangles, though the skilled person will recognise that other examples are possible.
In some alternative embodiments, the structural similarity score may be a structural similarity index (SSIM).
A pixel-wise similarity score will in general be a measure of the overall distance between points in the first point cloud and points in the second point cloud. In some embodiments, the form of the pixel-wise similarity score is chosen such that it is maximised when points in the first point cloud are aligned with points in the second point cloud. Alternatively, the form of the pixel-wise similarity score is chosen such that it is minimised when points in the first point cloud are aligned with points in the second point cloud.
During the m-th iteration of operations 156 to 162 of the method, the salient structure similarity score and the pixel-wise refinement score are determined/updated 156. The values of the current affinity matrix are used along with the tensors constructed at operation 154 to calculate these terms. From these, the objective function S(X, T) is calculated 158.
An updated correspondence is then determined 160. During the m-th iteration, the (m + 1)-th correspondence is determined. A normalisation is also applied to make the assignment matrix double-stochastic. The updated correspondence can be determined using:

X′^(m+1) = H3(T) ⊗3 X^(m) ⊗2 X^(m) + H1,

X^(m+1) = X′^(m+1) / ||X′^(m+1)||_2.
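A sketch of one such update, under the assumption that H3 is stored as a list of symmetric triplet entries over vec(X), is given below; it mirrors a tensor power-iteration step followed by an l2 re-normalisation (any double-stochastic projection would follow separately):

```python
# Sketch of one correspondence update: contract the triplet tensor with the current
# vec(X) twice, add the pixel-wise term H1, and re-normalise. Each stored unordered
# triple is assumed to contribute to all three of its indices (symmetric tensor).
import numpy as np

def update_correspondence(x, H3_entries, H1):
    """x: current vec(X); H3_entries: list of ((a, b, c), value); H1: dense vector."""
    x_new = H1.copy()
    for (a, b, c), v in H3_entries:
        x_new[a] += v * x[b] * x[c]
        x_new[b] += v * x[a] * x[c]
        x_new[c] += v * x[a] * x[b]
    return x_new / np.linalg.norm(x_new)          # l2 re-normalisation
```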
Once the new correspondence has been determined, the convergence of the energy function is checked 162. If the difference between the updated energy function and the energy function from the previous iteration is not within a threshold value, the method returns to operation 156.
If the difference between the updated energy function and the energy function from the previous iteration is within a threshold value, the updated correspondence is used to determine an updated geometric transformation. The geometric transformation comprises at least one of: a rotation (R); a translation (t); and/or a scaling (s).
Given a first set of structure 3D co-ordinates and a second set of structure 3D co-ordinates, with a correspondence X between them, ordered sets of points, A and B, can be constructed. Set A comprises points from the first set of structure 3D co-ordinates that correspond to points in the second set of structure 3D co-ordinates. Set B comprises points from the second set of structure 3D co-ordinates that correspond to points in the first set of structure 3D co-ordinates. The sets are constructed such that point A_i in set A corresponds to point B_i in set B.
The scaling, s, can be calculated from A and B using:

s = ( Σ_{i=1}^{n-1} r_ai / r_bi ) / (n - 1),

where r_ai is the distance between point A_i and point A_{i+1} and r_bi is the distance between point B_i and point B_{i+1}. n is the total number of corresponding pairs.
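This scale estimate can be written directly, assuming A and B are (n, 3) arrays of matched, identically ordered points (and that consecutive points in B are not coincident):

```python
# Sketch of the scale estimate: the average ratio of consecutive segment lengths.
import numpy as np

def estimate_scale(A, B):
    ra = np.linalg.norm(np.diff(A, axis=0), axis=1)   # |A_i - A_{i+1}|, i = 1..n-1
    rb = np.linalg.norm(np.diff(B, axis=0), axis=1)   # |B_i - B_{i+1}|, i = 1..n-1
    return float(np.mean(ra / rb))                    # ( sum of ratios ) / (n - 1)
```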
The rotation, R, can be determined from A and B using:
U D V^T = svd(A B^T),

R = U V^T.
From the scaling and rotation, the translation, t, can be determined using:
t = U_A - s * R * U_B, where U_A is the mean point of A and U_B is the mean point of B.
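A sketch of the rotation and translation estimates, under the reading of the SVD step given above and with the correspondences centred before the SVD (an assumption), is:

```python
# Sketch: rotation from the SVD of the cross-covariance of centred correspondences,
# translation from the mean points, as in the formulas above.
import numpy as np

def estimate_rotation_translation(A, B, s=1.0):
    """A, B: (n, 3) matched point sets; s: scale, e.g. from estimate_scale()."""
    UA, UB = A.mean(axis=0), B.mean(axis=0)
    H = (A - UA).T @ (B - UB)                 # 3x3 cross-covariance, i.e. (A B^T) on centred sets
    U, _, Vt = np.linalg.svd(H)
    R = U @ Vt                                # R = U V^T, mapping B towards A
    if np.linalg.det(R) < 0:                  # guard against an improper rotation
        Vt[-1] *= -1
        R = U @ Vt
    t = UA - s * R @ UB                       # translation from the mean points
    return R, t
```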
At operation 142, the determined transformation is then applied to the second set of 3D structure co-ordinates to produce an updated set of second structure 3D co-ordinates.
Once the optimal transformation and correspondence have been determined, they are output for use at operation 146, for example for use in 3D model building. The point cloud registration method can be used, in general, to localize camera positions with respect to captured 3D objects, and to recover depth maps of camera images.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or “computer-readable medium” may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc., or a “processor” or “processing circuitry” etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed function device, gate array, programmable logic device, etc.
As used in this application, the term ‘circuitry’ refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry) and (b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or
server, to perform various functions) and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of ‘circuitry’ applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 3 and 5 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims (25)

Claims
1. A method comprising:
identifying a first plurality of structures from a first set of 3D coordinates
representative of a scene produced by a first 3D scene capture device;
deriving one or more first structure 3D coordinates for the first plurality of structures;
identifying a second plurality of structures from a second set of 3D coordinates produced by a second 3D scene capture device representative of the same scene;
deriving one or more second structure 3D coordinates for the second plurality of structures;
registering one or more of the first structure 3D coordinates with one or more of the second structure 3D coordinates, wherein the registering comprises:
determining a correspondence between one or more of the first structure 3D coordinates and one or more of the second structure 3D coordinates;
determining a geometric transformation between the first structure 3D coordinates and the second structure 3D coordinates based on the correspondence; and applying the geometric transformation to the second structure 3D
coordinates to produce transformed second structure 3D coordinates such as to more closely match the first structure 3D coordinates to the second structure 3D coordinates.
2. The method of claim 1, wherein registering the first structure 3D coordinates
with the second structure 3D coordinates further comprises iterating the registering operations until a predetermined condition is met.
3. The method of claim 2, wherein the iterations are performed until an energy function converges to an optimal value, wherein the energy function comprises a
structural similarity score and a pixel-wise refinement score.
4. The method of claim 3, wherein the energy function is optimised using a tensor optimisation.
5. The method of any of claims 3 or 4, wherein the structural similarity score is
selecting a first plurality of triplets from the first structure 3D coordinates; selecting a second plurality of triplets from the second structure 3D coordinates; determining one or more similarity values between one or more of the first plurality of triplets and one or more of the second plurality of triplets; and
calculating the structural similarity score, based on the one or more similarity values and the correspondence between one or more of the first structure 3D coordinates and one or more of the second structure 3D coordinates.
6. The method of claim 5, wherein at least one of the first plurality of triplets
and/or the second plurality of triplets is selected using a wide baseline strategy.
7. The method of any of claims 3 to 6, wherein the pixel-wise refinement score is determined by:
calculating a pixel-wise similarity between one or more points in the first set of structural 3D coordinates and one or more points in the second set of structure 3D coordinates; and calculating, based on the pixel-wise similarity and the correspondence between one or more points in the first set of structural 3D coordinates and one or more points in the second set of structural 3D coordinates, the pixel-wise refinement score.
8. The method of any preceding claim, wherein identifying the first plurality of structures comprises segmenting the first set of 3D coordinates into a plurality of voxels.
9. The method of claim 8, wherein deriving one or more first structure 3D coordinates comprises assigning a 3D co-ordinate to a centre of each of the plurality of voxels.
10. The method of any preceding claim, wherein identifying the second plurality of
structures comprises segmenting the second set of 3D coordinates into a plurality of voxels.
11. The method of claim 10, wherein deriving one or more second structure 3D coordinates comprises assigning a 3D co-ordinate to a centre of each of the plurality of
voxels.
12. The method of any preceding claim, wherein the geometric transformation comprises an affine transformation.
13. The method of any preceding claim, wherein determining the geometric
transformation comprises determining a rotation based on matched pairs of points from the first structure 3D coordinates and the second 3D structure coordinates.
14. The method of any preceding claim, wherein determining the geometric transformation comprises performing a scale estimation.
15. The method of any preceding claim, wherein determining the geometric transformation comprises determining a translation based on a mean point of the matched points in the first structure 3D coordinates and a mean point of the corresponding matched points in the second structure 3D coordinates.
16. The method of any preceding claim, wherein at least one of the first and/or second 3D scene capture device captures the first set of 3D coordinates and/or the second set of 3D coordinates using at least one of: LiDAR; Structure-from-Motion (SFM); Simultaneous Localisation and Mapping (SLAM); and/or KinectFusion.
17. The method of any preceding claim, wherein the first 3D scene capture device is of a different type to the second 3D scene capture device.
18. The method of any preceding claim, wherein the method further comprises
constructing a 3D model of a scene based on the first structure 3D coordinates and the transformed second structure 3D coordinates.
19. The method of any preceding claim, further comprising:
applying the geometric transformation to the second set of 3D coordinates to
align the second set of 3D coordinates with the first set of 3D coordinates, thereby to produce an aligned set of 3D coordinates.
20. The method of claim 19, further comprising determining a refined geometric transformation using the aligned set of 3D coordinates.
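Claims 19 and 20 can be pictured as below: the estimated transformation is applied to every point of the second set, and a refined transformation is then re-estimated from the aligned data. Reusing the estimate_similarity_transform sketch above and pairing points by nearest neighbour are assumptions; the claims do not prescribe a particular refinement scheme.

```python
# Illustrative alignment of the full second set and one refinement pass (claims 19-20),
# reusing the estimate_similarity_transform sketch given earlier.
import numpy as np

def apply_similarity(points, rot, scale, trans):
    """Map every 3D coordinate through the estimated rotation, scale and translation."""
    return scale * (points @ rot.T) + trans

def refine_transform(first_pts, aligned_second_pts):
    """Pair each aligned point with its nearest first point and re-estimate (R, s, t)."""
    nearest = np.array([first_pts[np.argmin(np.sum((first_pts - p) ** 2, axis=1))]
                        for p in aligned_second_pts])
    return estimate_similarity_transform(nearest, aligned_second_pts)
```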
21. A system comprising:
a first scene capture device for capturing a first set of 3D co-ordinates representative of a scene;
a second scene capture device for capturing a second set of 3D co-ordinates representative of the scene; and
a data processing system, wherein the system is configured to perform the method of any preceding claim.
22. Apparatus comprising:
at least one processor; and
at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform the method of any of claims 1 to 20.
23. Apparatus configured to perform the method of any of claims 1 to 20.
24. A computer readable medium having computer readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of the method of any of claims 1 to 20.
25. Computer readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform the method of any of claims 1 to 20.
GB1714179.7A 2017-09-05 2017-09-05 Cross-source point cloud registration Withdrawn GB2566443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1714179.7A GB2566443A (en) 2017-09-05 2017-09-05 Cross-source point cloud registration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1714179.7A GB2566443A (en) 2017-09-05 2017-09-05 Cross-source point cloud registration

Publications (2)

Publication Number Publication Date
GB201714179D0 GB201714179D0 (en) 2017-10-18
GB2566443A true GB2566443A (en) 2019-03-20

Family

ID=60050599

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1714179.7A Withdrawn GB2566443A (en) 2017-09-05 2017-09-05 Cross-source point cloud registration

Country Status (1)

Country Link
GB (1) GB2566443A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288050A (en) * 2019-07-02 2019-09-27 广东工业大学 Hyperspectral and LiDAR image automatic registration method based on clustering and optical flow method
CN110823211A (en) * 2019-10-29 2020-02-21 珠海市一微半导体有限公司 Multi-sensor map construction method, device and chip based on visual SLAM

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110599446B (en) * 2019-07-26 2022-06-17 深圳供电局有限公司 Method for judging switching-on position of isolating switch
CN114549601B (en) * 2022-02-11 2023-03-28 中国科学院精密测量科学与技术创新研究院 Landslide multi-temporal TLS point cloud fine registration method considering point pair reliability

Citations (5)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1352364B1 (en) * 2000-09-15 2010-02-24 Koninklijke Philips Electronics N.V. Image registration system and method using likelihood maximization
WO2016092408A1 (en) * 2014-12-09 2016-06-16 Koninklijke Philips N.V. Feedback for multi-modality auto-registration
WO2017096299A1 (en) * 2015-12-04 2017-06-08 Autodesk, Inc. Keypoint-based point-pair-feature for scalable automatic global registration of large rgb-d scans
US9613465B1 (en) * 2015-12-14 2017-04-04 Industrial Technology Research Institute Method for suturing 3D coordinate information and the device using the same
WO2017199141A1 (en) * 2016-05-20 2017-11-23 Nokia Usa Inc. Point cloud matching method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288050A (en) * 2019-07-02 2019-09-27 广东工业大学 Hyperspectral and LiDAR image automatic registration method based on clustering and optical flow method
CN110288050B (en) * 2019-07-02 2021-09-17 广东工业大学 Hyperspectral and LiDar image automatic registration method based on clustering and optical flow method
CN110823211A (en) * 2019-10-29 2020-02-21 珠海市一微半导体有限公司 Multi-sensor map construction method, device and chip based on visual SLAM

Also Published As

Publication number Publication date
GB201714179D0 (en) 2017-10-18

Similar Documents

Publication Publication Date Title
Hamzah et al. Stereo matching algorithm based on per pixel difference adjustment, iterative guided filter and graph segmentation
US8199977B2 (en) System and method for extraction of features from a 3-D point cloud
US9483703B2 (en) Online coupled camera pose estimation and dense reconstruction from video
WO2017199141A1 (en) Point cloud matching method
EP3326156B1 (en) Consistent tessellation via topology-aware surface tracking
Zhu et al. Accurate and occlusion-robust multi-view stereo
Alidoost et al. An image-based technique for 3D building reconstruction using multi-view UAV images
GB2566443A (en) Cross-source point cloud registration
Irschara et al. Large-scale, dense city reconstruction from user-contributed photos
Pantoja-Rosero et al. Generating LOD3 building models from structure-from-motion and semantic segmentation
Zhang et al. A new high resolution depth map estimation system using stereo vision and kinect depth sensing
CN115439607A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
Lee et al. Joint layout estimation and global multi-view registration for indoor reconstruction
Kim et al. Block world reconstruction from spherical stereo image pairs
Fu et al. Real-time dense 3D reconstruction and camera tracking via embedded planes representation
Teng et al. Reconstructing three-dimensional models of objects using a Kinect sensor
Bae et al. Fast and scalable 3D cyber-physical modeling for high-precision mobile augmented reality systems
CN116662600A (en) Visual positioning method based on lightweight structured line map
Mi et al. 3D reconstruction based on the depth image: A review
Hadfield et al. Stereo reconstruction using top-down cues
Djordjevic et al. An accurate method for 3D object reconstruction from unordered sparse views
CN114742967A (en) Visual positioning method and device based on building digital twin semantic graph
Arnaud et al. On the fly plane detection and time consistency for indoor building wall recognition using a tablet equipped with a depth sensor
Li et al. BDLoc: Global localization from 2.5D building map
Lyra et al. Development of an efficient 3D reconstruction solution from permissive open-source code

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)