CN113474818A - Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds


Info

Publication number
CN113474818A
Authority
CN
China
Prior art keywords
ppf
encoder
local
self
scan
Prior art date
Legal status
Pending
Application number
CN202080013849.6A
Other languages
Chinese (zh)
Inventor
Haowen Deng
Tolga Birdal
Slobodan Ilic
Current Assignee
Siemens AG
Original Assignee
Siemens AG
Priority date
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Priority claimed from PCT/EP2020/052128 external-priority patent/WO2020164911A1/en
Publication of CN113474818A publication Critical patent/CN113474818A/en
Pending legal-status Critical Current

Abstract

A method and an apparatus (1) for performing data-driven pairwise registration of three-dimensional 3D point clouds PC, the apparatus comprising: at least one scanner (2) adapted to capture a first local point cloud PC1 in a first scan and a second local point cloud PC2 in a second scan; a PPF derivation unit (3) adapted to process both captured local point clouds (PC1, PC2) to derive associated point pair features (PPF1, PPF2); a PPF autoencoder (4) adapted to process the derived point pair features (PPF1, PPF2) to extract corresponding PPF feature vectors (V_PPF1, V_PPF2); a PC autoencoder (5) adapted to process the captured local point clouds (PC1, PC2) to extract corresponding PC feature vectors (V_PC1, V_PC2); a subtractor (6) adapted to subtract the corresponding PPF feature vectors (V_PPF1, V_PPF2) from the PC feature vectors (V_PC1, V_PC2) to calculate latent difference vectors (LDV1, LDV2) for both captured point clouds (PC1, PC2), which latent difference vectors (LDV1, LDV2) are concatenated into a concatenated latent difference vector (CLDV); and a pose prediction network (8) adapted to calculate a relative pose prediction T between the first scan and the second scan performed by the scanner (2) based on the concatenated latent difference vector (CLDV).

Description

Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds
The present invention relates to a method and apparatus for performing data-driven pairwise registration of a three-dimensional point cloud generated by a scanner.
Matching local keypoint descriptors is a step in the automatic registration of overlapping three-dimensional scans. Point set registration, also known as point matching, is the process of finding a spatial transformation that aligns two sets of points. The point set or point cloud may comprise raw data from a three-dimensional scanner. In contrast to two-dimensional descriptors, learned three-dimensional descriptors lack any kind of local orientation assignment, and any subsequent pose estimator is therefore forced to rely on nearest-neighbor queries and exhaustive RANSAC iterations to robustly compute the alignment transformation. This is neither reliable nor computationally efficient. The matches may contain outliers that severely hamper scan registration, i.e., the alignment of the scans by computing a six-degree-of-freedom transformation between them. The conventional approach is to run a RANSAC procedure: three corresponding matched pairs are sampled repeatedly, a rigid transformation is computed from them, one scan is transformed onto the other, and the number of inlier keypoints is counted. Such a sampling process is computationally inefficient.
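For context, a minimal sketch of such a conventional RANSAC registration loop over matched keypoints is given below (NumPy; `estimate_rigid` is a hypothetical helper based on the standard Kabsch/Procrustes solution, and the inlier threshold is an illustrative choice, not a value from this patent):

```python
import numpy as np

def estimate_rigid(A, B):
    """Hypothetical Kabsch/Procrustes helper: rigid (R, t) mapping A onto B."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                      # reflection-corrected rotation
    return R, cb - R @ ca

def ransac_register(src, dst, iters=1000, thresh=0.05, seed=0):
    """Repeatedly sample 3 matches, fit a rigid transform, count inliers."""
    rng = np.random.default_rng(seed)
    best = (np.eye(3), np.zeros(3), 0)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = estimate_rigid(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = int((residuals < thresh).sum())
        if inliers > best[2]:
            best = (R, t, inliers)
    return best  # (R, t, inlier count) of the best-scoring hypothesis
```

The cubic space of 3-point samples is exactly the inefficiency the invention avoids, as discussed below.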
The article "3DMatch: Learning the Matching of Local 3D Geometry in Range Scans" by Andy Zeng et al. discloses a 3D descriptor for matching local geometry, focused on partially noisy 3D data obtained from a range of commercial sensors.
It is therefore an object of the present invention to provide a method and an apparatus which achieve a more efficient registration of three-dimensional point clouds.
This object is achieved according to a first aspect of the invention by a device comprising the features of claim 1.
According to a first aspect of the invention, the invention provides an apparatus for performing data-driven pairwise registration of a three-dimensional point cloud, the apparatus comprising: at least one scanner adapted to capture a first local point cloud in a first scan and a second local point cloud in a second scan;
a PPF derivation unit adapted to process both the captured local point clouds to derive associated point pair features;
a PPF autoencoder adapted to process the derived point pair features to extract corresponding PPF feature vectors;
a PC autoencoder adapted to process the captured local point clouds to extract corresponding PC feature vectors;
a subtractor adapted to subtract the corresponding PPF feature vectors from the PC feature vectors to calculate latent difference vectors for both of the captured point clouds, the latent difference vectors being concatenated into a concatenated latent difference vector; and
a pose prediction network adapted to calculate a relative pose prediction between the first scan and the second scan performed by the scanner based on the concatenated latent difference vector.
In a possible implementation of the apparatus according to the first aspect of the invention, the apparatus further comprises a pose selection unit adapted to process the pool of calculated relative pose predictions for selecting a suitable pose prediction.
In a possible implementation of the apparatus according to the first aspect of the present invention, the pose prediction network comprises a multi-layer perceptron MLP rotation network for decoding the concatenated latent difference vector.
In a possible implementation of the apparatus according to the first aspect of the present invention, the PPF autoencoder comprises:
an encoder adapted to encode the point pair features derived by the PPF derivation unit to calculate a latent PPF feature vector, the latent PPF feature vector being supplied to the subtractor; and
a decoder adapted to reconstruct the point pair features from the latent PPF feature vector.
In a possible implementation of the apparatus according to the first aspect of the present invention, the PC autoencoder comprises:
an encoder adapted to encode the captured local point cloud to calculate a latent PC feature vector, the latent PC feature vector being supplied to the subtractor; and
a decoder adapted to reconstruct the local point cloud from the latent PC feature vector.
According to a further second aspect, the invention also provides a data-driven computer-implemented method for pairwise registration of three-dimensional 3D point clouds, comprising the features of claim 6.
According to a second aspect, the invention provides a data-driven computer-implemented method for pairwise registration of three-dimensional point clouds, the method comprising the steps of:
capturing, by at least one scanner, a first local point cloud in a first scan and a second local point cloud in a second scan;
processing both the captured local point clouds to derive associated point pair features;
supplying the point pair features of both captured local point clouds to a PPF autoencoder to provide PPF feature vectors, and supplying the captured local point clouds to a PC autoencoder to provide PC feature vectors;
subtracting the corresponding PPF feature vectors provided by the PPF autoencoder from the PC feature vectors provided by the PC autoencoder to calculate respective latent difference vectors for the captured point clouds; and
concatenating the calculated latent difference vectors to provide a concatenated latent difference vector that is applied to a pose prediction network to calculate a relative pose prediction between the first scan and the second scan.
In a possible implementation of the method according to the second aspect of the invention, a pool of relative pose predictions is generated for a plurality of point cloud pairs, each comprising a first local point cloud and a second local point cloud.
In a further possible implementation of the method according to the second aspect of the invention, the pool of generated relative pose predictions is processed to perform pose verification.
In a further possible implementation of the method according to the second aspect of the present invention, the PPF autoencoder and the PC autoencoder are trained on the basis of a calculated loss function.
In a possible implementation of the method according to the second aspect of the invention, the loss function comprises a reconstruction loss function, a pose prediction loss function and a feature consistency loss function.
In a further possible implementation of the method according to the second aspect of the present invention, the PPF feature vectors provided by the PPF autoencoder comprise rotation-invariant features, and the PC feature vectors provided by the PC autoencoder comprise non-rotation-invariant features.
In the following, possible embodiments of the different aspects of the invention are described in more detail with reference to the drawings.
Fig. 1 shows a block diagram for illustrating a possible exemplary embodiment of an apparatus for performing a data-driven pairwise registration of a three-dimensional point cloud according to a first aspect of the present invention;
FIG. 2 shows a flow diagram illustrating a possible exemplary embodiment of a data-driven computer-implemented method for pairwise registration of three-dimensional point clouds according to further aspects of the invention;
fig. 3 shows a schematic diagram for illustrating a possible exemplary implementation of an apparatus according to the first aspect of the invention;
fig. 4 shows a further schematic diagram for illustrating a further exemplary implementation of the apparatus according to the first aspect of the present invention.
As can be seen in the block diagram of fig. 1, the apparatus 1 for performing data-driven pairwise registration of three-dimensional 3D point clouds PC comprises, in the exemplary embodiment shown, at least one scanner 2 adapted to capture a first local point cloud PC1 in a first scan and a second local point cloud PC2 in a second scan. In the illustrated exemplary embodiment of fig. 1, the apparatus 1 comprises one scanner 2 providing both point clouds PC1, PC2. In an alternative embodiment, two separate scanners may be used, with the first scanner generating the first local point cloud PC1 and the second scanner generating the second local point cloud PC2.
The apparatus 1 shown in fig. 1 comprises a PPF derivation unit 3 adapted to process both captured local point clouds PC1, PC2 to derive the associated point pair features PPF1, PPF2.
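For illustration, a minimal sketch of the standard four-dimensional point pair feature, PPF = (||d||, ∠(n_1, d), ∠(n_2, d), ∠(n_1, n_2)) with d = p_2 − p_1, as such a derivation unit might compute it, is given below (NumPy; pairing every point with the block center and using the averaged normal there are simplifying assumptions, not the patent's exact pairing strategy):

```python
import numpy as np

def _angle(a, b):
    """Angle between two vectors, numerically clamped."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return np.arccos(np.clip(np.dot(a, b) / denom, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    """PPF = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = p2 - p1
    return np.array([np.linalg.norm(d),
                     _angle(n1, d), _angle(n2, d), _angle(n1, n2)])

def block_ppfs(points, normals):
    """Pair every point of a local block with the block center (assumed scheme)."""
    c = points.mean(axis=0)
    nc = normals.sum(axis=0)
    nc = nc / (np.linalg.norm(nc) + 1e-12)   # averaged normal at the center
    return np.stack([point_pair_feature(c, nc, p, n)
                     for p, n in zip(points, normals)])
```

Since every entry depends only on distances and relative angles, applying one rotation to both points and normals leaves the output unchanged, which is the rotation invariance exploited by the PPF branch below.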
The apparatus 1 further comprises a PPF autoencoder 4 adapted to process the derived point pair features PPF1, PPF2 output by the PPF derivation unit 3 to extract corresponding PPF feature vectors V_PPF1, V_PPF2, as shown in the block diagram of fig. 1.
The apparatus 1 further comprises a PC autoencoder 5 adapted to process the captured local point clouds PC1, PC2 generated by the scanner 2 to extract corresponding PC feature vectors V_PC1, V_PC2.
The apparatus 1 further comprises a subtractor 6 adapted to subtract the corresponding PPF feature vectors V_PPF1, V_PPF2 from the PC feature vectors V_PC1, V_PC2 to compute latent difference vectors LDV1, LDV2 for both captured point clouds PC1, PC2.
The apparatus 1 comprises a concatenation unit 7 for concatenating the received latent difference vectors LDV1, LDV2 into a single concatenated latent difference vector CLDV, as illustrated in the block diagram of fig. 1.
The apparatus 1 further comprises a pose prediction network 8 adapted to calculate a relative pose prediction T between the first scan and the second scan performed by the scanner 2 based on the received concatenated latent difference vector CLDV. In a possible embodiment, the apparatus 1 further comprises a pose selection unit adapted to process a pool of calculated relative pose predictions T in order to select a suitable pose prediction T. The pose prediction network 8 of the apparatus 1 may, in a possible embodiment, comprise a multi-layer perceptron MLP rotation network for decoding the received concatenated latent difference vector CLDV.
The apparatus 1 shown in fig. 1 comprises two autoencoders, namely the PPF autoencoder 4 and the PC autoencoder 5. The autoencoders 4, 5 may comprise neural networks adapted to copy their input to their output. An autoencoder works by compressing the received input into a latent space representation and then reconstructing the output from this latent space representation. The autoencoders each comprise an encoder and a decoder.
In a possible embodiment, the PPF autoencoder 4 comprises an encoder adapted to encode the point pair features PPF derived by the PPF derivation unit 3 to calculate latent PPF feature vectors V_PPF1, V_PPF2, which are supplied to the subtractor 6 of the apparatus 1. The PPF autoencoder 4 further comprises a decoder adapted to reconstruct the point pair features from the latent PPF feature vectors.
In addition, the PC autoencoder 5 of the apparatus 1 comprises, in a possible embodiment: an encoder adapted to encode the captured local point clouds to compute latent PC feature vectors V_PC1, V_PC2, which are supplied to the subtractor 6; and a decoder adapted to reconstruct the local point cloud PC from the latent PC feature vectors.
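A minimal sketch of such an encoder-decoder pair is given below (PyTorch; the layer sizes and the plain MLP decoder are simplifying assumptions and do not reproduce the folding-based decoder of the networks described later):

```python
import torch
import torch.nn as nn

class PointAutoencoder(nn.Module):
    """Shared per-point MLP + max-pool encoder, MLP decoder (simplified sketch)."""
    def __init__(self, in_dim=3, latent_dim=512, n_out=256):
        super().__init__()
        self.n_out, self.out_dim = n_out, in_dim
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_out * in_dim))

    def encode(self, x):                            # x: (B, N, in_dim)
        return self.encoder(x).max(dim=1).values    # (B, latent_dim)

    def forward(self, x):
        z = self.encode(x)
        rec = self.decoder(z).view(-1, self.n_out, self.out_dim)
        return z, rec

# One instance per modality: 3D points (PC) and 4D point pair features (PPF).
pc_ae = PointAutoencoder(in_dim=3)
ppf_ae = PointAutoencoder(in_dim=4)
```

The max-pooling over points makes the latent code independent of point ordering, a standard design choice for point cloud encoders.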
FIG. 2 illustrates a possible exemplary embodiment of a data-driven computer-implemented method for pairwise registration of three-dimensional 3D point clouds according to further aspects of the invention. In the illustrated exemplary embodiment, the data driven computer implemented method includes five main steps S1 to S5.
In a first step S1, a first local point cloud PC1 is captured in a first scan and a second local point cloud PC2 is captured in a second scan by at least one scanner, for example by a single scanner 2 as shown in the block diagram of fig. 1, or by two separate scanners.
In a further step S2, both captured local point clouds PC1, PC2 are processed to derive the associated point pair features PPF1, PPF2.
In a further step S3, the point pair features PPF1, PPF2 derived in step S2 for both captured local point clouds PC1, PC2 are supplied to the PPF autoencoder 4 to provide PPF feature vectors V_PPF1, V_PPF2, and the captured local point clouds PC1, PC2 are supplied to the PC autoencoder 5 to provide PC feature vectors V_PC1, V_PC2.
In a further step S4, the corresponding PPF feature vectors V_PPF1, V_PPF2 provided by the PPF autoencoder 4 are subtracted from the PC feature vectors V_PC1, V_PC2 provided by the PC autoencoder 5 to calculate latent difference vectors LDV1, LDV2 for the captured point clouds PC1, PC2, respectively.
In a further step S5, the two calculated latent difference vectors LDV1, LDV2 are concatenated to provide a concatenated latent difference vector CLDV, which is applied to a pose prediction network to calculate a relative pose prediction T between the first scan and the second scan.
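Putting steps S3 to S5 together, the inference path could look like the following sketch (PyTorch; it reuses the hypothetical `pc_ae`/`ppf_ae` modules sketched above, and the pose network sizes and quaternion output are assumptions for illustration):

```python
import torch
import torch.nn as nn

latent_dim = 512
pose_net = nn.Sequential(                  # pose prediction network (assumed sizes)
    nn.Linear(2 * latent_dim, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 4))                     # quaternion output

def predict_relative_pose(pc1, ppf1, pc2, ppf2):
    """pc*: (B, N, 3) point blocks; ppf*: (B, N, 4) their point pair features."""
    ldv1 = pc_ae.encode(pc1) - ppf_ae.encode(ppf1)   # latent difference, scan 1
    ldv2 = pc_ae.encode(pc2) - ppf_ae.encode(ppf2)   # latent difference, scan 2
    cldv = torch.cat([ldv1, ldv2], dim=1)            # concatenated LDV
    q = pose_net(cldv)
    return q / q.norm(dim=1, keepdim=True)           # normalized quaternion
```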
In a possible implementation, a pool of relative pose predictions T may be generated for a plurality of point cloud pairs, each comprising a first local point cloud PC1 and a second local point cloud PC2. In a possible implementation, the pool of generated relative pose predictions T may be processed to perform pose verification.
In a possible embodiment, the PPF autoencoder 4 and the PC autoencoder 5 may be trained based on a calculated loss function L. In a possible exemplary embodiment, the loss function L may comprise a reconstruction loss function L_rec, a pose prediction loss function L_pose and a feature consistency loss function L_feat.
In a possible embodiment, the PPF feature vectors V_PPF1, V_PPF2 provided by the PPF autoencoder 4 comprise rotation-invariant features, and the PC feature vectors V_PC1, V_PC2 provided by the PC autoencoder 5 comprise non-rotation-invariant features.
With the data-driven computer-implemented method for pairwise registration of three-dimensional point clouds PC, robust local feature descriptors together with the relative transformation between matching local keypoint blocks can be learned in three-dimensional scans. This reduces the computational complexity of estimating the relative transformation between matched keypoints, i.e., of the registration. Furthermore, the computer-implemented method according to the invention is faster and more accurate than the conventional RANSAC process, and it also results in learning more robust keypoints or feature descriptors compared to conventional methods.
The method according to the invention decouples the pose from the intermediate features of a block pair. The method and the apparatus 1 according to the present invention employ a dual architecture comprising the PPF autoencoder 4 and the PC autoencoder 5, wherein each autoencoder comprises an encoder and a decoder, as also shown in the block diagram of fig. 3.
In the illustrated implementation of fig. 3, the autoencoder comprises an encoder ENC and a decoder DEC. The autoencoder AE receives the point cloud PC or the point pair features PPF and compresses the input into a latent feature representation. As also shown in the block diagram of fig. 4, the apparatus may comprise two separate autoencoders AE for each point cloud, each with its own input source. The PPF folding networks (PPF-FoldNet) 4A, 4B and the PC folding networks 5A, 5B may be trained separately and are able to extract rotation-invariant and rotation-variant features, respectively. The features extracted by the PPF folding networks 4A, 4B are rotation invariant, i.e., they are identical for the same local block in different poses, whereas the features extracted by the PC folding networks 5A, 5B change with the pose, i.e., they are non-rotation-invariant. The method and the apparatus 1 therefore use the features extracted by the PPF folding network as canonical features, i.e., the features of the block in a canonical pose. By subtracting the PPF folding network features from the PC folding network features, the remainder contains mainly geometry-free pose information. This geometry-free pose information may be supplied to the pose prediction network 8, which decodes the pose information from the obtained feature differences.
With respect to data preparation, finding a canonical pose for a given local block is not easy. Local reference frames may help but are generally unreliable, because they are strongly affected by noise. Owing to this lack of canonical poses, defining the absolute pose of a local block is challenging. What matters, however, is that a local block from one partial scan can be aligned with its corresponding local block from another partial scan under the same relative transformation. Such ground-truth information is already provided in many available data sets for training. Instead of trying to find the true pose of a local block as a training supervision, the method according to the invention combines the pose features of two corresponding blocks and uses the pose prediction network 8 to recover the relative pose between them, as also shown in the block diagram of fig. 4.
Given that segment pairs or partial scan pairs can be used to train the network to predict the relative pose T, it may be beneficial to use this pair relationship as an additional signal for the PPF folding network to extract better local features. The training of the network can be done in a completely unsupervised manner. The existing pair relationships can be used to ensure that features extracted from the same block are as close as possible, regardless of noise, missing parts or clutter. During training, an additional L2 loss may be imposed on the intermediate features the PPF folding network generates for the block pairs. In this way, the quality of the learned features can be further improved.
For a given partial scan pair, a set of local correspondences may be established using the features extracted from the PPF folding network. Each corresponding pair can generate a hypothesis for the relative pose between the blocks, which also constitutes a vote for the relative pose between the two partial scans. Thus, a pool of hypotheses or relative pose predictions generated by all found correspondences is obtained. Since not all generated hypotheses are correct, in a possible embodiment the hypotheses may be fed into a RANSAC-like pipeline, i.e., each hypothesis is exhaustively verified and scored, and the best-scoring hypothesis is retained as the final prediction.
In a further possible implementation, the hypotheses may be transformed into Hough space to find peaks in the space where most hypotheses cluster together. In general, this relies on the assumption that the correctly predicted subset groups together, which is valid in most cases.
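A minimal sketch of such peak finding, assuming each hypothesis is reduced to a quaternion-plus-translation vector and binned on a coarse grid (the bin sizes and the averaging of the densest bin are illustrative assumptions, not values from the patent):

```python
import numpy as np
from collections import Counter

def hough_peak(hypotheses, rot_bin=0.1, trans_bin=0.05):
    """hypotheses: (N, 7) rows [qw, qx, qy, qz, tx, ty, tz]; find densest bin."""
    H = np.asarray(hypotheses, dtype=float).copy()
    flip = np.sign(H[:, :1]) + (H[:, :1] == 0)   # resolve the q / -q ambiguity
    H[:, :4] *= flip
    keys = [tuple(np.round(h[:4] / rot_bin).astype(int)) +
            tuple(np.round(h[4:] / trans_bin).astype(int)) for h in H]
    peak_key, _ = Counter(keys).most_common(1)[0]
    members = np.stack([h for h, k in zip(H, keys) if k == peak_key])
    pose = members.mean(axis=0)                  # average pose of densest cluster
    pose[:4] /= np.linalg.norm(pose[:4])         # re-normalize the quaternion
    return pose
```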
With the method according to the invention, better local features for establishing local correspondences can be generated. The approach is able to predict the relative pose T given only a single pair of blocks, one from each scan, unlike the RANSAC process, which requires at least three pairs to generate a minimal hypothesis.
Due to the combination of the advanced network structure and the weakly supervised training scheme, better local features can be extracted. A pipeline that recovers relative pose information given a pair of local blocks or point clouds may be incorporated into a robust 3D reconstruction pipeline.
Purely geometric local blocks typically carry two pieces of information, namely structure and motion:

(1) the 3D structure, summarized by the points themselves:

P = {p_i} ∈ R^(N×3), where p = [x, y, z]^T;

(2) the motion, which in this context corresponds to a 3D transformation or pose T_i ∈ SE(3) that globally orients and spatially positions the point set:

T = [ R t ; 0^T 1 ] (1)

where R ∈ SO(3) and t ∈ R^3. The point set P_i of a local block is generally considered to be a transformed copy

P_i = T_i ⊗ P_i^c

of its canonical version P_i^c. Typically, finding such a canonical absolute pose T_i from a single local block involves computing local reference frames, which are known to be unreliable [36]. The invention is based on the premise that a good local (block-level) pose estimate results in a good global rigid alignment of the two segments. By first decoupling the pose component from the structural information, a data-driven predictor network can be designed that regresses the pose for any block and exhibits good generalization properties.
A naive way to achieve tolerance to 3D structure is to train a network for pose prediction conditioned on a database of input blocks, leaving the invariance to the network. Unfortunately, networks trained in this manner either require a very large number of unique local blocks or simply lack generalization. To alleviate this drawback, the structural components are cancelled out by training an invariant-equivariant network pair and using intermediate latent space arithmetic. The invariant function Ψ is characterized by:

Ψ(P) = Ψ(T ⊗ P_c) = g(T) Ψ(P_c) (2)

where g(·) is a function that depends only on the pose. When g(T) = I, Ψ is referred to as T-invariant: for any input P it yields the canonical result Ψ(P) ← Ψ(P_c). When g(T) ≠ I, it can be assumed that the equivariant action of T can be approximated by an additive linear operation:

g(T) Ψ(P_c) ≈ h(T) + Ψ(P_c) (3)

where h(T) is a potentially highly non-linear function of T. Substituting equation (3) into equation (2) yields:

Ψ(P) − Ψ(P_c) ≈ h(T) (4)
that is, the difference in potential space can approximate the pose to a maximum of non-linearity h. Approximating the inverse of h by means of a four-layer MLP network
Figure BDA0003207055400000091
And by regressing the motion (rotation) term:
ρ(f)≈R|t (5)
wherein f ═ Ψ (P) - Ψ (P)c). Note that f only illustrates motion, and can therefore be generalized to any local block structure, yielding a powerful pose predictor under the above assumptions.
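As a sketch, such a four-layer MLP ρ ≈ h^(−1) could be set up as follows (PyTorch; the input and hidden dimensions are assumptions, and this plays the same role as the pose network sketched earlier):

```python
import torch.nn as nn

# rho decodes f = Psi(P) - Psi(P_c) into a rotation; dimensions are assumed.
latent_dim = 512
rho = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 4))   # quaternion output, normalized before use
```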
Note that ρ (·) can be used directly to return the absolute pose to the canonical frame. However, this is undesirable due to the above-mentioned difficulty in defining unique local reference frames. Since a given situation takes into account a pair of scenes, the relative pose can be estimated safely rather than the absolute pose, replacing the prerequisite of a good estimate of the LRF. This also helps to easily make the labels required for training. Therefore, ρ (-) can be modeled as a relative pose predictor network 8, as shown in fig. 1, 4.
If two scenes (i, j) are well registered under a rigid transformation T_ij, their corresponding local structures are also well aligned under T_ij. Therefore, the relative pose between local blocks can easily be obtained by computing the relative pose between the segments.
To achieve generalized relative pose prediction, three key components can be implemented: the invariant network Ψ(P_c), where g(T) = I; the equivariant network Ψ(P), which varies with the input; and the MLP ρ(·). The recent PPF-FoldNet autoencoder is suitable for modeling Ψ(P_c), because it is unsupervised, operates on point blocks and achieves true invariance, since the point pair features (PPF) completely marginalize the motion term. Interestingly, when the same network architecture is preserved but the PPF part is replaced with the 3D points themselves, the intermediate features depend on both the structural and the pose information. This PC folding network is used as the equivariant network Ψ(P) = g(T) Ψ(P_c). By using the PPF folding network and the PC folding network, rotation-invariant and rotation-variant features can be learned separately. As shown in fig. 3, the PPF folding network and the PC folding network share the same architecture while performing different encodings of the local blocks. The difference of the encoder outputs of the two networks, i.e., of the latent features of the PPF folding network and the latent features of the PC folding network, is taken by the subtractor 6, resulting in features that are almost exclusively specific to the pose (motion) information. These features are then fed into the generalized pose prediction network 8 to recover the rigid relative transformation. The overall architecture of the complete relative pose prediction is shown in fig. 4.
Multiple cues, both supervised and unsupervised, may be used to train the network and to guide it in finding the optimal parameters. In particular, the loss function L may comprise three parts:
L = L_rec + λ_1 L_pose + λ_2 L_feat (6)

where L_rec, L_pose and L_feat denote the reconstruction loss, the pose prediction loss and the feature consistency loss, respectively. L_rec reflects the reconstruction fidelity of the PC folding network and the PPF folding network. In order to enable the encoders of the PPF/PC folding networks to generate good features for pose regression as well as for finding robust local correspondences, the two autoencoders AE can, similar to PPF-FoldNet, be trained in an unsupervised way using the chamfer distance as a metric:

L_rec = d(P, P̂) + d(F_ppf, F̂_ppf) (7)

where

d(S, Ŝ) = max{ (1/|S|) Σ_{s ∈ S} min_{ŝ ∈ Ŝ} ||s − ŝ||_2 , (1/|Ŝ|) Σ_{ŝ ∈ Ŝ} min_{s ∈ S} ||s − ŝ||_2 } (8)

the ^ operator refers to the (estimated) reconstructed set, and F_ppf refers to the point pair features computed on the same point set.
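A sketch of the chamfer metric of equation (8) and of the combined objective of equation (6) is given below (PyTorch; the weights λ_1, λ_2 are illustrative assumptions):

```python
import torch

def chamfer(S, S_hat):
    """Symmetric chamfer distance between point sets S (N, D) and S_hat (M, D)."""
    d = torch.cdist(S, S_hat)                       # (N, M) pairwise distances
    return torch.max(d.min(dim=1).values.mean(),    # S -> S_hat direction
                     d.min(dim=0).values.mean())    # S_hat -> S direction

def total_loss(pc, pc_rec, ppf, ppf_rec, l_pose, l_feat, lam1=1.0, lam2=1.0):
    """L = L_rec + lam1 * L_pose + lam2 * L_feat, per equation (6)."""
    l_rec = chamfer(pc, pc_rec) + chamfer(ppf, ppf_rec)
    return l_rec + lam1 * l_pose + lam2 * l_feat
```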
The corresponding two local blocks are centered and normalized before being fed to the PC/PPF folding networks. This eliminates the translational part t ∈ R^3.
The aim of the pose prediction loss L_pose is to enable the pose prediction network to predict the relative rotation R_12 ∈ SO(3) between given blocks. Thus, a preferred choice for L_pose describes the difference between the predicted rotation and the ground-truth rotation:

L_pose = ||q − q*||_2 (9)

Note that the rotation is parameterized by quaternions. This is mainly due to the reduced number of regression parameters and the lightweight projection operation, vector normalization.
The translation t*, conditioned on a hypothesized correspondence (p_1, p_2) and the predicted rotation q*, can be computed as:

t* = p_1 − R* p_2 (10)

where R* is the rotation matrix corresponding to q*.
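A sketch of equation (10), together with a standard quaternion-to-matrix conversion, could look as follows (NumPy; the [w, x, y, z] quaternion convention is an assumption):

```python
import numpy as np

def quat_to_mat(q):
    """Rotation matrix for a quaternion q = [w, x, y, z] (normalized first)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def translation_from_match(p1, p2, q_star):
    """Equation (10): t* = p1 - R* p2 for one hypothesized correspondence."""
    return p1 - quat_to_mat(q_star) @ p2
```

In this way, a single correspondence plus the regressed rotation yields a full 6DoF hypothesis.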
The pose prediction network 8 requires local block pairs for training. This pair information can additionally be utilized as a weak supervisory signal to further facilitate the training of the PPF folding network. Such guidance can improve the quality of the intermediate latent features, which are otherwise trained in a completely unsupervised manner. In particular, corresponding blocks subject to noise, missing data or clutter can generate high reconstruction losses, causing the local features to differ even for the same local block. The additional information helps to ensure that features extracted from the same block are as close as possible in the embedding space, which is very beneficial because local correspondences are established by nearest-neighbor search in the feature space. The feature consistency loss L_feat is expressed as:

L_feat = (1/|C|) Σ_{(p_1, p_2) ∈ C} ||f_{p_1} − f_{p_2}||_2 (11)

where C represents the set of corresponding local blocks, and f_p is the feature extracted at point p by the PPF folding network, f_p ∈ F_ppf.
The full 6DoF pose can be parameterized by the translation (3DoF), conditioned on a matching point pair, and by the 3DoF orientation provided by the pose prediction network. Thus, having a set of correspondences is equivalent to having a set of pre-generated transformation hypotheses. Note that this is in contrast to the standard RANSAC method, in which the pose is parameterized by m = 3 correspondences, so that establishing N correspondences gives rise to

C(N, 3) = N(N − 1)(N − 2)/6

hypotheses to be verified. The small number of hypotheses of the present approach, which is linear in the number of correspondences, makes it possible to exhaustively evaluate the whole set of hypothesis-generating matching pairs for pose verification. The estimate can be refined by recomputing the transformation using the surviving inliers. The hypothesis with the highest score is then retained as the final decision.

Claims (9)

1. An apparatus (1) for performing data-driven pairwise registration of a three-dimensional 3D point cloud PC, the apparatus comprising:
(a) at least one scanner (2), the at least one scanner (2) being adapted to capture a first local point cloud PC1 in a first scan and a second local point cloud PC2 in a second scan, wherein the first scan comprises a first local structure of a first scene and the second scan comprises a second local structure of a second scene, the first local structure of the first scene corresponding to the second local structure of the second scene and having a relative pose to the second local structure of the second scene;
(b) a PPF derivation unit (3), the PPF derivation unit (3) being adapted to process both captured local point clouds (PC1, PC2) to derive associated point pair features (PPF1, PPF2);
(c) a PPF autoencoder (4), the PPF autoencoder (4) being adapted to process the derived point pair features (PPF1, PPF2) to extract corresponding PPF feature vectors (V_PPF1, V_PPF2);
(d) a PC autoencoder (5), the PC autoencoder (5) being adapted to process the captured local point clouds (PC1, PC2) to extract corresponding PC feature vectors (V_PC1, V_PC2);
(e) a subtractor (6), the subtractor (6) being adapted to subtract the corresponding PPF feature vectors (V_PPF1, V_PPF2) from the PC feature vectors (V_PC1, V_PC2) to calculate latent difference vectors (LDV1, LDV2) for both captured point clouds (PC1, PC2), said latent difference vectors (LDV1, LDV2) being concatenated into a concatenated latent difference vector (CLDV); and
(f) a pose prediction network (8), the pose prediction network (8) being adapted to calculate a relative pose prediction T between the first scan and the second scan performed by the scanner (2) based on the concatenated latent difference vector (CLDV),
wherein the PPF feature vectors (V_PPF1, V_PPF2) provided by the PPF autoencoder (4) comprise rotation-invariant features, and
wherein the PC feature vectors (V_PC1, V_PC2) provided by the PC autoencoder (5) comprise non-rotation-invariant features.
2. The apparatus as defined in claim 1, wherein the apparatus (1) further comprises a pose selection unit adapted to process a pool of the calculated relative pose predictions T for selecting a suitable pose prediction T.
3. The apparatus as defined in claim 2, wherein the pose prediction network (8) comprises a multi-layer perceptron MLP rotation network for decoding the concatenated latent difference vector (CLDV).
4. The apparatus of any of the preceding claims 1 to 3, wherein the PPF autoencoder (4) comprises:
an encoder (4A), the encoder (4A) being adapted to encode the point pair features PPF derived by the PPF derivation unit to calculate latent PPF feature vectors (V_PPF1, V_PPF2) supplied to the subtractor; and
a decoder (4B), the decoder (4B) being adapted to reconstruct the point pair features PPF from the latent PPF feature vectors.
5. The apparatus of any of the preceding claims 1 to 4, wherein the PC autoencoder (5) comprises:
an encoder (5A), the encoder (5A) being adapted to encode the captured local point cloud (PC) to calculate latent PC feature vectors (V_PC1, V_PC2) supplied to the subtractor; and
a decoder (5B), the decoder (5B) being adapted to reconstruct the local point cloud PC from the latent PC feature vectors.
6. A data-driven computer-implemented method for pairwise registration of three-dimensional 3D point clouds PC, the method comprising the steps of:
(a) capturing (S1), by at least one scanner, a first local point cloud PC1 in a first scan and a second local point cloud PC2 in a second scan, wherein the first scan includes a first local structure of a first scene and the second scan includes a second local structure of a second scene, the first local structure of the first scene corresponding to the second local structure of the second scene and having a relative pose to the second local structure of the second scene;
(b) processing (S2) both captured local point clouds (PC1, PC2) to derive associated point pair features (PPF1, PPF2);
(c) supplying (S3) the point pair features (PPF1, PPF2) of both captured local point clouds (PC1, PC2) to a PPF autoencoder to provide PPF feature vectors (V_PPF1, V_PPF2), and supplying the captured local point clouds (PC1, PC2) to a PC autoencoder to provide PC feature vectors (V_PC1, V_PC2);
(d) subtracting (S4) the corresponding PPF feature vectors (V_PPF1, V_PPF2) provided by the PPF autoencoder from the PC feature vectors (V_PC1, V_PC2) provided by the PC autoencoder to calculate respective latent difference vectors (LDV1, LDV2) for the captured point clouds (PC1, PC2); and
(e) concatenating (S5) the calculated latent difference vectors (LDV1, LDV2) to provide a concatenated latent difference vector (CLDV) which is applied to a pose prediction network to calculate a relative pose prediction T between the first scan and the second scan,
wherein the PPF feature vectors (V_PPF1, V_PPF2) provided by the PPF autoencoder comprise rotation-invariant features, and
wherein the PC feature vectors (V_PC1, V_PC2) provided by the PC autoencoder comprise non-rotation-invariant features.
7. The method of claim 6, wherein a pool of relative pose predictions T is generated for a plurality of point cloud pairs PC, each comprising a first local point cloud PC1 and a second local point cloud PC2.
8. The method of claim 7, wherein the pool of generated relative pose predictions T is processed to perform pose verification.
9. The method according to any of the preceding claims 6 to 8, wherein the PPF autoencoder and the PC autoencoder are trained based on a calculated loss function L.
CN202080013849.6A 2019-02-11 2020-01-29 Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds Pending CN113474818A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19156435.0 2019-02-11
PCT/EP2020/052128 WO2020164911A1 (en) 2019-02-11 2020-01-29 An apparatus and a method for performing a data driven pairwise registration of three-dimensional point clouds

Publications (1)

Publication Number Publication Date
CN113474818A (zh) 2021-10-01

Family

ID=77868575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080013849.6A Pending CN113474818A (en) 2019-02-11 2020-01-29 Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds

Country Status (1)

Country Link
CN (1) CN113474818A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination