CN113160411A - Indoor three-dimensional reconstruction method based on RGB-D sensor - Google Patents

Indoor three-dimensional reconstruction method based on RGB-D sensor

Info

Publication number
CN113160411A
Authority
CN
China
Prior art keywords
scene
rgb
scanning
model
cad model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110441618.1A
Other languages
Chinese (zh)
Inventor
颜成钢
吕坤
朱尊杰
黄培武
徐枫
孙垚棋
张继勇
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110441618.1A priority Critical patent/CN113160411A/en
Publication of CN113160411A publication Critical patent/CN113160411A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/10 - Constructive solid geometry [CSG] using solid primitives, e.g. cylinders, cubes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 - Indexing scheme for image data processing or generation, in general
    • G06T 2200/04 - Indexing scheme for image data processing or generation, in general involving 3D image data

Landscapes

  • Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an indoor three-dimensional reconstruction method based on an RGB-D sensor. By introducing RGB information into object recognition and classification and using a graph attention module, the method better avoids unsatisfactory object recognition results caused by noise in the raw scan data. Unlike existing reconstruction pipelines, it replaces each object in the scanned scene with a CAD model to obtain a clean and compact representation; when the final reconstruction is completed, the key points of the CAD models are compared with those of the objects in the scene, and the alignment error is reduced through iterative optimization so that the size and pose of each CAD model agree with the corresponding object in the scene. The invention addresses inaccurate classification and recognition and insufficient reconstruction precision of the whole scene caused by sensor noise and blur produced by sensor motion, and, because CAD models are introduced, the scene can be edited freely, which increases its flexibility.

Description

Indoor three-dimensional reconstruction method based on RGB-D sensor
Technical Field
The invention belongs to the field of computer vision and mainly relates to three-dimensional scene reconstruction. A CAD (computer-aided design) model of each object in the scene is retrieved jointly from geometric information, functional information and RGB (red, green, blue) information, and the final overall layout is optimized through repeated iteration to complete indoor three-dimensional reconstruction with higher precision.
Background Art
Three-dimensional reconstruction refers to establishing a mathematical model of a three-dimensional object that is suitable for computer representation and processing. It is the basis for processing, operating on and analyzing the properties of three-dimensional objects in a computer environment, and a key technology for building virtual reality that expresses the objective world in a computer.
In computer vision, three-dimensional reconstruction is the process of recovering three-dimensional information from single-view or multi-view images. Because the information in a single view is incomplete, reconstruction must draw on empirical knowledge.
In recent years, the widespread use of consumer-grade RGB-D sensors such as Microsoft Kinect, Intel RealSense and Google Tango has driven significant progress in RGB-D reconstruction. A very prominent research direction is volumetric fusion, in which depth data are integrated into a truncated signed distance function (TSDF). Many modern real-time reconstruction methods, such as KinectFusion, are based on this surface representation. To make the representation more memory-efficient, octree- and hash-based scene representations have been proposed. Another fusion approach is point-based; its reconstruction quality is slightly lower, but it is more flexible in handling scene dynamics and can adapt dynamically to loop closures. Recent RGB-D reconstruction frameworks combine efficient scene representations with global pose estimation. The latest research also adopts deep-learning-based reconstruction; although reconstruction quality has improved to some extent, noise in the sensor data and blur caused by sensor motion still leave the resulting three-dimensional scan of the scene noisy and often incomplete.
One solution to the above problems is to replace each incompletely scanned object with a CAD model: retrieve a CAD model of the scanned object from a model library, substitute it for the object, and solve the 9-degree-of-freedom alignment (i.e. size, position and orientation). CAD models are complete, clean and lightweight; if every object in the scene can be represented this way, the problems of noisy or missing 3D scan data caused by sensor noise or motion blur can be overcome. Moreover, because a CAD model is used as the object representation in the scene, its editability gives the whole scene greater engineering value and allows a more flexible representation, so the reconstruction can serve higher-level applications such as AR/VR. However, finding a CAD model and conforming it to the input scan involves several separate steps: correspondence lookup, correspondence matching, and finally optimization of the potential matching correspondences for each candidate CAD model. These steps are chained from top to bottom, so a small error in any one of them may make the final result differ greatly from the expected one; even the deep learning methods that have become popular in recent years cannot fully compensate for inaccurate raw data acquired by the sensor. The academic community therefore urgently needs a comprehensive method capable of generating feedback; once this problem is solved, indoor three-dimensional reconstruction technology will improve greatly.
Graph attention module:
a point cloud with three-dimensional coordinates and optional features is input into the graph attention convolution module. A k-nearest-neighbor (KNN) graph is computed from the spatial location of each point, generating a set of local neighbors whose features are concatenated with the global features computed by a global attention module. These concatenated features are fed into an MLP layer, and its output is multiplied element-wise by the edge attention weights and density attention weights obtained from the edge and density attention modules. After a further MLP layer and max pooling, a feature map of the same size as the input data is finally obtained.
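For illustration only, the following sketch (with hypothetical layer sizes, a single attention head and no density attention, so it is not the exact architecture of the invention) shows the flow described above: build a KNN graph over the points, compute edge attention weights, apply an MLP to the concatenated features, and aggregate neighbors by max pooling.

```python
import torch
import torch.nn as nn

class GraphAttentionConv(nn.Module):
    """Minimal sketch of a graph-attention convolution over a point cloud."""
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())
        self.attn = nn.Linear(2 * in_dim, 1)                  # edge attention weights

    def forward(self, xyz, feats):
        # xyz: (N, 3) point coordinates, feats: (N, C) per-point features
        dist = torch.cdist(xyz, xyz)                          # (N, N) pairwise distances
        knn_idx = dist.topk(self.k, largest=False).indices    # (N, k) neighbor indices
        neighbor = feats[knn_idx]                             # (N, k, C)
        center = feats.unsqueeze(1).expand(-1, self.k, -1)    # (N, k, C)
        edge = torch.cat([center, neighbor], dim=-1)          # (N, k, 2C)
        weights = torch.softmax(self.attn(edge), dim=1)       # attention over neighbors
        messages = self.edge_mlp(edge) * weights              # weighted neighbor messages
        return messages.max(dim=1).values                     # max-pooled per-point feature

# Example: 1024 points whose coordinates double as initial features
pts = torch.rand(1024, 3)
out = GraphAttentionConv(in_dim=3, out_dim=64)(pts, pts)      # (1024, 64)
```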
Fully connected layer:
fully connected (FC) layers act as the "classifier" of a convolutional neural network. If the convolutional layers, pooling layers and activation layers map the raw data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space. In practice, a fully connected layer can be implemented by a convolution: a fully connected layer whose preceding layer is also fully connected can be converted into a convolution with a 1x1 kernel, while a fully connected layer whose preceding layer is convolutional can be converted into a global convolution with an h x w kernel, where h and w are the height and width of the preceding layer's convolution result. After the geometric information and the RGB information have been extracted separately, this layer drives each voxel toward completeness so that the result trained by the neural network meets expectations.
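The fully-connected/convolution equivalence mentioned above can be checked numerically. A small sketch (hypothetical sizes) converts an FC layer whose preceding layer is convolutional into a global convolution with an h x w kernel and verifies that both compute the same mapping:

```python
import torch
import torch.nn as nn

# An FC layer applied to a flattened (C, h, w) feature map equals a convolution
# whose kernel covers the whole map (kernel size h x w).
C, h, w, out_dim = 8, 4, 4, 10
fc = nn.Linear(C * h * w, out_dim)

conv = nn.Conv2d(C, out_dim, kernel_size=(h, w))
with torch.no_grad():                               # copy the FC weights into the kernel
    conv.weight.copy_(fc.weight.view(out_dim, C, h, w))
    conv.bias.copy_(fc.bias)

x = torch.rand(1, C, h, w)
y_fc = fc(x.flatten(1))                             # (1, out_dim)
y_conv = conv(x).flatten(1)                         # (1, out_dim)
print(torch.allclose(y_fc, y_conv, atol=1e-6))      # True: identical mapping
```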
Loss function: during the training of a neural network, a loss function evaluates how well the network has been trained; the network aims to reduce this value as much as possible and adjusts its parameters over repeated iterations to complete training.
Signed distance field (SDF): a field that stores, for each point, its distance to a surface, with a sign that indicates whether the point lies inside or outside the region bounded by that surface.
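A minimal illustration of a signed distance field, assuming a simple analytic surface (a sphere) rather than a scanned scene:

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Evaluate the SDF at the centers of a small 32^3 voxel grid
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 32)] * 3, indexing="ij"), axis=-1)
sdf = sphere_sdf(grid.reshape(-1, 3), center=np.zeros(3), radius=0.5).reshape(32, 32, 32)
print((sdf < 0).sum(), "voxels lie inside the surface")
```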
Voxel: a voxel is short for volume element (volume pixel). A volume containing voxels can be visualized by volume rendering or by extracting a polygonal isosurface at a given threshold contour. A voxel is the three-dimensional counterpart of a pixel in an RGB image, so heat-map prediction methods from two-dimensional images can be transferred to three-dimensional space on the same principle to complete the matching of two three-dimensional objects.
Levenberg-Marquardt (LM) algorithm: the LM algorithm is an iterative algorithm for solving least-squares problems. It can be seen as a combination of the steepest descent method and the Gauss-Newton method, switching between them by adjusting the damping parameter μ. When the current solution is far from the optimum, the algorithm behaves more like steepest descent, slow but guaranteed to decrease; when the current solution is close to the optimum, it behaves like the Gauss-Newton method and converges quickly.
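A compact numpy sketch of the LM iteration on a toy curve-fitting problem, illustrating how the damping parameter μ switches between gradient-descent-like and Gauss-Newton-like steps (illustrative only; not the solver configuration used in the invention):

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, mu=1e-2, iters=50):
    """Minimal LM sketch: large mu behaves like steepest descent, small mu like Gauss-Newton."""
    x = x0.astype(float)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        step = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -J.T @ r)
        if np.sum(residual(x + step) ** 2) < np.sum(r ** 2):
            x, mu = x + step, mu * 0.5      # good step: trust the Gauss-Newton direction more
        else:
            mu *= 2.0                       # bad step: fall back toward gradient descent
    return x

# Toy problem: fit y = a * exp(b * t) to noisy samples, unknowns (a, b)
t = np.linspace(0, 1, 30)
y = 2.0 * np.exp(1.5 * t) + 0.01 * np.random.randn(t.size)
res = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)
print(levenberg_marquardt(res, jac, np.array([1.0, 1.0])))   # approximately [2.0, 1.5]
```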
Disclosure of Invention
When the sensor acquires scan data it is often affected by noise, blur caused by sensor motion, and similar factors, so the resulting 3D scan of the scene is noisy and incomplete, which makes the objects in the scene difficult to classify and model. Existing methods address this in the reconstruction process by replacing the objects in the scene scan with a complete, lightweight CAD model representation. However, those methods retrieve and match using only geometric information and do not exploit RGB information. This invention therefore focuses on how to apply RGB information to the CAD model retrieval and matching process. After model matching, it applies the idea of closed-loop control from classical control theory: the result is fed back by comparing it with the original scan data, and the process is iterated until the required precision is reached, optimizing the overall layout so that the reconstruction precision reaches a higher level.
The invention provides an indoor three-dimensional reconstruction method based on an RGB-D sensor. By introducing RGB information into object recognition and classification and using a graph attention module, it better avoids unsatisfactory object recognition results caused by noise in the raw scan data. Unlike existing reconstruction pipelines, when the final reconstruction is completed it compares key-point differences with the initial scanned scene and reduces the error through iterative optimization. The invention mitigates, with good effect, inaccurate classification and recognition and insufficient reconstruction precision of the whole scene caused by sensor noise and blur produced by sensor motion, and increases the flexibility of the scene (because CAD models are introduced, the scene can be edited freely).
An indoor three-dimensional reconstruction method based on an RGB-D sensor comprises the following steps:
Step 1: acquiring overall indoor 3D scan data with an RGB-D sensor;
Step 2: voxelizing the scene 3D scan data, the real object models in a real object model library, and the CAD models in the ShapeNet data set;
Step 3: applying a graph attention mechanism to reduce the difficulty of identifying objects caused by incomplete scans;
Step 4: combining color information with geometric information to identify the real object model corresponding to each object part in the scanned scene voxel blocks;
Step 5: retrieving the CAD model closest to the corresponding real object model;
Step 6: replacing all objects in the original scene 3D scan data with the corresponding CAD models and performing pose optimization after the replacement;
Step 7: jointly optimizing the functional space and the geometric space of the layout to optimize the overall layout.
The specific method of step 2 is as follows:
the scene 3D scan data are represented with voxels to obtain scanned scene voxel blocks, which are encoded into a signed distance field (SDF). Voxelization uses the combined information of RGB and the depth map, i.e. each voxel retains both geometric and RGB information. The real object models in a real object model library crawled from the web and the CAD models in the ShapeNet data set are voxelized and encoded in the same way. The item categories in the real object model library correspond to those of the ShapeNet data set.
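A rough sketch of such a joint geometric/RGB voxelization, assuming a colored point cloud as input and using an unsigned truncated distance to the nearest occupied voxel as a stand-in for the signed distance field (the exact encoding of the invention is not reproduced here):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def voxelize_colored_points(points, colors, res=64, trunc=3.0):
    """Bin a colored point cloud into a res^3 grid that keeps geometric and RGB information."""
    mins, maxs = points.min(0), points.max(0)
    idx = ((points - mins) / (maxs - mins + 1e-9) * (res - 1)).astype(int)

    occupied = np.zeros((res, res, res), dtype=bool)
    rgb = np.zeros((res, res, res, 3))
    counts = np.zeros((res, res, res))
    for (i, j, k), c in zip(idx, colors):
        occupied[i, j, k] = True
        rgb[i, j, k] += c
        counts[i, j, k] += 1
    rgb[counts > 0] /= counts[counts > 0, None]       # mean color per occupied voxel

    # truncated distance (in voxel units) of every voxel to the nearest occupied one
    dist = np.minimum(distance_transform_edt(~occupied), trunc)
    return dist, rgb

pts, cols = np.random.rand(5000, 3), np.random.rand(5000, 3)
dist, rgb = voxelize_colored_points(pts, cols)
print(dist.shape, rgb.shape)      # (64, 64, 64) (64, 64, 64, 3)
```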
The specific method of step 3 is as follows:
through the graph attention mechanism, the undamaged parts of a scanned object in the scene 3D scan data receive large weights in recognition and classification, while the weights of the damaged parts are correspondingly reduced. The relations between the input and output features of all nodes are represented by a weight matrix that is obtained through training. An object is split into its components, and an incomplete object is recognized from the prior knowledge obtained from its other parts; the features of the undamaged parts are weighted more heavily, and color information is combined to compensate for the negative effect of the object being incomplete.
the specific method of the step 4 is as follows:
matching the cut object part of the voxelized scanning scene with the voxelized real object model through 3DCNN, using cross entropy as loss function, judging the matching probability of the voxels of the object part in the whole scene and the voxels of the real object model in a mode of outputting a heat map, bringing color information (RGB information) into the probability of judging whether the voxels are matched, and comparing the RGB values of input data and model data; the geometry information is processed in parallel with the RGB information, and finally both types of information are combined at each point through a full link layer. Finally, a real object model corresponding to the object part of the scanned scene voxel block is obtained; the probability of the heat map output is the probability that each point corresponds to a true object model voxel after the object model is pixelized, ranging from 0 to 1.
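For illustration, a sketch of a two-branch 3D CNN of this kind (hypothetical channel sizes, not the exact network of the invention) that processes the SDF volume and the RGB volume in parallel, fuses them per voxel through a 1x1x1 convolution acting as a per-point fully connected layer, and outputs a matching heat map trained with a cross-entropy loss:

```python
import torch
import torch.nn as nn

class TwoBranchMatcher(nn.Module):
    """Sketch of parallel geometry / RGB processing with per-voxel fusion."""
    def __init__(self):
        super().__init__()
        self.geo = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 16, 3, padding=1), nn.ReLU())
        self.rgb = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 16, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv3d(32, 1, kernel_size=1)    # per-voxel fully connected fusion

    def forward(self, sdf_vol, rgb_vol):
        feats = torch.cat([self.geo(sdf_vol), self.rgb(rgb_vol)], dim=1)
        return torch.sigmoid(self.fuse(feats))         # heat map: one probability per voxel

scan_sdf = torch.rand(1, 1, 64, 64, 64)                # cropped object region of the scan
scan_rgb = torch.rand(1, 3, 64, 64, 64)
heatmap = TwoBranchMatcher()(scan_sdf, scan_rgb)       # (1, 1, 64, 64, 64), values in [0, 1]
target = (torch.rand_like(heatmap) > 0.5).float()      # placeholder matching labels
loss = nn.functional.binary_cross_entropy(heatmap, target)
```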
The specific method of step 5 is as follows:
the CAD model closest to the real object model obtained in step 4 is retrieved by encoding both into feature vectors, computing the L2 distance, and selecting the pair with the minimum distance; this process needs only geometric information. The cut-out voxelized object part of the scanned scene is then matched with the retrieved CAD model through a 3D CNN to obtain the corresponding heat map.
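A minimal sketch of the L2 retrieval step, assuming the feature vectors have already been produced by some encoder (the 128-dimensional embeddings and library size below are hypothetical):

```python
import numpy as np

def retrieve_nearest_cad(object_feature, cad_features):
    """Return the index of the CAD model whose feature vector is closest in L2 distance."""
    dists = np.linalg.norm(cad_features - object_feature, axis=1)
    return int(np.argmin(dists)), float(dists.min())

query = np.random.rand(128)               # feature of the matched real object model
cad_library = np.random.rand(500, 128)    # precomputed features of the candidate CAD models
best_idx, best_dist = retrieve_nearest_cad(query, cad_library)
print(f"closest CAD model: #{best_idx} (L2 distance {best_dist:.3f})")
```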
The specific method of step 6 is as follows:
the CAD model obtained in step 5 is registered to the position of the original object in the scene 3D scan data. This requires a coordinate transformation: the position of the original object in the scene 3D scan data is converted from the world coordinate system (i.e. the coordinate system of the scanned scene) into the coordinate system of the CAD model, and is represented with a lie algebra element a. In addition, there is a scale relation s between the CAD model and the object in the scene 3D scan data, represented by a 3-dimensional vector (Sx, Sy, Sz), i.e. the scale deviation in each direction. The scale s and the lie algebra element a are optimized jointly and represented with a 4x4 transformation matrix containing the lie algebra element and the three-dimensional scale vector. On this basis an energy-function minimization problem is constructed, namely how to determine the rotation, translation and scale of the CAD model so that it lies closer to the position of the object in the original 3D scanned scene. The Levenberg-Marquardt (LM) algorithm is applied to solve this problem, iterating the lie algebra element and the 3-dimensional scale vector (Sx, Sy, Sz) until a solution that minimizes the energy function is obtained. Finally, a representation in which all object parts in the original scene 3D scan data are replaced by the corresponding CAD models, i.e. the replaced 3D scene representation, is obtained.
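A simplified sketch of this 9-degree-of-freedom alignment: a 6-dimensional rotation/translation parameterization plus a 3-dimensional scale is composed into one transform and optimized with an LM-type solver (scipy's 'lm' method). For brevity the residual below uses explicit point correspondences instead of the heat-map energy described in the text, so it only illustrates the parameterization and the optimization loop:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def transform(params, pts):
    """params = [rx, ry, rz, tx, ty, tz, sx, sy, sz]: axis-angle rotation, translation, per-axis scale."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t, s = params[3:6], params[6:9]
    return (pts * s) @ R.T + t

def align_cad_to_scan(cad_pts, scan_pts):
    """Optimize the 9 DoF so that transformed CAD key points land on the scan key points."""
    init = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)
    residual = lambda p: (transform(p, cad_pts) - scan_pts).ravel()
    return least_squares(residual, init, method="lm").x

# Toy check: recover a known rotation / translation / scale
cad = np.random.rand(100, 3)
truth = np.array([0.2, -0.1, 0.3, 0.5, 0.0, -0.2, 1.2, 0.9, 1.1])
scan = transform(truth, cad)
print(np.round(align_cad_to_scan(cad, scan), 2))   # approximately equal to `truth`
```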
the specific method of step 7 is as follows:
and (3) comparing the 3D scene representation after the replacement with the original scanning, taking the difference of Euclidean distances of key points as an error, if the error value is greater than or equal to a set threshold value, the matching fails, and if the error value is smaller than the set threshold value, repeatedly iterating the scene representation obtained for the first time by the method in the step (6) until the value of the error function is smaller than the set minimum value, so that the optimization of the pose of the CAD model after the replacement is completed.
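A sketch of this step-7 feedback loop, with a hypothetical optimize_pose callback standing in for the step-6 optimization and arbitrary threshold values:

```python
import numpy as np

def keypoint_error(cad_keypoints, scan_keypoints):
    """Mean Euclidean distance between corresponding key points of the placed CAD models and the scan."""
    return float(np.linalg.norm(cad_keypoints - scan_keypoints, axis=1).mean())

def refine_layout(cad_kpts, scan_kpts, optimize_pose, threshold=0.02, max_iters=10):
    """Keep re-running the pose optimization until the key-point error drops below the threshold."""
    for it in range(max_iters):
        err = keypoint_error(cad_kpts, scan_kpts)
        if err < threshold:
            return cad_kpts, err, it
        cad_kpts = optimize_pose(cad_kpts, scan_kpts)
    return cad_kpts, keypoint_error(cad_kpts, scan_kpts), max_iters

# Toy stand-in: each "optimization" moves the CAD key points halfway toward the scan
scan_kpts = np.random.rand(8, 3)
cad_kpts = scan_kpts + 0.2
step = lambda c, s: c + 0.5 * (s - c)
print(refine_layout(cad_kpts, scan_kpts, step))
```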
The invention has the following beneficial effects:
Innovation point 1: RGB information, i.e. color-feature retrieval, is innovatively integrated into the CAD model retrieval process for objects. RGB data are generally more discriminative than depth or geometric information alone, so compared with the traditional approach of using only geometric information the matching effect is clearly improved: an object model is matched through the combination of RGB and geometric information, and the CAD model is then matched through the object model.
Innovation point 2: after the whole process is finished, the result is compared with the original scan, and an error function (the degree of deviation of the regions emphasized by the attention module on the object parts) is set to monitor the reconstruction precision; if the required precision is not reached, the iteration is repeated until it is.
Innovation point 3: the invention introduces a graph attention mechanism; by training the node weights, object recognition and classification can be completed without scan data covering the whole object. When parts of an object are missing because of sensor noise and similar causes, increasing the weight of the parts that are not missing raises the probability of retrieving a complete model, which is equivalent to performing 'completion'.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, an indoor three-dimensional reconstruction method based on an RGB-D sensor includes the following steps:
Step 1: acquire overall indoor 3D scan data with an RGB-D sensor; note that large-amplitude movement of the device should be avoided as far as possible.
Step 2: the scene 3D scan data are represented with voxels to obtain scanned scene voxel blocks, which are encoded into a signed distance field (SDF). Voxelization uses the combined information of RGB and the depth map, i.e. each voxel retains both geometric and RGB information. The real object models in a real object model library crawled from the web and the CAD models in the ShapeNet data set are voxelized and encoded in the same way, so the problem can be treated by analogy with classical two-dimensional image processing and each point of a three-dimensional object can be matched. The item categories in the real object model library correspond to those of the ShapeNet data set.
Step 3: through the graph attention mechanism, the undamaged parts of a scanned object in the scene 3D scan data receive large weights in recognition and classification, while the weights of the damaged parts are correspondingly reduced. The relations between the input and output features of all nodes are represented by a weight matrix obtained through training: training uses samples of scanned objects with missing parts and increases the penalty on misclassification; features are extracted (for example with PartNet) from the parts of each object, such as four legs and a flat top, and the parts are combined into an object through a fully connected layer, so that an object composed of, say, three legs and a flat top is still accepted as a stool within a certain error. This step mainly extracts the semantic information of the object and improves the efficiency of the subsequent real-object-model retrieval: because the semantic information is available, the search does not have to traverse the whole model library but only the model set of the corresponding category, so the network can infer that an object with, for example, four legs but no visible top is most likely a stool. An object is split into components (a chair, for example, into a backrest, a seat and four legs); if only three legs are visible and the back is incomplete, the chair can still be recognized from the prior knowledge obtained from the other parts, and the features of the undamaged parts are weighted more heavily, so that, combined with color information, the negative effect of the object being incomplete is compensated. For instance, if one third of a table is missing, the remaining two thirds are enough to find the corresponding object in the real object model library.
Step 4: an object part is cut out of the voxelized scanned scene (during training, a point at the center of an object in the scanned scene is selected and the object is cut out in a 64x64x64 region around that point) and matched with the voxelized real object model through a 3D CNN (analogous to object recognition in a 2-dimensional image), using cross entropy as the loss function and outputting a heat map. The probability output by the heat map is the probability that each point corresponds to a voxel of the voxelized real object model, ranging from 0 to 1. Color information (i.e. RGB information) is incorporated into the matching probability by comparing the RGB values of the input data and the model data (for example, two tables may be geometrically similar but are not completely identical, so the color feature helps to identify more accurately exactly which kind of table the object is). The geometric information and the RGB information are processed in parallel, and the two kinds of information are finally combined at each point through a fully connected layer. In the end, the real object model corresponding to the object part of the scanned scene voxel block is obtained.
Step 5: the CAD model closest to the real object model obtained in step 4 is retrieved by encoding both into feature vectors, computing the L2 distance, and selecting the pair with the minimum distance; this process needs only geometric information. The cut-out voxelized object part of the scanned scene is then matched with the retrieved CAD model through a 3D CNN to obtain the corresponding heat map. The significance of this is that directly matching the CAD model is more accurate than matching the real object model first and then matching the CAD model indirectly, and the subsequent pose-optimization step needs the relation, i.e. the heat map, between the voxel blocks of the voxelized object in the scanned scene and the voxelized CAD model, so that their corresponding positions can be optimized to adjust the pose and size of the whole CAD model.
Step 6: since the model library does not necessarily contain an exactly matching CAD model, the CAD model obtained in step 5 must be registered to the position of the original object in the scene 3D scan data. This requires a coordinate transformation: the position of the original object in the scene 3D scan data is converted from the world coordinate system (the coordinate system of the scanned scene) to the coordinate system of the CAD model, similar to the transformation between the camera coordinate system and world coordinates (the acquired initial data set carries the corresponding pose labels), i.e. a rigid transform of the form T = [R t; 0 1], where R is a rotation matrix and t is a translation; this transform is represented with a lie algebra element a. A point of the scanned object is denoted as (p_j, H_j), where p_j is a point of the object in the scanned scene (a voxel block; a voxel block in three-dimensional space corresponds to a point of a two-dimensional image) and H_j is a probability between 0 and 1 indicating whether that voxel block of the object in the scanned scene is also a voxel block on the CAD model (the input is a region cut out of the scanned scene with the object at its center, so it also contains voxel blocks belonging to other parts of the scene). In addition, the scale relation s between the CAD model and the object in the scene 3D scan data is expressed as a 3-dimensional vector (Sx, Sy, Sz), i.e. the scale deviation in each direction. The scale s and the lie algebra element a are optimized jointly and represented by a 4x4 transformation matrix T_mi(a, s) containing the lie algebra element and the three-dimensional scale vector, where m_i denotes the i-th CAD model; the mapping (a, s) -> T_mi(a, s) converts the six-dimensional lie algebra element (three-dimensional rotation, three-dimensional translation) and the three-dimensional scale vector into a 4x4 matrix. Applying c_vox = T_worldvox · T_mi(a, s) · p_j (where T_worldvox is the transformation from the world coordinate system into the coordinate system of the CAD model), a voxel point p_j in the scanned scene is mapped to a point coordinate in the voxelized CAD model. On this basis an energy-function minimization problem is constructed, f(a, s) = Σ_j (1 - H_j(c_vox)), i.e. how to determine the rotation, translation and scale of the CAD model so that it lies closer to the position of the object in the original 3D scanned scene: H_j(c_vox) must be made as large as possible so that f becomes as small as possible, which means that, after transforming with a and s, the voxel coordinates in the CAD model coordinate system lie as close as possible to the points on the CAD model that the heat map previously determined to correspond to the objects in the scanned scene. Because the closeness of the corresponding points decides whether the pose transformation is complete, the Levenberg-Marquardt (LM) algorithm is applied to solve this problem, iterating the lie algebra element a and the 3-dimensional scale vector (Sx, Sy, Sz), i.e. (a*, s*) = argmin f(a, s), to obtain the solution that minimizes the energy function (these two quantities are the key to pose optimization). Finally, a representation in which all object parts in the original scene 3D scan data are replaced by the corresponding CAD models, i.e. the replaced 3D scene representation, is obtained. Because objects in the scene may be scanned incompletely due to the sensor, replacing them with CAD models yields a complete, clean and lightweight representation, and because CAD models can be edited freely, the scene can be represented more flexibly.
Step 7: the replaced 3D scene representation is compared with the original scan, and the difference in the Euclidean distances of key points (such as object or edge corner points) is taken as the error (essentially the same criterion as in the previous step, i.e. judging from a visual standpoint whether the result after the previous step's algorithm has converged can be accepted). If the error is greater than or equal to a set threshold, the matching fails; if it is smaller than the set threshold, the scene representation first obtained by the method of step 6 is iterated repeatedly until the value of the error function is smaller than the set minimum, at which point the pose optimization of the replaced CAD models is considered complete.

Claims (7)

1. An indoor three-dimensional reconstruction method based on an RGB-D sensor is characterized by comprising the following steps:
step 1: acquiring overall indoor 3D scan data with an RGB-D sensor;
step 2: voxelizing the scene 3D scan data, the real object models in a real object model library, and the CAD models in the ShapeNet data set;
step 3: applying a graph attention mechanism to reduce the difficulty of identifying objects caused by incomplete scans;
step 4: combining color information with geometric information to identify the real object model corresponding to each object part in the scanned scene voxel blocks;
step 5: retrieving the CAD model closest to the corresponding real object model;
step 6: replacing all objects in the original scene 3D scan data with the corresponding CAD models and performing pose optimization after the replacement;
step 7: jointly optimizing the functional space and the geometric space of the layout to optimize the overall layout.
2. The indoor three-dimensional reconstruction method based on the RGB-D sensor as claimed in claim 1, wherein the specific method of step 2 is as follows:
the scene 3D scan data are represented with voxels to obtain scanned scene voxel blocks, which are encoded into a signed distance field (SDF); voxelization uses the combined information of RGB and the depth map, i.e. each voxel retains both geometric and RGB information; the real object models in a real object model library crawled from the web and the CAD models in the ShapeNet data set are voxelized and encoded in the same way; the item categories in the real object model library correspond to those of the ShapeNet data set.
3. The indoor three-dimensional reconstruction method based on the RGB-D sensor as claimed in claim 2, wherein the specific method of step 3 is as follows:
through the graph attention mechanism, the undamaged parts of a scanned object in the scene 3D scan data receive large weights in recognition and classification, while the weights of the damaged parts are correspondingly reduced; the relations between the input and output features of all nodes are represented by a weight matrix obtained through training; an object is split into its components, an incomplete object is recognized from the prior knowledge obtained from its other parts, the features of the undamaged parts are weighted more heavily, and color information is combined to compensate for the negative effect of the object being incomplete.
4. The indoor three-dimensional reconstruction method based on the RGB-D sensor as claimed in claim 3, wherein the specific method of step 4 is as follows:
the object part cut out of the voxelized scanned scene is matched with a voxelized real object model through a 3D CNN, using cross entropy as the loss function, and the matching probability between the voxels of the object part in the scene and the voxels of the real object model is judged by outputting a heat map; color information (RGB information) is incorporated into this matching probability by comparing the RGB values of the input data and the model data; the geometric information and the RGB information are processed in parallel, and the two kinds of information are finally combined at each point through a fully connected layer; in the end, the real object model corresponding to the object part of the scanned scene voxel block is obtained; the probability output by the heat map is the probability that each point corresponds to a voxel of the voxelized real object model, ranging from 0 to 1.
5. The indoor three-dimensional reconstruction method based on the RGB-D sensor as claimed in claim 4, wherein the specific method of step 5 is as follows:
the CAD model closest to the real object model obtained in step 4 is retrieved by encoding both into feature vectors, computing the L2 distance, and selecting the pair with the minimum distance, which requires only geometric information; the cut-out voxelized object part of the scanned scene is then matched with the retrieved CAD model through a 3D CNN to obtain the corresponding heat map.
6. The RGB-D sensor-based indoor three-dimensional reconstruction method of claim 5, wherein the specific method of step 6 is as follows:
the CAD model obtained in step 5 is registered to the position of the original object in the scene 3D scan data, which requires a coordinate transformation: the position of the original object in the scene 3D scan data is converted from the world coordinate system (the coordinate system of the scanned scene) into the coordinate system of the CAD model and represented with a lie algebra element a; in addition, there is a scale relation s between the CAD model and the object in the scene 3D scan data, represented by a 3-dimensional vector (Sx, Sy, Sz), i.e. the scale deviation in each direction; the scale s and the lie algebra element a are optimized jointly and represented with a 4x4 transformation matrix containing the lie algebra element and the three-dimensional scale vector; on this basis an energy-function minimization problem is constructed, namely how to determine the rotation, translation and scale of the CAD model so that it lies closer to the position of the object in the original 3D scanned scene; the Levenberg-Marquardt (LM) algorithm is applied to solve this problem, iterating the lie algebra element and the 3-dimensional scale vector (Sx, Sy, Sz) until a solution that minimizes the energy function is obtained; finally, a representation in which all object parts in the original scene 3D scan data are replaced by the corresponding CAD models, i.e. the replaced 3D scene representation, is obtained.
7. The RGB-D sensor-based indoor three-dimensional reconstruction method according to claim 6, wherein the specific method of step 7 is as follows: the replaced 3D scene representation is compared with the original scan, and the difference in the Euclidean distances of the key points is taken as the error; if the error is greater than or equal to a set threshold, the matching fails; if it is smaller than the set threshold, the scene representation first obtained by the method of step 6 is iterated repeatedly until the value of the error function is smaller than the set minimum, which completes the pose optimization of the replaced CAD models.
CN202110441618.1A 2021-04-23 2021-04-23 Indoor three-dimensional reconstruction method based on RGB-D sensor Withdrawn CN113160411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441618.1A CN113160411A (en) 2021-04-23 2021-04-23 Indoor three-dimensional reconstruction method based on RGB-D sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441618.1A CN113160411A (en) 2021-04-23 2021-04-23 Indoor three-dimensional reconstruction method based on RGB-D sensor

Publications (1)

Publication Number Publication Date
CN113160411A true CN113160411A (en) 2021-07-23

Family

ID=76869824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441618.1A Withdrawn CN113160411A (en) 2021-04-23 2021-04-23 Indoor three-dimensional reconstruction method based on RGB-D sensor

Country Status (1)

Country Link
CN (1) CN113160411A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023035548A1 (en) * 2021-09-09 2023-03-16 上海商汤智能科技有限公司 Information management method for target environment and related augmented reality display method, electronic device, storage medium, computer program, and computer program product
CN114119930A (en) * 2022-01-27 2022-03-01 广州中望龙腾软件股份有限公司 Three-dimensional model correction method and device based on deep learning and storage medium
CN117710469A (en) * 2024-02-06 2024-03-15 四川大学 Online dense reconstruction method and system based on RGB-D sensor
CN117710469B (en) * 2024-02-06 2024-04-12 四川大学 Online dense reconstruction method and system based on RGB-D sensor

Similar Documents

Publication Publication Date Title
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Labbé et al. Megapose: 6d pose estimation of novel objects via render & compare
CN113160411A (en) Indoor three-dimensional reconstruction method based on RGB-D sensor
US8429174B2 (en) Methods, systems, and data structures for performing searches on three dimensional objects
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
Lei et al. Cadex: Learning canonical deformation coordinate space for dynamic surface representation via neural homeomorphism
CN113822993B (en) Digital twinning method and system based on 3D model matching
EP1550082A2 (en) Three dimensional face recognition
JPH10320588A (en) Picture processor and picture processing method
Lin et al. Robotic grasping with multi-view image acquisition and model-based pose estimation
CN112784736A (en) Multi-mode feature fusion character interaction behavior recognition method
Samavati et al. Deep learning-based 3D reconstruction: a survey
Dani et al. 3DPoseLite: A compact 3d pose estimation using node embeddings
Maheshwari et al. Mugl: Large scale multi person conditional action generation with locomotion
CN117522990A (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Stoiber et al. Fusing Visual Appearance and Geometry for Multi-modality 6DoF Object Tracking
CN116843753A (en) Robust 6D pose estimation method based on bidirectional matching and global attention network
Liebelt et al. Robust aam fitting by fusion of images and disparity data
CN114219920B (en) Method and device for constructing three-dimensional face model, storage medium and terminal
CN112365456B (en) Transformer substation equipment classification method based on three-dimensional point cloud data
Takeuchi et al. Automatic hanging point learning from random shape generation and physical function validation
Liao et al. Advances in 3D Generation: A Survey
Kao et al. Object pose estimation and feature extraction based on PVNet
CN112906432A (en) Error detection and correction method applied to human face key point positioning task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210723