CN117274756A - Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration - Google Patents

Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration

Info

Publication number
CN117274756A
Authority
CN
China
Prior art keywords
point cloud
dimensional
dimensional image
feature
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311103838.9A
Other languages
Chinese (zh)
Inventor
郑文杰
杨祎
张峰达
李壮壮
辜超
朱文兵
林颖
刘萌
崔其会
李勇
乔木
任敬国
孙艺玮
吕俊涛
邢海文
李程启
李笋
李文博
顾朝亮
李龙龙
师伟
李�杰
朱庆东
张丕沛
伊峰
高志新
许伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority to CN202311103838.9A
Publication of CN117274756A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fusion method of a two-dimensional image and a point cloud based on multi-dimensional feature registration, which comprises the following steps: preprocessing the two-dimensional image and the three-dimensional point cloud data; performing feature extraction with a convolutional neural network; fusing the features through convolutional and fully connected layers to obtain a shared feature representation; constructing a deep learning model comprising an encoder and a decoder, wherein the encoder mainly consists of convolution layers and pooling layers, the convolution layers contain convolution kernels of several different scales, a fully connected layer generates the output features, and the decoder predicts a semantic class for each pixel; training and optimizing the model to obtain a prediction model; and inputting the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data into the prediction model to obtain fusion information of the object to be processed. The invention improves fusion precision and meets high-precision fusion requirements. The invention also provides a fusion device, which comprises a preprocessing module, a registration projection module, a feature extraction module, a feature fusion prediction module, and an output module.

Description

Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration
Technical Field
The invention relates to the technical field of point cloud fusion. More particularly, the invention relates to a fusion method and a fusion device for a two-dimensional image and a point cloud based on multi-dimensional feature registration.
Background
Although three-dimensional point cloud data carries real three-dimensional coordinates, it lacks real texture information, and without prior knowledge the human eye has a very limited ability to interpret a three-dimensional point cloud. A digital image has rich texture information and an imaging effect similar to what the human eye sees, matching human cognition of the real world, but its drawback is that two dimensions cannot directly represent the three-dimensional world. If the three-dimensional point cloud and the two-dimensional digital image can be fused, comprehensive information about the target surface can be obtained, including the real three-dimensional coordinates of the measured object's surface and rich texture information. The current fusion techniques are as follows:
1. the three-dimensional point cloud is aligned with the two-dimensional image: the three-dimensional point cloud and the two-dimensional image are typically represented by different coordinate systems, so they first need to be aligned into shared coordinates. The conversion of image coordinates into point cloud coordinates may be accomplished by camera calibration and perspective geometry techniques, such as using a camera projection model.
2. Three-dimensional point cloud projection: projecting the three-dimensional point cloud onto the two-dimensional image plane may be accomplished by projecting three-dimensional points in the three-dimensional point cloud onto corresponding two-dimensional image locations. The projection calculation may be performed using camera parameters and geometric relationships. After projection, attributes (such as color and texture) of the three-dimensional point cloud can be corresponding to the image pixels, so that fusion of the three-dimensional point cloud and the two-dimensional image is realized.
3. Feature extraction of a three-dimensional point cloud and a two-dimensional image: features can be extracted from the three-dimensional point cloud and the two-dimensional image respectively, and fusion can be performed by matching the features. For example, conventional feature extraction algorithms (e.g., SIFT, SURF, etc.) are used in the image to detect key points and descriptors, and corresponding projection techniques are then used to map these features onto the corresponding three-dimensional points in the point cloud, thereby establishing correspondence between the image and the point cloud.
4. Deep learning fusion of a three-dimensional point cloud and a two-dimensional image: deep learning methods have made great progress in both two-dimensional image processing and three-dimensional point cloud analysis. A deep learning model can process the two-dimensional image and the three-dimensional point cloud data simultaneously and thereby fuse them. For example, a neural network model may be designed that accepts two-dimensional images and three-dimensional point clouds as input and jointly learns the relationship between two-dimensional and three-dimensional features to solve various tasks, such as target detection and scene understanding.
5. Sensor fusion: in some cases, multiple sensors (e.g., cameras, lidars, etc.) may be used to acquire two-dimensional images and three-dimensional point cloud data simultaneously and fuse them, for example by using a sensor fusion algorithm such as an Extended Kalman Filter (EKF) or a Particle Filter (PF), to obtain more accurate and complete scene information.
However, the current two-dimensional image and three-dimensional point cloud fusion technology suffers from the following defects:
1. data inconsistency: the two-dimensional image and the three-dimensional point cloud are different in data representation modes, the two-dimensional image is represented in units of pixels, and the three-dimensional point cloud is represented in coordinates and attribute information of points. This creates data inconsistencies that require conversion and alignment of the data formats for efficient fusion;
2. data registration problem: fusing a two-dimensional image and a three-dimensional point cloud requires registration of the data, i.e., projecting the two-dimensional image into a three-dimensional space or mapping the three-dimensional point cloud back to the two-dimensional image plane. However, the registration process may introduce errors, resulting in inaccuracy in the fusion result. Solving the problem requires the adoption of an accurate sensor calibration and registration algorithm;
3. Data sparsity and noise: both two-dimensional images and three-dimensional point clouds suffer from data sparsity and noise. Two-dimensional images may have problems of occlusion, illumination variation, texture blurring, and the like, and three-dimensional point clouds may contain missing points and noise points. These problems affect the quality and accuracy of the fusion result and require data processing and noise filtering operations;
4. Computational complexity: the processing of two-dimensional images and three-dimensional point clouds has different computational complexity. Two-dimensional images are generally processed with convolutional-neural-network-based methods, whereas point clouds are typically handled with point-based or voxel-based methods whose computational characteristics differ. Therefore, effectively fusing the two-dimensional image and the three-dimensional point cloud requires solving the problem of mismatched computational complexity;
5. Cross-modal information fusion: fusing the two-dimensional image and the three-dimensional point cloud requires considering cross-modal information fusion, i.e., how to effectively fuse the color, texture, and shape information in the two-dimensional image with the geometric structure and attribute information in the three-dimensional point cloud.
In summary, each of the current fusion methods has limited precision when used alone and cannot meet the requirements of high-precision tasks.
Disclosure of Invention
It is an object of the present invention to solve at least the above problems and to provide at least the advantages to be described later.
To achieve these objects and other advantages and in accordance with the purpose of the invention, a fusion method of a two-dimensional image and a point cloud based on multi-dimensional feature registration is provided, comprising the steps of:
s1, acquiring a two-dimensional image of an object and corresponding three-dimensional point cloud data, and respectively preprocessing;
s2, registering: converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system, and establishing spatial position correlation between the two-dimensional image and the three-dimensional point cloud;
s3, point cloud projection: projecting the registered three-dimensional point cloud data into a corresponding two-dimensional image space;
s4, feature extraction: respectively extracting features of the two-dimensional image and the three-dimensional point cloud data with the mutual spatial position association relation by adopting a convolutional neural network to obtain corresponding feature representation;
S5, fusing the feature representations obtained in step S4 through convolutional and fully connected layers to obtain a shared feature representation;
S6, constructing a deep learning model: the deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps;
after multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features;
the decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel;
S7, model training and optimization: training the deep learning model constructed in step S6 with a labeled training data set, performing gradient descent optimization according to a loss function, and obtaining a prediction model after iterative training and verification;
s8, processing the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data through the steps S1-S5, and inputting the processed two-dimensional image and the processed three-dimensional point cloud data into the prediction model in the step S7 to obtain fusion information of the object to be processed.
Preferably, the preprocessing in step S1 specifically comprises: preprocessing the two-dimensional image, including scaling to a target size, normalizing the pixel range of the two-dimensional image to 0-1, and cropping the two-dimensional image to obtain a region of interest;
and denoising the three-dimensional point cloud: selecting a filter type and setting filter parameters for filtering, then selecting a resampling method and setting resampling parameters to process the filtered three-dimensional point cloud data.
Preferably, an interpolation algorithm is used to handle differences between image pixels during scaling.
Preferably, the normalization method is to divide each pixel value by 255 so that the pixel range is mapped from 0-255 to between 0 and 1.
Preferably, the cropping criteria are defined in terms of the position of the image, the pixel values or the bounding box.
Preferably, the filter type includes one of an average filter, a median filter, and a gaussian filter.
Preferably, the resampling method comprises one of voxel gridding, nearest neighbor sampling, and surface-based sampling.
Preferably, the pooling layer adopts a maximum pooling method, and selects the maximum value of a certain area in the feature map as the feature after downsampling.
Preferably, the activation function is a softmax function, generating a probability distribution for each category.
The invention also provides a device for the fusion method of the two-dimensional image and the point cloud based on multi-dimensional feature registration, comprising:
the preprocessing module is used for acquiring a two-dimensional image of the object and corresponding three-dimensional point cloud data and respectively preprocessing the two-dimensional image and the corresponding three-dimensional point cloud data;
the registration projection module is used for converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system and establishing spatial position association between the two-dimensional image and the three-dimensional point cloud; and the three-dimensional point cloud data after registration are projected into a corresponding two-dimensional image space;
the feature extraction module is used for respectively carrying out feature extraction on the two-dimensional image and the three-dimensional point cloud data with the mutual spatial position association relation by adopting a convolutional neural network to obtain corresponding feature representation;
the feature fusion prediction module is used for constructing a deep learning model; the deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps;
after multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features;
the decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel;
the model training and optimization module is used for training the deep learning model constructed in step S6 with a labeled training data set, performing gradient descent optimization according to a loss function, and obtaining a prediction model after iterative training and verification;
the output module is used for processing the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data through steps S1 to S5 and inputting the result into the prediction model of step S7 to obtain fusion information of the object to be processed.
The invention at least includes the following beneficial effects: based on deep learning over point clouds and images, a network fusion model is introduced and an end-to-end deep learning network is designed that takes the two-dimensional image and the three-dimensional point cloud data as input and combines them through network-layer fusion operations, so as to obtain an accurate mapping relationship between the visible-light picture and the three-dimensional point cloud and a more accurate and robust fusion result of the two-dimensional image and the three-dimensional point cloud.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a block diagram of a fusion method of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings to enable those skilled in the art to practice the invention by referring to the description.
As shown in fig. 1, the invention provides a fusion method of a two-dimensional image and a point cloud based on multi-dimensional feature registration, which comprises the following steps:
S1, acquiring a two-dimensional image of an object and corresponding three-dimensional point cloud data, and preprocessing them respectively. The preprocessing specifically comprises: preprocessing the two-dimensional image, including scaling to a target size, normalizing the pixel range of the two-dimensional image to 0-1, and cropping the two-dimensional image to obtain a region of interest.
Scaling adjusts the image to a different size. The scaling factor may be set as desired: typically, a factor of less than 1 is used to shrink the image and a factor of greater than 1 to enlarge it, with the specific factor depending on the original size of the image and the target size. For example, if the original image is 500×500 pixels, it can be reduced to half the original size, i.e., 250×250 pixels. Similarly, to enlarge the image to twice the original size, it can be adjusted to 1000×1000 pixels. Interpolation algorithms (such as nearest-neighbor, bilinear, or bicubic interpolation) may be used to handle differences between pixels during scaling.
Normalization maps the pixel values of an image to a specific range (typically 0-1). The purpose is to eliminate differences in pixel values between images so that they have a similar scale and are easier to compare and process. One normalization method is to divide each pixel value by 255, mapping the pixel range from 0-255 to between 0 and 1.
Cropping refers to selecting a region of interest from an image according to a particular criterion, retaining it, and removing the other regions. The cropping criteria can be defined according to different requirements and applications, for example based on the location in the image, the pixel values, or a bounding box. In this way an image of a specific area can be obtained, facilitating subsequent processing or analysis.
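As an illustration of this preprocessing step, the following Python sketch (assuming the OpenCV and NumPy libraries; the target size and crop box are illustrative values, not parameters prescribed by the patent) performs scaling, normalization to 0-1, and cropping:

```python
# Minimal 2D-image preprocessing sketch: scaling, normalization, cropping.
import cv2
import numpy as np

def preprocess_image(path, target_size=(250, 250), crop_box=None):
    img = cv2.imread(path)                                    # H x W x 3, uint8 BGR
    # Scale to the target size; bilinear interpolation handles in-between pixels
    img = cv2.resize(img, target_size, interpolation=cv2.INTER_LINEAR)
    # Normalize the 0-255 pixel range to 0-1 by dividing by 255
    img = img.astype(np.float32) / 255.0
    # Crop a region of interest, here given as an (x, y, w, h) box
    if crop_box is not None:
        x, y, w, h = crop_box
        img = img[y:y + h, x:x + w]
    return img

# Example: shrink an image to 250x250 and keep a 100x100 region of interest
# roi = preprocess_image("substation.jpg", (250, 250), crop_box=(50, 50, 100, 100))
```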
The three-dimensional point cloud is denoised by selecting a filter type and setting filter parameters for filtering; a resampling method is then selected and resampling parameters are set to process the filtered three-dimensional point cloud data.
The general steps of filtering include defining a filter, i.e., selecting an appropriate filter type and parameters. Common filter types include mean filters, median filters, and Gaussian filters; the choice of filter depends on the specific application requirements and the data characteristics.
Three-dimensional point cloud resampling resamples the three-dimensional point cloud data to adjust the density or resolution of the point cloud or to remove unwanted noise points. New three-dimensional point cloud data is obtained after resampling.
Different algorithms and strategies can be adopted in the three-dimensional point cloud resampling process. Resampling methods include voxel gridding, nearest-neighbor sampling, surface-based sampling, and the like. The data output after sampling is still a three-dimensional point cloud, but its characteristics differ according to the resampling target and algorithm. For example, if resampling is used to adjust the density of the point cloud, the output cloud may contain fewer or more points, and its density may be spatially uniform or non-uniform. If resampling is used to remove noise, the output cloud may contain fewer noise points or have an adjusted shape that better represents the real object.
The resampled three-dimensional point cloud data can be used for further point cloud processing and applications such as target recognition, point cloud registration, and modeling. The output data is still a three-dimensional point cloud, but its structure and characteristics vary with the resampling method and the parameter settings used in the process.
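A minimal sketch of the filtering and resampling stage, assuming the Open3D library; statistical outlier removal stands in for the noise filter and voxel-grid downsampling for the resampling method, and all parameter values are illustrative:

```python
# Point-cloud denoising and resampling sketch (Open3D assumed).
import open3d as o3d

def preprocess_point_cloud(path, voxel_size=0.05, nb_neighbors=20, std_ratio=2.0):
    pcd = o3d.io.read_point_cloud(path)
    # Statistical outlier removal plays the role of the noise filter
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=nb_neighbors,
                                            std_ratio=std_ratio)
    # Voxel-grid resampling adjusts density and yields a spatially uniform cloud
    return pcd.voxel_down_sample(voxel_size=voxel_size)

# pcd = preprocess_point_cloud("substation.pcd")
```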
S2, registering: converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system, and establishing spatial position correlation between the two-dimensional image and the three-dimensional point cloud;
S3, point cloud projection: projecting the registered three-dimensional point cloud data into the corresponding two-dimensional image space.
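A minimal sketch of this projection step using a pinhole camera model; the intrinsic matrix K and the extrinsic transform T are assumed to come from the calibration and registration of step S2:

```python
# Pinhole-projection sketch: map registered 3D points into image space.
import numpy as np

def project_points(points_xyz, K, T):
    """points_xyz: (N, 3) points in the point-cloud frame; K: 3x3 intrinsics;
    T: 4x4 extrinsic transform. Returns (M, 2) pixel coordinates."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # (N, 4)
    cam = (T @ homo.T).T[:, :3]            # transform into the camera frame
    cam = cam[cam[:, 2] > 0]               # keep only points in front of the camera
    uv = (K @ cam.T).T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]          # perspective divide -> pixel (u, v)
```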
S4, feature extraction: extracting features separately from the two-dimensional image and the three-dimensional point cloud data that have the mutual spatial position association, using a convolutional neural network, to obtain corresponding feature representations.
the specific steps of feature extraction include:
data preparation: first, an original dataset needs to be prepared, which may be in the form of image, text, in, etc. Ensuring the quality and applicability of the data set.
Feature selection: before feature extraction, feature selection may be performed. Feature selection refers to selecting or screening out the most distinguishing and important features from the original dataset. This reduces the dimensionality of the features, improves computational efficiency, and avoids the influence of redundant or noisy features.
Feature extraction method: features are extracted separately from the two-dimensional image and the three-dimensional point cloud data that have the mutual spatial position association, using a convolutional neural network, to obtain corresponding feature representations; algorithms such as SIFT, SURF, or FAST may also be used to extract features. Feature representation: the extracted features need to be represented appropriately for subsequent processing and analysis; vectors are used to represent the features, ensuring their consistency and comparability.
Feature evaluation and selection: after the features are extracted, they are evaluated and selected to understand their quality and their contribution to the task. The importance of a feature is assessed using indicators such as relevance, information gain, and variance.
Feature preprocessing: necessary feature preprocessing operations are carried out according to the task requirements and the properties of the features. For example, normalization and dimensionality reduction can further optimize the feature representation and the processing effect.
The following types of features are extracted. Structural features: structural information for expressing and representing the data, such as geometric shape, topology and structure, and hierarchical relationships; in image processing, structural features may be edges, corner points, lines, and so on. Statistical features: features extracted from the statistical properties of the data, including mean, variance, standard deviation, maximum and minimum, histogram, and the like; statistical features are often used to describe the distribution and overall characteristics of the data. Visual features: features extracted from image and video data, including color features, texture features, shape features, SIFT (scale-invariant feature transform) descriptors, HOG (histogram of oriented gradients), and the like.
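For illustration, the following PyTorch sketch shows one possible pair of feature-extraction branches: a small CNN for the image and a PointNet-style shared MLP for the point cloud. The layer sizes are assumptions, not values fixed by the patent:

```python
# Sketch of the two feature-extraction branches in PyTorch.
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Small CNN producing a spatial feature map from the 2D image."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, out_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, img):                 # img: (B, 3, H, W)
        return self.net(img)                # (B, out_dim, H/4, W/4)

class PointBranch(nn.Module):
    """PointNet-style shared MLP producing a global point-cloud feature."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, pts):                 # pts: (B, N, 3)
        feats = self.mlp(pts)               # per-point features (B, N, out_dim)
        return feats.max(dim=1).values      # global feature via max pooling
```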
S5, fusing the feature representations from step S4 through convolutional and fully connected layers to obtain a shared feature representation. After the serial fusion connection, the following can be obtained: data/feature fusion results, enhanced feature representations, fused decision results, and fused information.
The fusion through convolutional and fully connected layers specifically comprises the following steps:
input: first, a data set of a two-dimensional image and a three-dimensional point cloud (the results of the feature extraction of the former two) is prepared. The image is represented as a matrix and the three-dimensional point cloud is represented as three-dimensional coordinate information and possibly additional attributes.
Feature extraction and preprocessing: different feature extraction and preprocessing methods are applied to the two-dimensional image and the three-dimensional point cloud. Features are extracted from the image with a convolutional neural network (CNN) and from the three-dimensional point cloud with a point-cloud-specific algorithm (e.g., PointNet++). This captures local and global information of the image and the three-dimensional point cloud and generates the corresponding feature representations.
Feature fusion: after feature extraction is completed, a feature fusion operation may be performed by concatenating or superimposing the features of the image and the three-dimensional point cloud. The specific operation is either to concatenate the image and point cloud features into a larger feature vector or to merge them into a shared feature representation through convolutional and fully connected layers; the latter is chosen here.
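A minimal PyTorch sketch of this fusion step, assuming the two branches have been reduced to global feature vectors (the image feature map is pooled here); the dimensions are illustrative:

```python
# Fusion sketch: concatenate the two feature representations and merge them
# through a fully connected layer into a shared representation.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, img_dim=128, pt_dim=128, shared_dim=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(img_dim + pt_dim, shared_dim), nn.ReLU())

    def forward(self, img_feat_map, pt_feat):        # (B, C, H, W) and (B, pt_dim)
        img_feat = img_feat_map.mean(dim=(2, 3))      # global average pool -> (B, C)
        return self.fc(torch.cat([img_feat, pt_feat], dim=1))  # (B, shared_dim)
```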
S6, constructing a deep learning model: next, a deep learning model is constructed to learn the fused feature representations and their relationships. In the deep learning model, convolution kernels extract spatial and channel-level features, and fully connected layers map and combine the features. The convolution layer performs a convolution operation on the input through a sliding window to extract local features; the fully connected layer performs matrix multiplication and a nonlinear transformation on the input features and the weights to generate the output features.
The deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps. Convolution kernels of different scales are introduced into the convolutional neural network to handle receptive fields of different sizes, for example convolution kernels of different sizes or dilated (atrous) convolutions with multiple dilation rates. This captures details and context information at multiple scales and fuses them together to improve the expressiveness of the features.
After multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features.
In this application, the basic steps of the encoder include:
input layer: an input of a shared feature representation is accepted.
Convolution layer: the convolutional layer is the core part of the CNN, and consists of a plurality of convolutional kernels. Each convolution kernel performs convolution operation with the input data to extract local features. The convolution operation calculates on the input data through a sliding window, generating a feature map.
Activation function: after the convolutional layer, the feature map is activated using a nonlinear activation function, such as ReLU (Rectified Linear Unit), to introduce nonlinear features.
Pooling layer: the pooling layer reduces the number of parameters and the amount of computation by downsampling the feature map while retaining important features. With max pooling, the maximum value of a region in the feature map is selected as the downsampled feature.
Fully connected layer: after a series of convolution and pooling layers, the extracted features are classified or regressed using fully connected layers. The fully connected layer flattens the features into a vector and generates the final result through matrix multiplication and activation functions.
Output layer: for classification tasks, a probability distribution for each class is generated using a softmax function.
The decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel.
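A minimal PyTorch sketch of such an encoder-decoder, assuming the shared representation is kept spatial (for example, point features projected into the image plane and concatenated channel-wise with image features); the fully connected stage is modelled as a 1×1 convolution so that the spatial layout needed by the per-pixel decoder is preserved, which is an implementation choice rather than something mandated by the text:

```python
# Encoder-decoder sketch: multi-scale kernels, pooling, upsampling, per-pixel classes.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 3x3 and 5x5 kernels, concatenated to fuse different receptive fields."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.b3 = nn.Conv2d(c_in, c_out // 2, 3, padding=1)
        self.b5 = nn.Conv2d(c_in, c_out // 2, 5, padding=2)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(torch.cat([self.b3(x), self.b5(x)], dim=1))

class FusionSegNet(nn.Module):
    def __init__(self, in_ch=256, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            MultiScaleBlock(in_ch, 64), nn.MaxPool2d(2),
            MultiScaleBlock(64, 128), nn.MaxPool2d(2),
        )
        self.fc_like = nn.Conv2d(128, 128, 1)                 # "fully connected" per location
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),   # upsample x2
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),    # upsample x2
            nn.Conv2d(32, num_classes, 1),                         # per-pixel class scores
        )

    def forward(self, shared_feat):            # (B, in_ch, H, W) shared representation
        return self.decoder(torch.relu(self.fc_like(self.encoder(shared_feat))))
```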
S7, model training and optimization: the deep learning model constructed in step S6 is trained with a labeled training data set, gradient descent optimization is performed according to a loss function, and after iterative training and verification the parameters and structure of the model are adjusted to obtain the prediction model.
Output: after training is completed, the prediction model may be used to predict or generate new data. For a given input, the prediction model generates a corresponding output, such as a classification label, a regression value, or another desired result.
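A training-loop sketch for this step, assuming per-pixel semantic labels, a cross-entropy loss, and the Adam optimizer as one common form of gradient-based optimization; all hyperparameters are illustrative:

```python
# Training-loop sketch: labelled data, cross-entropy loss, gradient-based optimization.
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-3, device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()            # per-pixel semantic labels
    for epoch in range(epochs):
        for shared_feat, labels in loader:       # labels: (B, H, W) long class indices
            shared_feat, labels = shared_feat.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(shared_feat)          # (B, num_classes, H, W)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
        # validation and checkpointing would follow each epoch in practice
    return model
```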
The invention also provides a device for the fusion method of a two-dimensional image and a point cloud based on multi-dimensional feature registration, comprising:
the preprocessing module is used for acquiring a two-dimensional image of the object and corresponding three-dimensional point cloud data and respectively preprocessing the two-dimensional image and the corresponding three-dimensional point cloud data;
the registration projection module is used for converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system and establishing spatial position association between the two-dimensional image and the three-dimensional point cloud; and the three-dimensional point cloud data after registration are projected into a corresponding two-dimensional image space;
the feature extraction module is used for respectively carrying out feature extraction on the two-dimensional image and the three-dimensional point cloud data with the mutual spatial position association relation by adopting a convolutional neural network to obtain corresponding feature representation;
the feature fusion prediction module is used for constructing a deep learning model; the deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps;
after multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features;
the decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel;
the model training and optimization module is used for training the deep learning model constructed in step S6 with a labeled training data set, performing gradient descent optimization according to a loss function, and obtaining a prediction model after iterative training and verification;
the output module is used for processing the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data through steps S1 to S5 and inputting the result into the prediction model of step S7 to obtain fusion information of the object to be processed.
< application example 1>
The method is applied to the actual situation of the transformer substation:
step one, data acquisition: and acquiring two-dimensional images and three-dimensional point cloud data of the transformer substation through equipment such as an unmanned aerial vehicle, a camera or a laser scanner. The two-dimensional image may provide color and texture information, while the three-dimensional point cloud may provide geometry and spatial coordinate information.
Step two, data registration: registering the two-dimensional image and the three-dimensional point cloud data ensures that the two-dimensional image and the three-dimensional point cloud data are in the same coordinate system. The registration can be realized by using methods such as feature point matching, automatic calibration or manual calibration. The aim of the registration is to correspond the three-dimensional point cloud of the two-dimensional image to the same spatial position, thus establishing an association between the two.
Step three, point cloud projection: and projecting the registered three-dimensional point cloud data into a corresponding two-dimensional image space. This can be achieved by mapping the three-dimensional coordinates of each point to a corresponding pixel location. Projection may be accomplished using geometric transformations and algorithms such as camera models.
Step four, feature extraction: feature information is extracted from the registered and projected data. For two-dimensional images, computer vision techniques such as edge detection and feature descriptor extraction may be used to obtain the shape and texture information of objects. For three-dimensional point clouds, geometric features such as surface normals and curvatures can be extracted, or feature extraction based on shape descriptors can be performed.
Step four, data fusion: fusing the feature representation by adopting a convolution kernel full-connection layer to obtain a shared feature representation;
Step six, the shared feature representation from step five is input into the prediction model to obtain fusion information of the object to be processed.
Step seven, visualization and application: the fused two-dimensional image and three-dimensional point cloud data are visualized and applied to specific tasks. Visualization may be achieved by rendering techniques such as point cloud rendering and texture mapping. The fused data can be used in fields such as substation modeling, safety analysis, and maintenance planning.
Although embodiments of the present invention have been disclosed above, it is not limited to the details and embodiments shown and described, it is well suited to various fields of use for which the invention would be readily apparent to those skilled in the art, and accordingly, the invention is not limited to the specific details and illustrations shown and described herein, without departing from the general concepts defined in the claims and their equivalents.

Claims (10)

1. The fusion method of the two-dimensional image and the point cloud based on the multi-dimensional feature registration is characterized by comprising the following steps of:
s1, acquiring a two-dimensional image of an object and corresponding three-dimensional point cloud data, and respectively preprocessing;
s2, registering: converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system, and establishing spatial position correlation between the two-dimensional image and the three-dimensional point cloud;
s3, point cloud projection: projecting the registered three-dimensional point cloud data into a corresponding two-dimensional image space;
s4, feature extraction: respectively extracting features of the two-dimensional image and the three-dimensional point cloud data with the mutual spatial position association relation by adopting a convolutional neural network to obtain corresponding feature representation;
S5, fusing the feature representations obtained in step S4 through convolutional and fully connected layers to obtain a shared feature representation;
S6, constructing a deep learning model: the deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps;
after multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features;
the decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel;
S7, model training and optimization: training the deep learning model constructed in step S6 with a labeled training data set, performing gradient descent optimization according to a loss function, and obtaining a prediction model after iterative training and verification;
s8, processing the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data through the steps S1-S5, and inputting the processed two-dimensional image and the processed three-dimensional point cloud data into the prediction model in the step S7 to obtain fusion information of the object to be processed.
2. The method for fusing a two-dimensional image and a point cloud based on multi-dimensional feature registration as claimed in claim 1, wherein the preprocessing in step S1 specifically comprises: preprocessing the two-dimensional image, including scaling to a target size, normalizing the pixel range of the two-dimensional image to 0-1, and cropping the two-dimensional image to obtain a region of interest;
and denoising the three-dimensional point cloud: selecting a filter type and setting filter parameters for filtering, then selecting a resampling method and setting resampling parameters to process the filtered three-dimensional point cloud data.
3. The method of merging a two-dimensional image and a point cloud based on multi-dimensional feature registration as recited in claim 2, wherein differences between pixels of the image are processed using an interpolation algorithm during scaling.
4. A method of merging a two-dimensional image with a point cloud based on multi-dimensional feature registration as claimed in claim 2, wherein the normalization method is to divide each pixel value by 255 so that the pixel range is mapped from 0-255 to between 0 and 1.
5. A method of fusion of a two-dimensional image and a point cloud based on multi-dimensional feature registration as claimed in claim 2, characterized in that the cropping criteria are defined in terms of the position in the image, the pixel values, or a bounding box.
6. The method of merging a two-dimensional image with a point cloud based on multi-dimensional feature registration of claim 2, wherein the filter type comprises one of a mean filter, a median filter, and a gaussian filter.
7. The method of merging a two-dimensional image with a point cloud based on multi-dimensional feature registration of claim 2, wherein the resampling method comprises one of voxel gridding, nearest neighbor sampling, and surface-based sampling.
8. The method for merging the two-dimensional image and the point cloud based on multi-dimensional feature registration according to claim 1, wherein a pooling layer adopts a maximum pooling method, and a maximum value of a certain area in the feature map is selected as a feature after downsampling.
9. The method of merging a two-dimensional image with a point cloud based on multi-dimensional feature registration as recited in claim 1, wherein the activation function is a softmax function, generating a probability distribution for each category.
10. The device based on the fusion method of the two-dimensional image and the point cloud based on the multi-dimensional feature registration according to any one of claims 1 to 9, characterized by comprising:
the preprocessing module is used for acquiring a two-dimensional image of the object and corresponding three-dimensional point cloud data and respectively preprocessing the two-dimensional image and the corresponding three-dimensional point cloud data;
the registration projection module is used for converting the preprocessed two-dimensional image and the three-dimensional point cloud data into the same coordinate system and establishing spatial position association between the two-dimensional image and the three-dimensional point cloud; and the three-dimensional point cloud data after registration are projected into a corresponding two-dimensional image space;
the feature extraction module is used for respectively carrying out feature extraction on the two-dimensional image and the three-dimensional point cloud data with the mutual spatial position association relation by adopting a convolutional neural network to obtain corresponding feature representation;
the feature fusion prediction module is used for constructing a deep learning model; the deep learning model comprises an encoder and a decoder. The encoder mainly consists of convolution layers and pooling layers. The convolution layers contain multiple convolution kernels of different scales, and each convolution kernel performs a convolution operation with the input data, preserving spatial structure information while extracting local features. The convolution kernels are the weights of the convolution operation, which slides a window over the input data to generate feature maps; a nonlinear activation function is then applied to the feature maps to introduce nonlinearity. The pooling layer downsamples the feature maps;
after multiple rounds of convolution and pooling, a fully connected layer performs matrix multiplication and a nonlinear transformation on the input shared feature representation and the weights to generate the output features;
the decoder mainly consists of transposed convolution (deconvolution) layers and convolution layers: the transposed convolution layers upsample the output features to increase resolution, and the convolution layers perform classification prediction, producing an output of the same size as the input image and predicting a semantic class for each pixel;
the model training and optimization module is used for training the deep learning model constructed in step S6 with a labeled training data set, performing gradient descent optimization according to a loss function, and obtaining a prediction model after iterative training and verification;
the output module is used for processing the two-dimensional image of the object to be processed and the corresponding three-dimensional point cloud data through steps S1 to S5 and inputting the result into the prediction model of step S7 to obtain fusion information of the object to be processed.
CN202311103838.9A 2023-08-30 2023-08-30 Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration Pending CN117274756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311103838.9A CN117274756A (en) 2023-08-30 2023-08-30 Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311103838.9A CN117274756A (en) 2023-08-30 2023-08-30 Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration

Publications (1)

Publication Number Publication Date
CN117274756A true CN117274756A (en) 2023-12-22

Family

ID=89203457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311103838.9A Pending CN117274756A (en) 2023-08-30 2023-08-30 Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration

Country Status (1)

Country Link
CN (1) CN117274756A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117806336A (en) * 2023-12-26 2024-04-02 珠海翔翼航空技术有限公司 Automatic berthing method, system and equipment for airplane based on two-dimensional and three-dimensional identification
CN117475357A (en) * 2023-12-27 2024-01-30 北京智汇云舟科技有限公司 Monitoring video image shielding detection method and system based on deep learning
CN117475357B (en) * 2023-12-27 2024-03-26 北京智汇云舟科技有限公司 Monitoring video image shielding detection method and system based on deep learning
CN117523548A (en) * 2024-01-04 2024-02-06 青岛臻图信息技术有限公司 Three-dimensional model object extraction and recognition method based on neural network
CN117523548B (en) * 2024-01-04 2024-03-26 青岛臻图信息技术有限公司 Three-dimensional model object extraction and recognition method based on neural network
CN117649494A (en) * 2024-01-29 2024-03-05 南京信息工程大学 Reconstruction method and system of three-dimensional tongue body based on point cloud pixel matching
CN117649494B (en) * 2024-01-29 2024-04-19 南京信息工程大学 Reconstruction method and system of three-dimensional tongue body based on point cloud pixel matching

Similar Documents

Publication Publication Date Title
CN109655019B (en) Cargo volume measurement method based on deep learning and three-dimensional reconstruction
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
Kim et al. Fully automated registration of 3D data to a 3D CAD model for project progress monitoring
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN109615611B (en) Inspection image-based insulator self-explosion defect detection method
CN117274756A (en) Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration
Brilakis et al. Toward automated generation of parametric BIMs based on hybrid video and laser scanning data
CN111027547A (en) Automatic detection method for multi-scale polymorphic target in two-dimensional image
Xu et al. Reconstruction of scaffolds from a photogrammetric point cloud of construction sites using a novel 3D local feature descriptor
EP0526881B1 (en) Three-dimensional model processing method, and apparatus therefor
CN109598794B (en) Construction method of three-dimensional GIS dynamic model
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN111462120A (en) Defect detection method, device, medium and equipment based on semantic segmentation model
CN111985466A (en) Container dangerous goods mark identification method
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN116778288A (en) Multi-mode fusion target detection system and method
CN113538373A (en) Construction progress automatic detection method based on three-dimensional point cloud
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114332796A (en) Multi-sensor fusion voxel characteristic map generation method and system
Yin et al. [Retracted] Virtual Reconstruction Method of Regional 3D Image Based on Visual Transmission Effect
CN114332211B (en) Part pose calculation method based on edge reconstruction and dense fusion network
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN115564915A (en) Map construction method and device for environment digital area of transformer substation
Amirkolaee et al. Convolutional neural network architecture for digital surface model estimation from single remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination