CN113159232A

CN113159232A - Three-dimensional target classification and segmentation method

Info

Publication number: CN113159232A
Application number: CN202110560118.XA
Authority: CN
Inventors: 韩先锋; 金依菲
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2021-07-23

Abstract

The invention discloses a three-dimensional target classification and segmentation method, wherein the classification method comprises the following steps: acquiring three-dimensional point cloud data of a target to be classified; performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map; and inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain a classified target. The invention directly takes the original point cloud data as input, and does not need any preprocessing methods such as voxelization or projection and the like, so the invention is not limited by information loss and high calculation complexity, can capture context information in a long range, and has better point cloud characteristic expression capability.

Description

Three-dimensional target classification and segmentation method

Technical Field

The invention relates to the field of artificial intelligence, in particular to a three-dimensional target classification and segmentation method.

Background

In the early stages of computer vision development, the visual perception of the objective world by machines was primarily dependent on two-dimensional images or image sequences captured by cameras. However, the world is three-dimensional in euclidean space, and images because only capturing information of the world projected at a certain perspective will create uncertainty in characterizing the dimensional and geometric properties of the object. In contrast, point clouds, which are the most primitive three-dimensional data representations, can accurately reflect the real size and shape structure of an object, and gradually become another data form on which machine vision perception depends. With the emergence of 3D acquisition devices such as microsoft Kinect, google Tango tablet, intel real sense and the like, the acquisition of point cloud data is as convenient as images, which further promotes the development of three-dimensional computer vision technology, and 3D point cloud also plays more and more important roles in the fields of virtual/augmented reality, autonomous driving, robot technology and the like, so how to perform effective point cloud analysis becomes a problem to be solved urgently.

In recent years, deep learning techniques have enjoyed tremendous success in the computer graphics field, which in fact provides opportunities for better understanding point clouds. However, the point cloud is composed of a plurality of discrete, unordered, topological-structure-free three-dimensional points, which is an initial form of data acquired by the three-dimensional sensing system, so that a learner needs to preprocess the point cloud data before processing the point cloud data by using a traditional convolutional neural network, and there are two main methods adopted at present:

1. the multi-view based method aims at projecting point cloud data into a 2D image set at certain specific visual angles, such as a front-view visual angle and a bird's eye-view visual angle, converting a 3D problem into a 2D problem, and therefore applying a 2D neural network to carry out feature learning. And simultaneously, image information from a camera is fused and used, and data of different visual angles are combined to realize object classification and part and semantic segmentation tasks of point cloud data. The pioneering working MVCNN extracts multi-view features into global descriptors using max pool operations. The View-GCN constructs a directed graph with the views as nodes. Although this method has a significant effect on tasks such as object classification, it is still difficult to determine the appropriate number of views to cover a 3D object under constraints of geometric information loss and high computational cost.

2. Voxelization: and dividing the point cloud data into regular grids. In the method, the three-dimensional space is segmented, the spatial dependence is introduced into the point cloud data, and the method is very suitable for extracting feature representation by using a three-dimensional convolution neural network. Scholars have proposed methods such as OctNet and Kd-Net to gather data information and skip empty voxel grids. The pointgrid approach improves local geometric detail learning by including points in each mesh. However, the accuracy of the method depends on the segmentation degree of the three-dimensional space, and the computation complexity of the 3D convolution is high, and is affected by the increased computation complexity and memory requirement of the cube in terms of the volume resolution and the geometric information loss.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention provides a three-dimensional object classification and segmentation method, which is used to solve the shortcomings of the prior art.

To achieve the above and other related objects, the present invention provides a three-dimensional object classification method, including:

acquiring three-dimensional point cloud data of a target to be classified;

performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map;

and inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain a classified target.

Optionally, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:

the method comprises a feature downsampling layer and a Transformer model based on an attention mechanism which are connected in sequence.

Optionally, each of the Transformer models includes a self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map through an aggregation point self-attention mechanism and a channel self-attention mechanism;

wherein the content of the first and second substances,

a characteristic diagram of the l +1 layer which represents the output of the Transformer model,

performing multi-point self-attention operation on the first layer point feature map to obtain a point feature map;

represents passing through the first layerAnd the characteristic diagram is obtained by carrying out channel self-attention operation on the characteristic diagram.

Optionally, the point-of-attention mechanism model

Expressed as:

m denotes the Mth point self-attention mechanism module, MHAT_PWSA(F^l) Feature diagram F for ith layer point^lPerforming multi-head self-attention operation;

where M represents the index of a point from the attention head, M1, 2, 3.

Is the point space feature matrix of the mth point self-attention head, sigma represents the softmax operation,

is a learnable weight parameter of three linear layers, and d_q＝d_k＝d_v＝d_cC represents a characteristic dimension,

respectively representing the query, key and value matrix of the mth head of the l +1 th layer point multi-head attention model ·^TIndicating transposition.

Optionally, the channel self-attention mechanism model

Expressed as:

MHAT_CWSA(F^l) Feature diagram F for ith layer point^lThe multi-head self-attention operation of the channel is carried out,

wherein the content of the first and second substances,

is a matrix of the characteristics of the channels,

is a weight matrix of the full connection layer, and d_c＝C/M',

Respectively representing the query, key, value matrix of the mth head of the l +1 th channel multi-head attention model ·^TIndicating transposition.

Optionally, the transform-based feature extraction module includes 3 feature extraction units that are cascaded in sequence; the characteristic diagram is subjected to 3 cascaded full-connection layers to obtain the category of the target.

To achieve the above and other related objects, the present invention provides a three-dimensional object segmentation method, comprising:

classifying the target to be classified by using the classification method to obtain a classified target;

performing at least two times of feature extraction on the classified point cloud data of the target to obtain a feature map;

and inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain segmentation.

Optionally, the performing at least two feature extractions on the point cloud data of the classified target includes:

performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;

performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;

the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;

the first feature extraction unit comprises a feature down-sampling layer and a feature extraction subunit, the second feature extraction unit comprises a feature up-sampling layer and a feature extraction subunit, and the feature extraction subunit comprises: connected in sequence, attention-based transducer models.

wherein the content of the first and second substances,

the characteristic diagram obtained by performing channel self-attention operation on the characteristic diagram of the l-th layer is shown.

Optionally, thePoint self-attention mechanism model

Expressed as:

where M represents the index of a point from the attention head, M1, 2, 3.

respectively representing the query, key and value matrix of the mth head of the l +1 th layer point multi-head attention model ·^TRepresenting a transpose;

the channel self-attention mechanism model

Expressed as:

wherein the content of the first and second substances,

is a matrix of the characteristics of the channels,

is a weight matrix of the full connection layer, and d_c＝C/M',

As described above, the three-dimensional object classification and segmentation method of the present invention has the following beneficial effects:

the invention directly takes the original point cloud data as input, and does not need any preprocessing methods such as voxelization or projection and the like, so the invention is not limited by information loss and high calculation complexity, can capture context information in a long range, and has better point cloud characteristic expression capability.

Drawings

FIG. 1 is a flowchart illustrating a three-dimensional object classification method according to an embodiment of the present invention;

FIG. 2 is a diagram of a Transformer model based on the attention mechanism according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a classification network according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a three-dimensional object segmentation method according to an embodiment of the present invention;

fig. 5 is a schematic diagram of a split network according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

As shown in fig. 1, an embodiment of the present application provides a three-dimensional object classification method, including:

s11, acquiring three-dimensional point cloud data of the target to be classified;

s12, extracting the characteristics of the three-dimensional point cloud data by using a Transformer-based characteristic extraction module to obtain a characteristic diagram;

s13, inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain classified targets.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In one embodiment, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:

FIG. 2 shows a Transformer model based on the attention mechanism. In fig. 2, N represents the number of points of the point cloud model, C represents the feature dimension, and Q, K, and V represent the Query (Query), Key (Key), and Value (Value) matrix in the self-attention mechanism, respectively. Because a multi-headed attention mechanism is used, Q, K, V can be viewed as a query, key, value matrix for each head split from the capital Q, K, V. k is a radical of^T，q^TRepresenting a matrix transposition operation, d_cRepresenting the dimensions of each head.

In an embodiment, each of the fransormer models includes a Point-wise-attention-system (PWSA) module (as shown in an upper portion of fig. 2) and a channel-wise-attention-system (PWSA) module (as shown in a lower portion of fig. 2), and the Transformer model obtains the feature map by aggregating the Point-wise-attention-system and the channel-wise-attention-system;

wherein the content of the first and second substances,

In one embodiment, the point-of-interest self-attention mechanism model

Expressed as:

the characteristic diagram of the point at the l +1 st layer obtained by the above-mentioned self-attention calculation is shown.

Where M represents the index of a point from the attention head, M1, 2, 3.

respectively representing the query, key and value matrix of the mth head of the l +1 th layer point multi-head attention model (·)^TIndicating transposition.

In order to emphasize the importance of the interaction between different point feature map channels, the invention adopts a basic idea similar to a point self-attention mechanism to construct a channel multi-head self-attention model, as shown in the lower part of fig. 2. In one embodiment, the channel self-attention mechanism model

Is shown as：

representing a characteristic diagram of the (l + 1) th layer point obtained through the channel self-attention operation;

wherein the content of the first and second substances,

representing the mutual influence of all channels by a channel characteristic matrix;

is a weight matrix of the full connection layer, and d_c＝C/M,

In one embodiment, the transform-based feature extraction module includes 3 feature extraction units which are cascaded in sequence; the characteristic diagram is subjected to 3 cascaded full-connection layers to obtain the category of the target.

In one embodiment, the ModelNet dataset is used as an evaluation index for the classification task to determine the Overall Accuracy (OA) of the classification network as shown in FIG. 3.

The ModelNet dataset consists of 12311 CAD models from 40 classes, with 9843 shapes for training and 2468 objects for testing. 1024 points were sampled from each model according to the PointNet criteria. During training, the authenticity of the data is enhanced by exploiting random point loss, random scaling in [0.8, 1.25], and random shifts in [ -0.1, 0.1 ].

The network for the object classification task is shown in fig. 3, and comprises a transform-based feature extraction module and three cascaded fully-connected layers; the feature extraction module based on the Transformer is formed by cascading three feature extraction units, wherein each feature extraction unit comprises a feature down-sampling layer and a Transformer model based on an attention mechanism. The feature extraction module based on the Transformer is formed by cascading a plurality of feature extraction units; the feature extraction unit includes: the system comprises a feature down-sampling layer FDS and a transducer model DPCT based on an attention mechanism which are connected in sequence. The number of dots and channels used per layer is as follows:

INPUT(N＝1.024，C＝3)-FDS(N＝512，C＝128)-DPCT(N＝512，C＝320)-FDS(N＝128，C＝256)-DPCT(N＝256，C＝640)-FDS(N＝1，C＝1024)-DPCT(C＝1024)-FC(512)-FC(256)-FC(40)

setting parameters: the network iterates 150 times during training, processing 16 batches of data each time, with an initial learning rate of 0.001 and a reduction of 0.7 per iteration of 20.

And (3) comparing the performances: table 1 shows the quantitative performance comparison of the present invention with other techniques. It is clear from the results that the network of the invention achieves an overall accuracy of 92.9%, which exceeds PointNet and point2sequence by 3.7% and 0.3%, respectively. This demonstrates the effectiveness of our model.

Table 1: object classification result based on ModelNet40

According to the method, a transducer model based on an attention mechanism is constructed, firstly, the transducer has displacement invariance, changes caused by different input sequences of points are eliminated, and therefore the method can be used for directly processing on point cloud without preprocessing operations such as multi-view and voxel grid, so that the loss of geometric information is greatly reduced, and secondly, feature extraction is carried out from spatial correlation and channel correlation to capture the dependency of context semantic features, so that the characterization capability of depth fusion features is enhanced, and important support is provided for accurate understanding of a three-dimensional point cloud scene.

As shown in fig. 4, an embodiment of the present application provides a three-dimensional object segmentation method, where object segmentation is performed based on object classification, where the classification is to determine that an entire point cloud model belongs to an object, and for example, a chair is obtained after the point cloud model is classified; the segmentation is more detailed, and each point of the point cloud model is classified, for example, the segmentation of the chair model is to distinguish that the part is a backrest, that the part is a leg, that the part is a chair surface, and the like.

The three-dimensional object segmentation method shown in fig. 4 includes the following steps:

s41, classifying the target to be classified by the classification method to obtain the classified target;

s42, performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map;

s43 inputs the feature map into a fully connected module formed by connecting a plurality of fully connected layers, and the feature map is divided.

In an embodiment, the performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map includes:

the first Feature extraction unit includes a Feature Down Sample Layer (Feature Down Sample Layer) and a Feature extraction subunit, the second Feature extraction unit includes a Feature Up Sample Layer (Feature Up Sample Layer) and a Feature extraction subunit, and the Feature extraction subunit includes: connected in sequence, attention-based transducer models.

As shown in fig. 5, the first feature extraction module includes 4 first feature extraction units, namely, a first-stage first feature extraction unit, a second-stage first feature extraction unit, a third-stage first feature extraction unit, and a fourth-stage first feature extraction unit; the second feature extraction module comprises 4 second feature extraction units, namely a first-stage second feature extraction unit, a second-stage second feature extraction unit, a third-stage second feature extraction unit and a fourth-stage second feature extraction unit. As shown in the figure, the first-stage first feature extraction unit, the second-stage first feature extraction unit, the third-stage first feature extraction unit and the fourth-stage first feature extraction unit are sequentially Connected, the first-stage second feature extraction unit, the second-stage second feature extraction unit, the third-stage second feature extraction unit and the fourth-stage second feature extraction unit are sequentially Connected, the first-stage first feature extraction unit is Connected with the third-stage first feature extraction unit, the second-stage first feature extraction unit is Connected with the second-stage second feature extraction unit, the third-stage first feature extraction unit is Connected with the first-stage second feature extraction unit, the last-stage second feature extraction unit is Connected to a full connection Layer (full Connected Layer), and a segmentation target is obtained after the full connection Layer.

The first feature extraction unit comprises a feature down-sampling layer and a transducer model (Dual Point Cloud transducer) based on attention connected with the output of the feature down-sampling layer; the second feature extraction unit comprises a feature upsampling layer and an attention-based transform model connected with the output of the feature upsampling layer.

In an embodiment, each of the fransformer models includes a Point-wise self-attention-initiation (PWSA) module (as shown in an upper portion of fig. 2) and a channel-wise self-attention (PWSA) module (as shown in a lower portion of fig. 2), and the fransformer model obtains the feature map through an aggregation Point self-attention mechanism and a channel self-attention mechanism;

wherein the content of the first and second substances,

In one embodiment, the point-of-interest self-attention mechanism model

Expressed as:

Where M represents the index of a point from the attention head, M1, 2, 3.

Expressed as:

wherein the content of the first and second substances,

is a weight matrix of the full connection layer, and d_c＝C/M',

In one embodiment, to build multi-scale hierarchical features, a feature down-sampling layer (FDS) is added before the attention-based transform model. Specifically, for the input feature map F^lPerforming a farthest point sampling algorithm (FPS) to generate a sub-feature map

Next, for sub-feature map F'^lTo aggregate and assign to it the characteristics of all points in its spherical neighborhood, followed by linear transformation, Batch Normalization (BN), and ReLU operations. This downsampled (FDS) layer can be briefly summarized as follows:

F^l+1＝Relu(BN(W'^l(Agg(FPS(F^l)))))

wherein, Agg (·)) Means local feature polymerization operation, W'^lLearnable weight parameters representing a linear transformation. FPS (-) is a plot of point characteristics F for the l layer^lAnd performing the farthest point sampling operation and performing downsampling.

In one embodiment, for more accurate prediction in the segmentation task, a feature upsampling layer is placed in the decoder portion to increase the resolution of the point feature map to the original image size. The point set is up-sampled by using a K-nearest neighbor interpolation algorithm based on Euclidean distance.

In one embodiment, a Shapelet part benchmark dataset consisting of 16881 objects from 16 different classes, for a total of 50 part labels, is selected to train and test the part segmentation effect of the segmentation network. The training/testing data ratio of 14007/2874 officially given by the dataset was followed and 2048 points were sampled from each shape as input. Furthermore, the same data enhancement as the classification task is performed. The assessment indicators include the average IoU value and the classification IoU for all part categories.

The architecture of the component-partitioned network is shown in fig. 5. The encoder structure (the first feature extraction units) is similar to the classification task, and the decoder (the second feature extraction units) is added with three second feature extraction units (including a feature upsampling layer and a attention-based transform model). The number of dots and channels used per layer is as follows:

Input(N＝2048，C＝3)-FDS(N＝512，C＝320)-DPCT(N＝512，C＝320)-FDS(N＝128，C＝512)-DPCT(N＝128，C＝512)-FDS(N＝1，C＝1024)-DPCT(N＝1，C＝2014)-FUS(N＝128，C＝256)-DPCT(N＝128)

setting parameters: the network iterates 80 times during training, with an initial learning rate of 0.0005 and a 50% reduction per 20 iterations.

And (3) comparing the performances: quantitative comparison of the method given in table 2 with the current latest model. Unlike methods in which PointNet, PointNet + + and SO-Net introduce normal vectors and point coordinates simultaneously, the multiple Transformer model of the present invention uses only XYZ coordinates as input features. The segmentation result shows that the method of the invention achieves the highest 85.6 percent of mIoU and respectively exceeds 0.5 percent and 0.2 percent of PointNet + + and the current optimal method SFCNN. In particular, the method of the present invention performs better in certain types of segmentation than these competing methods, such as chairs, lights, skates, tables, and the like.

Table 2: component segmentation results based on ShapeNet dataset

The invention breaks through the restriction of traditional mode information loss and high computation complexity, does not need any preprocessing method such as voxelization or projection, can directly input original point cloud data, has the capability of capturing the context information of a long range of point clouds, has better point cloud characteristic description capability, is suitable for popularization and application in the fields of computer vision, computer graphics, robotics, remote sensing and the like, and has important practical application value. For example, applications in the field of remote sensing include large-scene remote sensing point cloud splicing and terrain scene reconstruction; the application in the field of cultural heritage protection comprises the construction of an ancient cultural relic digital model base based on multi-mesh point cloud splicing and reconstruction; typical applications in the field of computer vision are three-dimensional face recognition, three-dimensional target classification detection and recognition and gesture tracking of three-dimensional moving objects; the application in the aerospace field comprises the motion pose resolving of space non-cooperative targets and the like; the main application in the robot field comprises the estimation of the grabbing and placing postures of objects; the application in the field of national defense includes air-to-ground precise target striking and the like.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may comprise any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for classifying a three-dimensional object, comprising:

acquiring three-dimensional point cloud data of a target to be classified;

performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map; and inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain a classified target.

2. The method of classifying a three-dimensional object according to claim 1, wherein the transform-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:

3. The method for classifying a three-dimensional object according to claim 2, wherein each of the fransormer models includes a self-attention mechanism module and a channel self-attention mechanism module, and the fransormer models obtain the feature map through an aggregation point self-attention mechanism and a channel self-attention mechanism;

wherein the content of the first and second substances,

feature diagram representing the passing through to the I < th > layerAnd (5) carrying out a characteristic diagram obtained by channel self-attention operation.

4. The method of claim 3, wherein the point auto-attention mechanism model

Expressed as:

where M represents the index of a point from the attention head, M1, 2, 3.

5. The method of claim 4, wherein the channel is self-classifyingAttention mechanism model

Expressed as:

wherein the content of the first and second substances,

is a matrix of the characteristics of the channels,

is a weight matrix of the full connection layer, and

respectively representing the query, key and value matrixes of the mth head of the l + 1-layer channel multi-head attention model.

6. The three-dimensional object classification method according to claim 5, characterized in that the transform-based feature extraction module comprises 3 feature extraction units which are cascaded in sequence; the characteristic diagram is subjected to 3 cascaded full-connection layers to obtain the category of the target.

7. A method for segmenting a three-dimensional object, comprising:

classifying the target to be classified by using the classification method according to any one of claims 1 to 6 to obtain the classified target;

8. The method of claim 7, wherein the performing at least two feature extractions on the point cloud data of the classified objects comprises:

9. The method of claim 8, wherein each of the fransormer models comprises a self-attention mechanism module and a channel self-attention mechanism module, and the fransormer models obtain the feature map by an aggregation point self-attention mechanism and a channel self-attention mechanism;