CN113159232A - Three-dimensional target classification and segmentation method - Google Patents

Three-dimensional target classification and segmentation method

Info

Publication number
CN113159232A
CN113159232A (Application No. CN202110560118.XA)
Authority
CN
China
Prior art keywords
feature extraction
point
feature
attention
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110560118.XA
Other languages
Chinese (zh)
Inventor
Xian-Feng Han (韩先锋)
Yi-Fei Jin (金依菲)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University
Original Assignee
Southwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University filed Critical Southwest University
Priority to CN202110560118.XA priority Critical patent/CN113159232A/en
Publication of CN113159232A publication Critical patent/CN113159232A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional target classification and segmentation method, wherein the classification method comprises the following steps: acquiring three-dimensional point cloud data of a target to be classified; performing feature extraction on the three-dimensional point cloud data by using a Transformer-based feature extraction module to obtain a feature map; and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the classified target. The invention takes raw point cloud data directly as input and requires no preprocessing such as voxelization or projection, so it is not limited by information loss or high computational complexity, can capture long-range context information, and has better point cloud feature representation capability.

Description

Three-dimensional target classification and segmentation method
Technical Field
The invention relates to the field of artificial intelligence, in particular to a three-dimensional target classification and segmentation method.
Background
In the early stages of computer vision, machine perception of the physical world relied mainly on two-dimensional images or image sequences captured by cameras. However, the world is three-dimensional in Euclidean space, and images, which only capture the world as projected from a particular viewpoint, introduce uncertainty when characterizing the size and geometric properties of objects. In contrast, point clouds, the most primitive three-dimensional data representation, can accurately reflect the real size and shape of an object and have gradually become another data form on which machine visual perception depends. With the emergence of 3D acquisition devices such as the Microsoft Kinect, Google Tango tablet and Intel RealSense, acquiring point cloud data has become as convenient as acquiring images, which has further promoted the development of three-dimensional computer vision. 3D point clouds also play an increasingly important role in fields such as virtual/augmented reality, autonomous driving and robotics, so how to perform effective point cloud analysis has become an urgent problem to be solved.
In recent years, deep learning techniques have achieved tremendous success in computer graphics, which in turn provides opportunities for better understanding point clouds. However, a point cloud consists of many discrete, unordered three-dimensional points with no topological structure, which is the raw form of the data acquired by a three-dimensional sensing system. Researchers therefore need to preprocess point cloud data before processing it with a traditional convolutional neural network, and two main approaches are currently adopted:
1. Multi-view based methods project the point cloud data into a set of 2D images from specific viewpoints, such as the front view and the bird's-eye view, converting the 3D problem into a 2D problem so that 2D neural networks can be applied for feature learning. Image information from cameras can be fused at the same time, and data from different viewpoints are combined to accomplish object classification and part and semantic segmentation of point cloud data. The pioneering work MVCNN aggregates multi-view features into a global descriptor using max-pooling operations. View-GCN constructs a directed graph with the views as nodes. Although such methods perform well on tasks such as object classification, it remains difficult to determine an appropriate number of views to cover a 3D object, and they are constrained by geometric information loss and high computational cost.
2. Voxelization-based methods divide the point cloud data into regular grids. By partitioning the three-dimensional space, spatial dependence is introduced into the point cloud data, which makes the representation well suited to feature extraction with three-dimensional convolutional neural networks. Methods such as OctNet and Kd-Net have been proposed to gather data information and skip empty voxel grids. The PointGrid approach improves the learning of local geometric details by incorporating the points within each grid cell. However, the accuracy of such methods depends on how finely the three-dimensional space is partitioned, 3D convolution is computationally expensive, and they are further affected by geometric information loss and by computation and memory requirements that grow cubically with the voxel resolution.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention provides a three-dimensional object classification and segmentation method, which is used to solve the shortcomings of the prior art.
To achieve the above and other related objects, the present invention provides a three-dimensional object classification method, including:
acquiring three-dimensional point cloud data of a target to be classified;
performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map;
and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the classified target.
Optionally, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:
the method comprises a feature downsampling layer and a Transformer model based on an attention mechanism which are connected in sequence.
Optionally, each Transformer model includes a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
Optionally, the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules) and MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition.
Optionally, the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where A'_m^{l+1} is the channel feature matrix, W' is the weight matrix of the fully connected layer, d_c = C/M', Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model, and (·)^T denotes transposition.
Optionally, the Transformer-based feature extraction module includes 3 feature extraction units cascaded in sequence; the feature map passes through 3 cascaded fully connected layers to obtain the category of the target.
To achieve the above and other related objects, the present invention provides a three-dimensional object segmentation method, comprising:
classifying the target to be classified by using the classification method to obtain a classified target;
performing at least two times of feature extraction on the classified point cloud data of the target to obtain a feature map;
and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the segmentation result.
Optionally, the performing at least two feature extractions on the point cloud data of the classified target includes:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first feature extraction unit comprises a feature down-sampling layer and a feature extraction subunit, the second feature extraction unit comprises a feature up-sampling layer and a feature extraction subunit, and the feature extraction subunit comprises attention-based Transformer models connected in sequence.
Optionally, each Transformer model includes a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
Optionally, the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules) and MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition.
The channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where A'_m^{l+1} is the channel feature matrix, W' is the weight matrix of the fully connected layer, d_c = C/M', Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model, and (·)^T denotes transposition.
As described above, the three-dimensional object classification and segmentation method of the present invention has the following beneficial effects:
the invention directly takes the original point cloud data as input, and does not need any preprocessing methods such as voxelization or projection and the like, so the invention is not limited by information loss and high calculation complexity, can capture context information in a long range, and has better point cloud characteristic expression capability.
Drawings
FIG. 1 is a flowchart illustrating a three-dimensional object classification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a Transformer model based on the attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a classification network according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a three-dimensional object segmentation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a segmentation network according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, an embodiment of the present application provides a three-dimensional object classification method, including:
s11, acquiring three-dimensional point cloud data of the target to be classified;
s12, extracting the characteristics of the three-dimensional point cloud data by using a Transformer-based characteristic extraction module to obtain a characteristic diagram;
s13, inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain classified targets.
The invention takes raw point cloud data directly as input and requires no preprocessing such as voxelization or projection, so it is not limited by information loss or high computational complexity, can capture long-range context information, and has better point cloud feature representation capability.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:
a feature down-sampling layer and an attention-based Transformer model that are connected in sequence.
FIG. 2 shows the attention-based Transformer model. In FIG. 2, N denotes the number of points in the point cloud model, C denotes the feature dimension, and Q, K and V denote the Query, Key and Value matrices in the self-attention mechanism, respectively. Because a multi-head attention mechanism is used, the lowercase q, k and v can be viewed as the per-head query, key and value matrices split from the capital Q, K and V. k^T and q^T denote matrix transposition operations, and d_c denotes the dimension of each head.
In an embodiment, each Transformer model includes a Point-Wise Self-Attention (PWSA) module (shown in the upper portion of FIG. 2) and a Channel-Wise Self-Attention (CWSA) module (shown in the lower portion of FIG. 2), and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
In one embodiment, the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules), MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l, and F_PWSA^{l+1} denotes the (l+1)-th layer point feature map obtained by this self-attention operation;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition.
In order to emphasize the importance of the interactions between different channels of the point feature map, the invention adopts the same basic idea as the point self-attention mechanism to construct a channel multi-head self-attention model, as shown in the lower part of FIG. 2. In one embodiment, the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l, and F_CWSA^{l+1} denotes the (l+1)-th layer point feature map obtained by the channel self-attention operation;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c) is the channel feature matrix representing the mutual influence among all channels; W' is the weight matrix of the fully connected layer, and d_c = C/M'; Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model; and (·)^T denotes transposition.
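For illustration, a minimal PyTorch-style sketch of such a dual self-attention block is given below. It follows the formulas above (point-wise heads attend over the N points, channel-wise heads attend over the per-head channels, and the two branches are summed); the class name, tensor layout (B, N, C), head count and the absence of residual connections or normalization layers are assumptions made for the sketch, not details fixed by this description.

```python
import torch
import torch.nn as nn


class DualPointCloudTransformer(nn.Module):
    """Dual self-attention sketch: point-wise heads attend over the N points,
    channel-wise heads attend over the per-head channels, and the two branches
    are aggregated by element-wise addition, F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.h, self.dc = num_heads, channels // num_heads
        # three linear layers per branch produce the query, key and value matrices
        self.q_p, self.k_p, self.v_p = (nn.Linear(channels, channels) for _ in range(3))
        self.q_c, self.k_c, self.v_c = (nn.Linear(channels, channels) for _ in range(3))
        self.w_out = nn.Linear(channels, channels)   # W' of the channel-wise branch

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.dc).transpose(1, 2)      # (B, heads, N, dc)

    def forward(self, f: torch.Tensor) -> torch.Tensor:           # f: (B, N, C), the map F^l
        # point-wise self-attention: each head builds an N x N attention map
        q, k, v = self._split(self.q_p(f)), self._split(self.k_p(f)), self._split(self.v_p(f))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dc ** 0.5, dim=-1)
        f_pwsa = (attn @ v).transpose(1, 2).reshape(f.shape)       # concatenation of the M heads
        # channel-wise self-attention: each head builds a dc x dc attention map
        q, k, v = self._split(self.q_c(f)), self._split(self.k_c(f)), self._split(self.v_c(f))
        attn = torch.softmax(q.transpose(-2, -1) @ k / self.dc ** 0.5, dim=-1)
        f_cwsa = self.w_out((v @ attn).transpose(1, 2).reshape(f.shape))
        return f_pwsa + f_cwsa                                     # aggregate the two branches


if __name__ == "__main__":
    block = DualPointCloudTransformer(channels=128, num_heads=4)
    print(block(torch.randn(2, 512, 128)).shape)    # torch.Size([2, 512, 128])
```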
In one embodiment, the Transformer-based feature extraction module includes 3 feature extraction units which are cascaded in sequence; the feature map passes through 3 cascaded fully connected layers to obtain the category of the target.
In one embodiment, the ModelNet dataset is used to evaluate the classification network shown in FIG. 3, with Overall Accuracy (OA) as the evaluation metric for the classification task.
The ModelNet40 dataset consists of 12311 CAD models from 40 classes, with 9843 shapes used for training and 2468 objects used for testing. 1024 points are sampled from each model following the PointNet convention. During training, the data are augmented by random point dropout, random scaling in [0.8, 1.25], and random shifts in [-0.1, 0.1].
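A sketch of this augmentation, assuming a NumPy point array of shape (N, 3) and a conventional dropout strategy in which dropped points are overwritten by the first point, could look as follows; only the scaling and shift ranges come from the text, and the maximum dropout ratio is an assumption.

```python
import numpy as np


def augment_point_cloud(points: np.ndarray, max_dropout_ratio: float = 0.875) -> np.ndarray:
    """points: (N, 3) XYZ coordinates sampled from one shape."""
    pts = points.copy()
    # random point dropout: dropped points are overwritten by the first point so
    # that the tensor shape stays fixed (a common convention assumed here)
    drop_ratio = np.random.rand() * max_dropout_ratio
    drop_idx = np.where(np.random.rand(pts.shape[0]) <= drop_ratio)[0]
    if drop_idx.size > 0:
        pts[drop_idx] = pts[0]
    # random scaling in [0.8, 1.25]
    pts *= np.random.uniform(0.8, 1.25)
    # random shift in [-0.1, 0.1] along each axis
    pts += np.random.uniform(-0.1, 0.1, size=(1, 3))
    return pts
```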
The network for the object classification task is shown in FIG. 3 and comprises a Transformer-based feature extraction module and three cascaded fully connected layers. The Transformer-based feature extraction module is formed by cascading three feature extraction units, and each feature extraction unit comprises a feature down-sampling layer (FDS) and an attention-based Transformer model (DPCT) connected in sequence. The number of points and channels used in each layer is as follows:
INPUT(N=1024,C=3)-FDS(N=512,C=128)-DPCT(N=512,C=320)-FDS(N=128,C=256)-DPCT(N=256,C=640)-FDS(N=1,C=1024)-DPCT(C=1024)-FC(512)-FC(256)-FC(40)
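As a concrete illustration of the FC(512)-FC(256)-FC(40) head at the end of this configuration, a PyTorch-style sketch is shown below; it consumes the 1024-dimensional global feature produced by the last FDS-DPCT stage, and the batch normalization, ReLU and dropout placement are assumptions not spelled out above.

```python
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    """FC(512)-FC(256)-FC(40) head from the configuration above; it maps the
    1024-dimensional global feature to the 40 ModelNet40 class scores.
    Batch normalization, ReLU and dropout placement are assumptions."""

    def __init__(self, in_dim: int = 1024, num_classes: int = 40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, global_feature: torch.Tensor) -> torch.Tensor:
        return self.mlp(global_feature)          # (B, 40) class scores


if __name__ == "__main__":
    head = ClassificationHead()
    print(head(torch.randn(8, 1024)).shape)      # torch.Size([8, 40])
```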
Parameter settings: the network is trained for 150 epochs with a batch size of 16, an initial learning rate of 0.001, and a learning-rate decay factor of 0.7 every 20 epochs.
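These settings map directly onto a standard optimizer and step learning-rate schedule; the snippet below is a sketch under the assumption of an Adam optimizer (the optimizer type is not named in the text) and uses a stand-in model.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the classification network described above.
model = nn.Linear(1024, 40)

# Values taken from the parameter settings: 150 training epochs, batch size 16,
# initial learning rate 0.001, multiplied by 0.7 every 20 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.7)

for epoch in range(150):
    # ... one pass over the ModelNet40 training set in batches of 16 goes here ...
    scheduler.step()
```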
Performance comparison: Table 1 shows the quantitative performance comparison between the present invention and other techniques. The results show that the network of the invention achieves an overall accuracy of 92.9%, exceeding PointNet and Point2Sequence by 3.7% and 0.3%, respectively, which demonstrates the effectiveness of the model.
Table 1: object classification result based on ModelNet40
The invention constructs an attention-based Transformer model. First, the Transformer is permutation invariant, eliminating the variations caused by different input orderings of the points, so the method can operate directly on the point cloud without preprocessing such as multi-view projection or voxel grids, greatly reducing the loss of geometric information. Second, features are extracted from both spatial correlation and channel correlation to capture the dependencies of contextual semantic features, which strengthens the representation capability of the deeply fused features and provides important support for accurate understanding of three-dimensional point cloud scenes.
As shown in FIG. 4, an embodiment of the present application provides a three-dimensional object segmentation method. Object segmentation is performed on the basis of object classification: classification determines what object the whole point cloud model represents, for example that a point cloud model is a chair, whereas segmentation is finer-grained and classifies each point of the point cloud model, for example distinguishing which part of the chair model is the backrest, which parts are the legs, which part is the seat, and so on.
The three-dimensional object segmentation method shown in fig. 4 includes the following steps:
s41, classifying the target to be classified by the classification method to obtain the classified target;
s42, performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map;
s43 inputs the feature map into a fully connected module formed by connecting a plurality of fully connected layers, and the feature map is divided.
In an embodiment, the performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map includes:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first Feature extraction unit includes a Feature Down Sample Layer (Feature Down Sample Layer) and a Feature extraction subunit, the second Feature extraction unit includes a Feature Up Sample Layer (Feature Up Sample Layer) and a Feature extraction subunit, and the Feature extraction subunit includes: connected in sequence, attention-based transducer models.
As shown in FIG. 5, the first feature extraction module includes 4 first feature extraction units, namely a first-stage first feature extraction unit, a second-stage first feature extraction unit, a third-stage first feature extraction unit and a fourth-stage first feature extraction unit; the second feature extraction module includes 4 second feature extraction units, namely a first-stage second feature extraction unit, a second-stage second feature extraction unit, a third-stage second feature extraction unit and a fourth-stage second feature extraction unit. As shown in the figure, the first-stage to fourth-stage first feature extraction units are connected in sequence, and the first-stage to fourth-stage second feature extraction units are connected in sequence; the first-stage first feature extraction unit is connected with the third-stage second feature extraction unit, the second-stage first feature extraction unit is connected with the second-stage second feature extraction unit, and the third-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the last-stage second feature extraction unit is connected to a fully connected layer (Fully Connected Layer), and the segmentation target is obtained after the fully connected layer.
The first feature extraction unit comprises a feature down-sampling layer and an attention-based Transformer model (Dual Point Cloud Transformer) connected to the output of the feature down-sampling layer; the second feature extraction unit comprises a feature up-sampling layer and an attention-based Transformer model connected to the output of the feature up-sampling layer.
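A structural sketch of this encoder-decoder wiring, written in PyTorch style, is shown below; the EncUnit/DecUnit classes are stand-ins for the FDS + DPCT and FUS + DPCT units, the fusion by addition and the treatment of the last decoder stage are assumptions, and only the four-stage layout and the skip connections follow the description above.

```python
import torch
import torch.nn as nn


class SegmentationBackbone(nn.Module):
    """Structural sketch of the encoder-decoder in FIG. 5: four encoder stages,
    four decoder stages, and skip connections from encoder stage i to decoder
    stage 4 - i. EncUnit and DecUnit are stand-ins for the real FDS + DPCT and
    FUS + DPCT units sketched elsewhere in this description."""

    class EncUnit(nn.Module):
        def forward(self, f):                    # stand-in: the real unit down-samples
            return f

    class DecUnit(nn.Module):
        def forward(self, f, skip):              # stand-in: the real unit up-samples
            return f + skip                      # fuse with the skip-connected feature

    def __init__(self, channels: int = 3, num_part_labels: int = 50):
        super().__init__()
        self.enc = nn.ModuleList([self.EncUnit() for _ in range(4)])
        self.dec = nn.ModuleList([self.DecUnit() for _ in range(4)])
        self.fc = nn.Linear(channels, num_part_labels)   # per-point part scores

    def forward(self, f):                        # f: (B, N, C) per-point features
        e1 = self.enc[0](f)
        e2 = self.enc[1](e1)
        e3 = self.enc[2](e2)
        e4 = self.enc[3](e3)
        d1 = self.dec[0](e4, e3)   # 3rd encoder stage skip-connected to 1st decoder stage
        d2 = self.dec[1](d1, e2)   # 2nd encoder stage -> 2nd decoder stage
        d3 = self.dec[2](d2, e1)   # 1st encoder stage -> 3rd decoder stage
        d4 = self.dec[3](d3, f)    # skip from the raw input here is an assumption
        return self.fc(d4)         # (B, N, num_part_labels)


if __name__ == "__main__":
    net = SegmentationBackbone()
    print(net(torch.randn(2, 2048, 3)).shape)    # torch.Size([2, 2048, 50])
```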
FIG. 2 shows the attention-based Transformer model. In FIG. 2, N denotes the number of points in the point cloud model, C denotes the feature dimension, and Q, K and V denote the Query, Key and Value matrices in the self-attention mechanism, respectively. Because a multi-head attention mechanism is used, the lowercase q, k and v can be viewed as the per-head query, key and value matrices split from the capital Q, K and V. k^T and q^T denote matrix transposition operations, and d_c denotes the dimension of each head.
In an embodiment, each Transformer model includes a Point-Wise Self-Attention (PWSA) module (shown in the upper portion of FIG. 2) and a Channel-Wise Self-Attention (CWSA) module (shown in the lower portion of FIG. 2), and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
In one embodiment, the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules), MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l, and F_PWSA^{l+1} denotes the (l+1)-th layer point feature map obtained by this self-attention operation;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition.
In order to emphasize the importance of the interactions between different channels of the point feature map, the invention adopts the same basic idea as the point self-attention mechanism to construct a channel multi-head self-attention model, as shown in the lower part of FIG. 2. In one embodiment, the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l, and F_CWSA^{l+1} denotes the (l+1)-th layer point feature map obtained by the channel self-attention operation;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c) is the channel feature matrix representing the mutual influence among all channels; W' is the weight matrix of the fully connected layer, and d_c = C/M'; Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model; and (·)^T denotes transposition.
In one embodiment, to build multi-scale hierarchical features, a feature down-sampling (FDS) layer is added before the attention-based Transformer model. Specifically, the farthest point sampling (FPS) algorithm is applied to the input feature map F^l to generate a sub-feature map F'^l. Next, the features of all points in the spherical neighborhood of each point in the sub-feature map F'^l are aggregated and assigned to it, followed by a linear transformation, Batch Normalization (BN) and ReLU operations. This feature down-sampling (FDS) layer can be briefly summarized as follows:
F^{l+1} = ReLU(BN(W'_l(Agg(FPS(F^l)))))
where Agg(·) denotes the local feature aggregation operation, W'_l denotes the learnable weight parameters of the linear transformation, and FPS(·) denotes performing farthest point sampling on the l-th layer point feature map F^l to down-sample it.
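A self-contained PyTorch-style sketch of such an FDS layer is given below; the ball-query radius, the use of max pooling as the aggregation Agg(·), and the tensor shapes are assumptions, while the FPS, neighborhood aggregation, linear, BN and ReLU ordering follows the formula above.

```python
import torch
import torch.nn as nn


def farthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """xyz: (B, N, 3). Returns indices (B, n_samples) of a farthest-point subset."""
    b, n, _ = xyz.shape
    idx = torch.zeros(b, n_samples, dtype=torch.long, device=xyz.device)
    dist = torch.full((b, n), float("inf"), device=xyz.device)
    farthest = torch.zeros(b, dtype=torch.long, device=xyz.device)
    batch = torch.arange(b, device=xyz.device)
    for i in range(n_samples):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)              # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))
        farthest = dist.argmax(-1)                                # pick the farthest remaining point
    return idx


class FDS(nn.Module):
    """Feature down-sampling sketch: FPS, spherical-neighbourhood aggregation
    (max pooling over points within a radius, an assumed choice of Agg), then
    linear transformation + batch normalization + ReLU, as in
    F^{l+1} = ReLU(BN(W'_l(Agg(FPS(F^l)))))."""

    def __init__(self, in_channels: int, out_channels: int, n_samples: int, radius: float = 0.2):
        super().__init__()
        self.n_samples, self.radius = n_samples, radius
        self.linear = nn.Linear(in_channels, out_channels)
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, xyz: torch.Tensor, features: torch.Tensor):
        # xyz: (B, N, 3) point coordinates, features: (B, N, C) point features
        idx = farthest_point_sampling(xyz, self.n_samples)
        batch = torch.arange(xyz.size(0), device=xyz.device).unsqueeze(-1)
        new_xyz = xyz[batch, idx]                                            # (B, S, 3)
        d2 = ((new_xyz.unsqueeze(2) - xyz.unsqueeze(1)) ** 2).sum(-1)        # (B, S, N)
        in_ball = (d2 <= self.radius ** 2).unsqueeze(-1)                     # (B, S, N, 1)
        grouped = features.unsqueeze(1).expand(-1, self.n_samples, -1, -1)   # (B, S, N, C)
        grouped = torch.where(in_ball, grouped, torch.full_like(grouped, float("-inf")))
        agg = grouped.max(dim=2).values                                      # local feature aggregation
        out = self.linear(agg)                                               # W'_l
        out = torch.relu(self.bn(out.transpose(1, 2)).transpose(1, 2))       # BN over channels, then ReLU
        return new_xyz, out                                                  # sub-sampled coords and features


if __name__ == "__main__":
    fds = FDS(in_channels=64, out_channels=128, n_samples=512)
    pts, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 64)
    new_pts, new_feats = fds(pts, feats)
    print(new_pts.shape, new_feats.shape)    # torch.Size([2, 512, 3]) torch.Size([2, 512, 128])
```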
In one embodiment, for more accurate prediction in the segmentation task, a feature up-sampling layer is placed in the decoder portion to restore the resolution of the point feature map to the original point set size. The point set is up-sampled using a K-nearest-neighbor interpolation algorithm based on Euclidean distance.
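A sketch of such an up-sampling step is given below, assuming k = 3 neighbors and inverse-distance weighting; the text only specifies Euclidean k-nearest-neighbor interpolation, so both choices are assumptions.

```python
import torch


def knn_interpolate(dense_xyz: torch.Tensor, sparse_xyz: torch.Tensor,
                    sparse_features: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Up-sample a sparse point feature map to a denser point set by Euclidean
    k-nearest-neighbour interpolation with inverse-distance weighting.

    dense_xyz:       (B, N, 3) coordinates of the target (denser) points
    sparse_xyz:      (B, S, 3) coordinates of the source (sparser) points
    sparse_features: (B, S, C) features attached to the sparse points
    returns:         (B, N, C) interpolated features for the dense points
    """
    d2 = ((dense_xyz.unsqueeze(2) - sparse_xyz.unsqueeze(1)) ** 2).sum(-1)   # (B, N, S)
    d2, idx = d2.topk(k, dim=-1, largest=False)                              # k nearest neighbours
    weights = 1.0 / (d2 + 1e-8)
    weights = weights / weights.sum(-1, keepdim=True)                        # (B, N, k)
    batch = torch.arange(dense_xyz.size(0), device=dense_xyz.device).view(-1, 1, 1)
    neighbours = sparse_features[batch, idx]                                 # (B, N, k, C)
    return (weights.unsqueeze(-1) * neighbours).sum(dim=2)


if __name__ == "__main__":
    out = knn_interpolate(torch.randn(2, 2048, 3), torch.randn(2, 128, 3), torch.randn(2, 128, 256))
    print(out.shape)    # torch.Size([2, 2048, 256])
```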
In one embodiment, the ShapeNet part benchmark dataset, which consists of 16881 objects from 16 different categories with a total of 50 part labels, is selected to train and test the part segmentation performance of the segmentation network. The official training/testing split of 14007/2874 given by the dataset is followed, and 2048 points are sampled from each shape as input. Furthermore, the same data augmentation as in the classification task is applied. The evaluation metrics include the mean IoU (mIoU) over all part categories and the per-class IoU.
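For reference, the mean IoU of a single shape can be computed as in the sketch below; treating part labels absent from both the prediction and the ground truth as IoU 1 is a common ShapeNet-part convention assumed here, not something specified in this description.

```python
import numpy as np


def shape_mean_iou(pred: np.ndarray, gt: np.ndarray, part_ids) -> float:
    """Mean IoU over the part categories of one shape.
    pred, gt: (N,) arrays of per-point part labels; part_ids: labels of the
    parts belonging to this shape's object category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        # parts absent from both prediction and ground truth count as IoU 1
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))


# example: 5 points, two possible part labels 0 and 1
print(shape_mean_iou(np.array([0, 0, 1, 1, 1]), np.array([0, 1, 1, 1, 1]), [0, 1]))
```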
The architecture of the part segmentation network is shown in FIG. 5. The encoder structure (the first feature extraction units) is similar to that of the classification task, and the decoder (the second feature extraction units) adds three second feature extraction units (each including a feature up-sampling layer and an attention-based Transformer model). The number of points and channels used in each layer is as follows:
Input(N=2048,C=3)-FDS(N=512,C=320)-DPCT(N=512,C=320)-FDS(N=128,C=512)-DPCT(N=128,C=512)-FDS(N=1,C=1024)-DPCT(N=1,C=1024)-FUS(N=128,C=256)-DPCT(N=128)
Parameter settings: the network is trained for 80 epochs, with an initial learning rate of 0.0005 that is halved every 20 epochs.
Performance comparison: Table 2 gives a quantitative comparison of the method with the current state-of-the-art models. Unlike PointNet, PointNet++ and SO-Net, which feed in normal vectors together with the point coordinates, the dual Transformer model of the present invention uses only the XYZ coordinates as input features. The segmentation results show that the method of the invention achieves the highest mIoU of 85.6%, exceeding PointNet++ and the current best method SFCNN by 0.5% and 0.2%, respectively. In particular, the method of the present invention performs better than these competing methods on certain categories, such as chairs, lamps, skateboards and tables.
Table 2: component segmentation results based on ShapeNet dataset
The invention breaks through the traditional limitations of information loss and high computational complexity, requires no preprocessing such as voxelization or projection, can take raw point cloud data directly as input, is able to capture long-range context information of point clouds, and has better point cloud feature description capability. It is therefore suitable for application in fields such as computer vision, computer graphics, robotics and remote sensing, and has important practical value. For example, applications in the remote sensing field include large-scene remote sensing point cloud stitching and terrain scene reconstruction; applications in cultural heritage protection include building digital model libraries of ancient artifacts based on multi-view point cloud stitching and reconstruction; typical applications in computer vision are three-dimensional face recognition, three-dimensional object classification, detection and recognition, and pose tracking of three-dimensional moving objects; applications in the aerospace field include motion pose estimation of non-cooperative space targets; the main application in robotics is estimating the grasping and placing poses of objects; and applications in the defense field include precise air-to-ground target strikes.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may comprise any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A method for classifying a three-dimensional object, comprising:
acquiring three-dimensional point cloud data of a target to be classified;
performing feature extraction on the three-dimensional point cloud data by using a Transformer-based feature extraction module to obtain a feature map; and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the classified target.
2. The method of classifying a three-dimensional object according to claim 1, wherein the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:
a feature down-sampling layer and an attention-based Transformer model that are connected in sequence.
3. The method for classifying a three-dimensional object according to claim 2, wherein each Transformer model includes a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
4. The method of claim 3, wherein the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules) and MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition.
5. The method of claim 4, wherein the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where A'_m^{l+1} is the channel feature matrix, W' is the weight matrix of the fully connected layer, d_c = C/M', and Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model.
6. The three-dimensional object classification method according to claim 5, characterized in that the Transformer-based feature extraction module comprises 3 feature extraction units which are cascaded in sequence; the feature map passes through 3 cascaded fully connected layers to obtain the category of the target.
7. A method for segmenting a three-dimensional object, comprising:
classifying the target to be classified by using the classification method according to any one of claims 1 to 6 to obtain the classified target;
performing at least two times of feature extraction on the classified point cloud data of the target to obtain a feature map;
and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the segmentation result.
8. The method of claim 7, wherein the performing at least two feature extractions on the point cloud data of the classified objects comprises:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first feature extraction unit comprises a feature down-sampling layer and a feature extraction subunit, the second feature extraction unit comprises a feature up-sampling layer and a feature extraction subunit, and the feature extraction subunit comprises attention-based Transformer models connected in sequence.
9. The method of claim 8, wherein each Transformer model comprises a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention outputs:
F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}
where F^{l+1} denotes the feature map of the (l+1)-th layer output by the Transformer model, F_PWSA^{l+1} denotes the feature map obtained by performing the point multi-head self-attention operation on the l-th layer point feature map, and F_CWSA^{l+1} denotes the feature map obtained by performing the channel self-attention operation on the l-th layer feature map.
10. The three-dimensional object segmentation method of claim 9, wherein the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, ..., A_M^{l+1})
where M denotes the number of point self-attention heads (modules) and MHAT_PWSA(F^l) denotes the point multi-head self-attention operation performed on the l-th layer point feature map F^l;
A_m^{l+1} = σ(Q_m^{l+1} (K_m^{l+1})^T / √d_c) V_m^{l+1},  with  Q_m^{l+1} = F^l W_m^q,  K_m^{l+1} = F^l W_m^k,  V_m^{l+1} = F^l W_m^v
where m denotes the index of a point self-attention head, m = 1, 2, 3, ..., M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ denotes the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c, and C denotes the feature dimension; Q_m^{l+1}, K_m^{l+1}, V_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; and (·)^T denotes transposition;
the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l) = Concat(A'_1^{l+1}, A'_2^{l+1}, ..., A'_{M'}^{l+1}) W'
where MHAT_CWSA(F^l) denotes the channel multi-head self-attention operation performed on the l-th layer point feature map F^l;
A'_m^{l+1} = V'_m^{l+1} σ((Q'_m^{l+1})^T K'_m^{l+1} / √d_c)
where A'_m^{l+1} is the channel feature matrix, W' is the weight matrix of the fully connected layer, d_c = C/M', Q'_m^{l+1}, K'_m^{l+1}, V'_m^{l+1} denote the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model, and (·)^T denotes transposition.
CN202110560118.XA 2021-05-21 2021-05-21 Three-dimensional target classification and segmentation method Pending CN113159232A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560118.XA CN113159232A (en) 2021-05-21 2021-05-21 Three-dimensional target classification and segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110560118.XA CN113159232A (en) 2021-05-21 2021-05-21 Three-dimensional target classification and segmentation method

Publications (1)

Publication Number Publication Date
CN113159232A true CN113159232A (en) 2021-07-23

Family

ID=76877650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560118.XA Pending CN113159232A (en) 2021-05-21 2021-05-21 Three-dimensional target classification and segmentation method

Country Status (1)

Country Link
CN (1) CN113159232A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723208A (en) * 2021-08-05 2021-11-30 北京大学 Three-dimensional object shape classification method based on normative equal transformation conversion sub-neural network
CN114211490A (en) * 2021-12-17 2022-03-22 中山大学 Robot arm gripper pose prediction method based on Transformer model
CN114550162A (en) * 2022-02-16 2022-05-27 北京工业大学 Three-dimensional object identification method combining view importance network and self-attention mechanism
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method
CN116091751A (en) * 2022-09-09 2023-05-09 锋睿领创(珠海)科技有限公司 Point cloud classification method and device, computer equipment and storage medium
WO2023098000A1 (en) * 2021-11-30 2023-06-08 上海商汤智能科技有限公司 Image processing method and apparatus, defect detection method and apparatus, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN111489358A (en) * 2020-03-18 2020-08-04 华中科技大学 Three-dimensional point cloud semantic segmentation method based on deep learning
CN112633330A (en) * 2020-12-06 2021-04-09 西安电子科技大学 Point cloud segmentation method, system, medium, computer device, terminal and application

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN111489358A (en) * 2020-03-18 2020-08-04 华中科技大学 Three-dimensional point cloud semantic segmentation method based on deep learning
CN112633330A (en) * 2020-12-06 2021-04-09 西安电子科技大学 Point cloud segmentation method, system, medium, computer device, terminal and application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAN-FENG HAN et al.: "Dual Transformer for Point Cloud Analysis", Computer Vision and Pattern Recognition *
LIANG Duohan: "Research on Human Action Recognition Algorithms Based on 3D Skeletons", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723208A (en) * 2021-08-05 2021-11-30 北京大学 Three-dimensional object shape classification method based on normative equal transformation conversion sub-neural network
CN113723208B (en) * 2021-08-05 2023-10-20 北京大学 Three-dimensional object shape classification method based on canonical and other transformation conversion sub-neural network
WO2023098000A1 (en) * 2021-11-30 2023-06-08 上海商汤智能科技有限公司 Image processing method and apparatus, defect detection method and apparatus, electronic device and storage medium
CN114211490A (en) * 2021-12-17 2022-03-22 中山大学 Robot arm gripper pose prediction method based on Transformer model
CN114211490B (en) * 2021-12-17 2024-01-05 中山大学 Method for predicting pose of manipulator gripper based on transducer model
CN114550162A (en) * 2022-02-16 2022-05-27 北京工业大学 Three-dimensional object identification method combining view importance network and self-attention mechanism
CN114550162B (en) * 2022-02-16 2024-04-02 北京工业大学 Three-dimensional object recognition method combining view importance network and self-attention mechanism
CN116091751A (en) * 2022-09-09 2023-05-09 锋睿领创(珠海)科技有限公司 Point cloud classification method and device, computer equipment and storage medium
CN116091751B (en) * 2022-09-09 2023-09-05 锋睿领创(珠海)科技有限公司 Point cloud classification method and device, computer equipment and storage medium
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method

Similar Documents

Publication Publication Date Title
Zhang et al. A review of deep learning-based semantic segmentation for point cloud
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Riegler et al. Octnetfusion: Learning depth fusion from data
CN113159232A (en) Three-dimensional target classification and segmentation method
Wu et al. 3d shapenets: A deep representation for volumetric shapes
CN108921926A (en) A kind of end-to-end three-dimensional facial reconstruction method based on single image
CN113177555B (en) Target processing method and device based on cross-level, cross-scale and cross-attention mechanism
CN111414953B (en) Point cloud classification method and device
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
US20230206603A1 (en) High-precision point cloud completion method based on deep learning and device thereof
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
CN111695494A (en) Three-dimensional point cloud data classification method based on multi-view convolution pooling
CN113345106A (en) Three-dimensional point cloud analysis method and system based on multi-scale multi-level converter
Shi et al. Gesture recognition using spatiotemporal deformable convolutional representation
CN110781894A (en) Point cloud semantic segmentation method and device and electronic equipment
CN111652273A (en) Deep learning-based RGB-D image classification method
CN113569979A (en) Three-dimensional object point cloud classification method based on attention mechanism
CN113743417A (en) Semantic segmentation method and semantic segmentation device
Ahmad et al. 3D capsule networks for object classification from 3D model data
CN114627290A (en) Mechanical part image segmentation algorithm based on improved DeepLabV3+ network
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN114299339A (en) Three-dimensional point cloud model classification method and system based on regional correlation modeling
CN116452757B (en) Human body surface reconstruction method and system under complex scene
CN113011506B (en) Texture image classification method based on deep fractal spectrum network
CN111414802B (en) Protein data characteristic extraction method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210723