CN113159232A - Three-dimensional target classification and segmentation method - Google Patents
- Publication number
- CN113159232A (application number CN202110560118.XA)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
Abstract
The invention discloses a three-dimensional target classification and segmentation method. The classification method comprises the following steps: acquiring three-dimensional point cloud data of a target to be classified; performing feature extraction on the three-dimensional point cloud data with a Transformer-based feature extraction module to obtain a feature map; and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the class of the target. The invention takes raw point cloud data directly as input and requires no preprocessing such as voxelization or projection, so it is not limited by information loss or high computational complexity, can capture long-range context information, and has better point cloud feature expression capability.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a three-dimensional target classification and segmentation method.
Background
In the early stages of computer vision development, machine perception of the objective world depended primarily on two-dimensional images or image sequences captured by cameras. However, the world is three-dimensional in Euclidean space, and because an image captures only the world's projection from a particular viewpoint, it introduces uncertainty in characterizing the size and geometric properties of an object. In contrast, point clouds, the most primitive three-dimensional data representation, can accurately reflect the real size and shape structure of an object, and have gradually become another data form on which machine visual perception depends. With the emergence of 3D acquisition devices such as the Microsoft Kinect, Google Tango tablet and Intel RealSense, acquiring point cloud data has become as convenient as acquiring images, which further promotes the development of three-dimensional computer vision technology. 3D point clouds play an increasingly important role in fields such as virtual/augmented reality, autonomous driving and robotics, so effective point cloud analysis has become a problem to be solved urgently.
In recent years, deep learning techniques have enjoyed tremendous success in the computer graphics field, which provides opportunities for better understanding point clouds. However, a point cloud is composed of many discrete, unordered three-dimensional points without topological structure, the initial form of the data acquired by a three-dimensional sensing system. Researchers therefore need to preprocess point cloud data before processing it with a traditional convolutional neural network, and two main methods are currently adopted:
1. Multi-view based methods project point cloud data into a set of 2D images from certain specific viewpoints, such as a front view and a bird's-eye view, converting the 3D problem into a 2D problem so that a 2D neural network can be applied for feature learning. Image information from cameras can be fused at the same time, combining data from different viewpoints to realize object classification and part and semantic segmentation of point cloud data. The pioneering work MVCNN extracts multi-view features into a global descriptor using max-pooling. View-GCN constructs a directed graph with the views as nodes. Although this approach performs well on tasks such as object classification, it remains difficult to determine an appropriate number of views to cover a 3D object under the constraints of geometric information loss and high computational cost.
2. Voxelization divides the point cloud data into regular grids. This method partitions the three-dimensional space and introduces spatial dependence into the point cloud data, making it well suited to extracting feature representations with a three-dimensional convolutional neural network. Scholars have proposed methods such as OctNet and Kd-Net to gather data information and skip empty voxel grids. The PointGrid approach improves local geometric detail learning by including points in each mesh. However, the accuracy of these methods depends on how finely the three-dimensional space is partitioned, 3D convolution is computationally expensive, and they suffer from the cubic growth of computation and memory requirements with voxel resolution as well as from geometric information loss.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention provides a three-dimensional object classification and segmentation method to address them.
To achieve the above and other related objects, the present invention provides a three-dimensional object classification method, including:
acquiring three-dimensional point cloud data of a target to be classified;
performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map;
and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the class of the target.
Optionally, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; each feature extraction unit comprises a feature downsampling layer and an attention-based Transformer model connected in sequence.
Optionally, each of the Transformer models comprises a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention branches:

$F^{l+1} = F_{PWSA}^{l+1} + F_{CWSA}^{l+1}$

wherein $F^{l+1}$ denotes the layer-$(l+1)$ feature map output by the Transformer model, $F_{PWSA}^{l+1}$ denotes the feature map obtained by point multi-head self-attention on the layer-$l$ point feature map, and $F_{CWSA}^{l+1}$ denotes the feature map obtained by channel self-attention on the layer-$l$ feature map.

The point branch computes $F_{PWSA}^{l+1} = \mathrm{MHAT}_{PWSA}(F^l) = \mathrm{concat}(A_1^{l+1}, \dots, A_M^{l+1})\,W_o$, the point multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = \sigma\big(Q_m^{l+1} (K_m^{l+1})^T / \sqrt{d_k}\big)\, V_m^{l+1}$, $\quad Q_m^{l+1} = F^l W_q^m$, $K_m^{l+1} = F^l W_k^m$, $V_m^{l+1} = F^l W_v^m$,

wherein $m$ denotes the index of a point self-attention head, $m = 1, 2, \dots, M$; $A_m^{l+1}$ is the point spatial feature matrix of the $m$-th point self-attention head, $\sigma$ denotes the softmax operation, $W_q^m, W_k^m, W_v^m$ are learnable weight parameters of three linear layers with $d_q = d_k = d_v = d_c$, $C$ denotes the feature dimension, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ point multi-head attention model, and $(\cdot)^T$ denotes transposition.

The channel branch computes $F_{CWSA}^{l+1} = \mathrm{MHAT}_{CWSA}(F^l)$, the channel multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = V_m^{l+1}\, \sigma\big((K_m^{l+1})^T Q_m^{l+1}\big)$,

wherein $A_m^{l+1}$ is the channel feature matrix capturing the mutual influence of all channels, the $W_q^m, W_k^m, W_v^m$ are weight matrices of fully connected layers with $d_c = C/M'$, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ channel multi-head attention model, and $(\cdot)^T$ denotes transposition.
Optionally, the Transformer-based feature extraction module includes 3 feature extraction units cascaded in sequence, and the feature map passes through 3 cascaded fully connected layers to obtain the class of the target.
To achieve the above and other related objects, the present invention provides a three-dimensional object segmentation method, comprising:
classifying the target to be classified by using the classification method to obtain a classified target;
performing at least two times of feature extraction on the classified point cloud data of the target to obtain a feature map;
and inputting the feature map into a fully connected module formed by connecting a plurality of fully connected layers to obtain the segmentation result.
Optionally, the performing at least two feature extractions on the point cloud data of the classified target includes:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first feature extraction unit comprises a feature downsampling layer and a feature extraction subunit, the second feature extraction unit comprises a feature upsampling layer and a feature extraction subunit, and each feature extraction subunit comprises an attention-based Transformer model.
Optionally, each of the Transformer models comprises a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention branches:

$F^{l+1} = F_{PWSA}^{l+1} + F_{CWSA}^{l+1}$

wherein $F^{l+1}$ denotes the layer-$(l+1)$ feature map output by the Transformer model, $F_{PWSA}^{l+1}$ denotes the feature map obtained by point multi-head self-attention on the layer-$l$ point feature map, and $F_{CWSA}^{l+1}$ denotes the feature map obtained by channel self-attention on the layer-$l$ feature map.

The point branch computes $F_{PWSA}^{l+1} = \mathrm{MHAT}_{PWSA}(F^l) = \mathrm{concat}(A_1^{l+1}, \dots, A_M^{l+1})\,W_o$, the point multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = \sigma\big(Q_m^{l+1} (K_m^{l+1})^T / \sqrt{d_k}\big)\, V_m^{l+1}$, $\quad Q_m^{l+1} = F^l W_q^m$, $K_m^{l+1} = F^l W_k^m$, $V_m^{l+1} = F^l W_v^m$,

wherein $m$ denotes the index of a point self-attention head, $m = 1, 2, \dots, M$; $A_m^{l+1}$ is the point spatial feature matrix of the $m$-th point self-attention head, $\sigma$ denotes the softmax operation, $W_q^m, W_k^m, W_v^m$ are learnable weight parameters of three linear layers with $d_q = d_k = d_v = d_c$, $C$ denotes the feature dimension, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ point multi-head attention model, and $(\cdot)^T$ denotes transposition;

The channel branch computes $F_{CWSA}^{l+1} = \mathrm{MHAT}_{CWSA}(F^l)$, the channel multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = V_m^{l+1}\, \sigma\big((K_m^{l+1})^T Q_m^{l+1}\big)$,

wherein $A_m^{l+1}$ is the channel feature matrix capturing the mutual influence of all channels, the $W_q^m, W_k^m, W_v^m$ are weight matrices of fully connected layers with $d_c = C/M'$, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ channel multi-head attention model, and $(\cdot)^T$ denotes transposition.
As described above, the three-dimensional object classification and segmentation method of the present invention has the following beneficial effects:
the invention directly takes the original point cloud data as input, and does not need any preprocessing methods such as voxelization or projection and the like, so the invention is not limited by information loss and high calculation complexity, can capture context information in a long range, and has better point cloud characteristic expression capability.
Drawings
FIG. 1 is a flowchart illustrating a three-dimensional object classification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a Transformer model based on the attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a classification network according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a three-dimensional object segmentation method according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a segmentation network according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, an embodiment of the present application provides a three-dimensional object classification method, including:
s11, acquiring three-dimensional point cloud data of the target to be classified;
s12, extracting the characteristics of the three-dimensional point cloud data by using a Transformer-based characteristic extraction module to obtain a characteristic diagram;
s13, inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain classified targets.
The invention takes raw point cloud data directly as input and requires no preprocessing such as voxelization or projection, so it is not limited by information loss or high computational complexity, can capture long-range context information, and has better point cloud feature expression capability.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, the Transformer-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:
the method comprises a feature downsampling layer and a Transformer model based on an attention mechanism which are connected in sequence.
FIG. 2 shows the attention-based Transformer model. In fig. 2, N represents the number of points of the point cloud model, C represents the feature dimension, and Q, K and V represent the query (Query), key (Key) and value (Value) matrices in the self-attention mechanism, respectively. Because a multi-head attention mechanism is used, q, k and v can be viewed as the query, key and value matrices of each head split from the full Q, K, V. $k^T$ and $q^T$ denote matrix transposition, and $d_c$ denotes the dimension of each head.
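The point-wise multi-head self-attention just described can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the head splitting, weight shapes and softmax scaling follow the standard multi-head attention recipe and are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax (the sigma operation in the text)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_wise_self_attention(F, Wq, Wk, Wv, Wo, M=4):
    """Point-wise multi-head self-attention (PWSA) over an (N, C) feature map.

    F            : layer-l point feature map, shape (N, C)
    Wq, Wk, Wv   : (C, C) learnable linear-layer weights (d_q = d_k = d_v = C/M per head)
    Wo           : (C, C) output projection applied to the concatenated heads
    """
    N, C = F.shape
    d_c = C // M                                 # per-head dimension
    Q, K, V = F @ Wq, F @ Wk, F @ Wv             # (N, C) each
    heads = []
    for m in range(M):
        s = slice(m * d_c, (m + 1) * d_c)        # split the m-th head
        Qm, Km, Vm = Q[:, s], K[:, s], V[:, s]
        A = softmax(Qm @ Km.T / np.sqrt(d_c))    # (N, N) point-to-point attention
        heads.append(A @ Vm)                     # (N, d_c)
    return np.concatenate(heads, axis=1) @ Wo    # (N, C)
```

The (N, N) attention map lets every point attend to every other point, which is what gives the model its long-range context.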
In an embodiment, each of the Transformer models includes a point-wise self-attention (PWSA) module (shown in the upper portion of fig. 2) and a channel-wise self-attention (CWSA) module (shown in the lower portion of fig. 2), and the Transformer model obtains the feature map by aggregating the point self-attention and channel self-attention branches:

$F^{l+1} = F_{PWSA}^{l+1} + F_{CWSA}^{l+1}$

where $F^{l+1}$ denotes the layer-$(l+1)$ feature map output by the Transformer model, $F_{PWSA}^{l+1}$ denotes the feature map obtained by point multi-head self-attention on the layer-$l$ point feature map, and $F_{CWSA}^{l+1}$ denotes the feature map obtained by channel self-attention on the layer-$l$ feature map.

The point branch computes $F_{PWSA}^{l+1} = \mathrm{MHAT}_{PWSA}(F^l) = \mathrm{concat}(A_1^{l+1}, \dots, A_M^{l+1})\,W_o$, the point multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = \sigma\big(Q_m^{l+1} (K_m^{l+1})^T / \sqrt{d_k}\big)\, V_m^{l+1}$, $\quad Q_m^{l+1} = F^l W_q^m$, $K_m^{l+1} = F^l W_k^m$, $V_m^{l+1} = F^l W_v^m$,

where $m$ denotes the index of a point self-attention head, $m = 1, 2, \dots, M$; $A_m^{l+1}$ is the point spatial feature matrix of the $m$-th point self-attention head, $\sigma$ denotes the softmax operation, $W_q^m, W_k^m, W_v^m$ are learnable weight parameters of three linear layers with $d_q = d_k = d_v = d_c$, $C$ denotes the feature dimension, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ point multi-head attention model, and $(\cdot)^T$ denotes transposition.

In order to emphasize the importance of the interaction between different point feature map channels, the invention adopts a basic idea similar to the point self-attention mechanism to construct a channel multi-head self-attention model, as shown in the lower part of fig. 2. In an embodiment, the channel self-attention model is expressed as:

$F_{CWSA}^{l+1} = \mathrm{MHAT}_{CWSA}(F^l)$, the channel multi-head self-attention on the layer-$l$ point feature map $F^l$, with

$A_m^{l+1} = V_m^{l+1}\, \sigma\big((K_m^{l+1})^T Q_m^{l+1}\big)$,

where the channel feature matrix $\sigma\big((K_m^{l+1})^T Q_m^{l+1}\big)$ represents the mutual influence of all channels, the $W_q^m, W_k^m, W_v^m$ are weight matrices of fully connected layers with $d_c = C/M$, $Q_m^{l+1}, K_m^{l+1}, V_m^{l+1}$ respectively denote the query, key and value matrices of the $m$-th head of the layer-$(l+1)$ channel multi-head attention model, and $(\cdot)^T$ denotes transposition.
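Under the same simplifying assumptions as the point-wise sketch, the channel-wise branch can be illustrated as follows. The attention map here is computed between channels, so it is only (d_c x d_c) per head and its cost does not grow with the number of points N; the layer output is then the elementwise sum of the PWSA and CWSA branch outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_self_attention(F, Wq, Wk, Wv, Wo, M=4):
    """Channel-wise multi-head self-attention (CWSA) over an (N, C) feature map.

    Attention is taken between channels rather than between points; the
    1/sqrt(N) scaling is an assumption, chosen by analogy with the point branch.
    """
    N, C = F.shape
    d_c = C // M
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    heads = []
    for m in range(M):
        s = slice(m * d_c, (m + 1) * d_c)
        Qm, Km, Vm = Q[:, s], K[:, s], V[:, s]
        # (d_c, d_c) channel-to-channel attention map: sigma(K^T Q)
        A = softmax(Km.T @ Qm / np.sqrt(N), axis=0)
        heads.append(Vm @ A)                     # (N, d_c)
    return np.concatenate(heads, axis=1) @ Wo    # (N, C)
```

Summing this branch's output with the point-wise branch's output implements the aggregation F^{l+1} = F_PWSA^{l+1} + F_CWSA^{l+1}.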
In one embodiment, the Transformer-based feature extraction module includes 3 feature extraction units cascaded in sequence, and the feature map passes through 3 cascaded fully connected layers to obtain the class of the target.
In one embodiment, the ModelNet dataset is used for the classification task, with overall accuracy (OA) as the evaluation index for the classification network shown in FIG. 3.
The ModelNet dataset consists of 12311 CAD models from 40 classes, with 9843 shapes used for training and 2468 objects for testing. 1024 points are sampled from each model following the PointNet protocol. During training, the data are augmented with random point dropout, random scaling in [0.8, 1.25], and random shifts in [-0.1, 0.1].
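The three augmentations named above can be sketched as below. The dropout-replacement convention (dropped points overwritten by the first point so the tensor keeps N rows) and the parameter names follow common PointNet-style pipelines and are assumptions, not details from the patent.

```python
import numpy as np

def augment_point_cloud(points, rng, drop_ratio_max=0.875,
                        scale_lo=0.8, scale_hi=1.25, shift_max=0.1):
    """Training-time augmentation for an (N, 3) point cloud.

    Applies random point dropout, random uniform scaling in
    [scale_lo, scale_hi], and a random per-axis shift in
    [-shift_max, shift_max].
    """
    pts = points.copy()
    # random point dropout: dropped points are replaced with the first point
    ratio = rng.uniform(0.0, drop_ratio_max)
    mask = rng.random(len(pts)) <= ratio
    pts[mask] = pts[0]
    # random uniform scaling in [0.8, 1.25]
    pts *= rng.uniform(scale_lo, scale_hi)
    # random shift in [-0.1, 0.1], independently per axis
    pts += rng.uniform(-shift_max, shift_max, size=(1, 3))
    return pts
```

Replacing dropped points rather than deleting them keeps the batch tensor rectangular, which is why this convention is popular for fixed-size point cloud networks.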
The network for the object classification task is shown in fig. 3. It comprises a Transformer-based feature extraction module followed by three cascaded fully connected layers. The feature extraction module is formed by cascading three feature extraction units, each consisting of a feature downsampling layer FDS and an attention-based Transformer model DPCT connected in sequence. The number of points and channels used in each layer is as follows:
INPUT(N=1024,C=3)-FDS(N=512,C=128)-DPCT(N=512,C=320)-FDS(N=128,C=256)-DPCT(N=256,C=640)-FDS(N=1,C=1024)-DPCT(C=1024)-FC(512)-FC(256)-FC(40)
Parameter settings: the network is trained for 150 iterations with a batch size of 16, an initial learning rate of 0.001, and a decay factor of 0.7 applied every 20 iterations.
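The stated schedule (initial learning rate 0.001, multiplied by 0.7 every 20 iterations) is a simple step decay, which can be sketched as:

```python
def step_decay_lr(epoch, base_lr=0.001, gamma=0.7, step=20):
    """Learning rate in effect after `epoch` full iterations under step decay."""
    return base_lr * (gamma ** (epoch // step))
```

Over the 150 stated iterations the rate falls from 0.001 to 0.001 * 0.7**7, a reduction of roughly 12x.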
Performance comparison: Table 1 shows a quantitative comparison of the present invention with other techniques. The results show that the proposed network achieves an overall accuracy of 92.9%, exceeding PointNet by 3.7% and Point2Sequence by 0.3%, which demonstrates the effectiveness of the model.
Table 1: object classification result based on ModelNet40
The method constructs an attention-based Transformer model. First, the Transformer is permutation invariant, eliminating the variation caused by different input orderings of the points, so the method can operate directly on the point cloud without preprocessing such as multi-view projection or voxelization, greatly reducing the loss of geometric information. Second, features are extracted along both the spatial and the channel correlations to capture the dependency of contextual semantic features, which enhances the representation capability of the deeply fused features and provides important support for accurate understanding of three-dimensional point cloud scenes.
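The permutation property claimed above can be checked directly: single-head self-attention (used here as a simplified stand-in for the DPCT block) is permutation-equivariant, so permuting the input points permutes the output rows identically, and a symmetric readout such as max-pooling over points is permutation-invariant.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Single-head self-attention over an (N, C) point feature map."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    return softmax(Q @ K.T / np.sqrt(F.shape[1])) @ V

rng = np.random.default_rng(0)
F = rng.standard_normal((32, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
perm = rng.permutation(32)

out = self_attention(F, Wq, Wk, Wv)            # original point order
out_perm = self_attention(F[perm], Wq, Wk, Wv)  # shuffled point order
```

Row `i` of `out_perm` equals row `perm[i]` of `out`, and the max-pooled global descriptor is identical for both orderings.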
As shown in fig. 4, an embodiment of the present application provides a three-dimensional object segmentation method, where object segmentation is performed on the basis of object classification. Classification determines the object to which the entire point cloud model belongs; for example, after classification a point cloud model is labeled as a chair. Segmentation is more fine-grained and classifies each point of the point cloud model; for example, segmenting the chair model distinguishes which part is the backrest, which parts are the legs, which part is the seat, and so on.
The three-dimensional object segmentation method shown in fig. 4 includes the following steps:
s41, classifying the target to be classified by the classification method to obtain the classified target;
s42, performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map;
s43 inputs the feature map into a fully connected module formed by connecting a plurality of fully connected layers, and the feature map is divided.
In an embodiment, the performing at least two times of feature extraction on the point cloud data of the classified target to obtain a feature map includes:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first feature extraction unit includes a feature downsampling layer (Feature Down Sample Layer) and a feature extraction subunit, the second feature extraction unit includes a feature upsampling layer (Feature Up Sample Layer) and a feature extraction subunit, and each feature extraction subunit comprises an attention-based Transformer model.
As shown in fig. 5, the first feature extraction module includes 4 first feature extraction units, namely first-stage through fourth-stage first feature extraction units, and the second feature extraction module includes 4 second feature extraction units, namely first-stage through fourth-stage second feature extraction units. As shown in the figure, the four first feature extraction units are connected in sequence, and the four second feature extraction units are connected in sequence. The first-stage first feature extraction unit is connected with the third-stage second feature extraction unit, the second-stage first feature extraction unit is connected with the second-stage second feature extraction unit, and the third-stage first feature extraction unit is connected with the first-stage second feature extraction unit. The last-stage second feature extraction unit is connected to a fully connected layer (Fully Connected Layer), after which the segmentation result is obtained.
The first feature extraction unit comprises a feature downsampling layer and an attention-based Transformer model (Dual Point Cloud Transformer) connected to its output; the second feature extraction unit comprises a feature upsampling layer and an attention-based Transformer model connected to its output.
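The wiring described above follows the familiar U-shaped encoder-decoder pattern: each upsampling unit fuses interpolated coarse features with the feature map of the matching downsampling stage through a skip connection. A shape-level sketch with illustrative sizes; the naive stride subsampling and nearest-neighbor interpolation below stand in for the patent's downsampling and upsampling layers, and the Transformer blocks are omitted.

```python
import numpy as np

def fds(points, feats, stride=2):
    """Feature downsampling stand-in: keep every `stride`-th point."""
    return points[::stride], feats[::stride]

def fus(coarse_pts, coarse_feats, fine_pts, fine_feats):
    """Feature upsampling stand-in: nearest-neighbor interpolation
    of coarse features to the fine points, then skip concatenation."""
    d = ((fine_pts[:, None, :] - coarse_pts[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)                  # index of nearest coarse point
    upsampled = coarse_feats[nearest]           # (N_fine, C_coarse)
    return np.concatenate([upsampled, fine_feats], axis=1)  # skip connection

rng = np.random.default_rng(0)
pts0 = rng.standard_normal((64, 3))
f0 = rng.standard_normal((64, 16))

# encoder: two downsampling stages
pts1, f1 = fds(pts0, f0)        # 32 points
pts2, f2 = fds(pts1, f1)        # 16 points

# decoder: mirrored upsampling with skips to the matching encoder stage
g1 = fus(pts2, f2, pts1, f1)    # (32, 32)
g0 = fus(pts1, g1, pts0, f0)    # (64, 48), one feature row per input point
```

The final per-point features (`g0` here) are what the fully connected layer consumes to assign a part label to every point.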
FIG. 2 shows the attention-based Transformer model. In fig. 2, N represents the number of points of the point cloud model, C represents the feature dimension, and Q, K and V represent the query (Query), key (Key) and value (Value) matrices in the self-attention mechanism, respectively. Because a multi-head attention mechanism is used, q, k and v can be viewed as the query, key and value matrices of each head split from the full Q, K, V. $k^T$ and $q^T$ denote matrix transposition, and $d_c$ denotes the dimension of each head.
In an embodiment, each of the Transformer models includes a point-wise self-attention (PWSA) module (as shown in the upper portion of fig. 2) and a channel-wise self-attention (CWSA) module (as shown in the lower portion of fig. 2), and the Transformer model obtains the feature map by aggregating the point self-attention mechanism and the channel self-attention mechanism;
wherein F^{l+1} represents the feature map of the (l+1)-th layer output by the Transformer model, obtained by aggregating F_PWSA^{l+1} and F_CWSA^{l+1}; F_PWSA^{l+1} represents the point feature map obtained by performing the point-wise multi-head self-attention operation on the l-th layer point feature map; F_CWSA^{l+1} represents the feature map obtained by performing the channel-wise self-attention operation on the l-th layer feature map.
F_PWSA^{l+1} = MHAT_PWSA(F^l), wherein MHAT_PWSA(F^l) denotes performing the point-wise multi-head self-attention operation on the l-th layer point feature map F^l, and F_PWSA^{l+1} represents the (l+1)-th layer point feature map obtained by the above self-attention calculation.
Specifically, MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, …, A_M^{l+1}), with A_m^{l+1} = σ(q_m^{l+1}(k_m^{l+1})^T/√d_c)·v_m^{l+1} and q_m^{l+1} = F^l·W_m^q, k_m^{l+1} = F^l·W_m^k, v_m^{l+1} = F^l·W_m^v, where m represents the index of a point self-attention head, m = 1, 2, 3, …, M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ represents the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c; C represents the feature dimension; q_m^{l+1}, k_m^{l+1}, v_m^{l+1} respectively represent the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; (·)^T represents transposition.
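The point-wise multi-head self-attention above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `point_wise_self_attention` and the random projection matrices are invented for demonstration, biases and the subsequent aggregation step are omitted, and the per-head dimension is taken as d_c = C/M.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_wise_self_attention(F, Wq, Wk, Wv, num_heads):
    """Multi-head point-wise self-attention over an (N, C) point feature map.

    Each head m computes A_m = softmax(q_m k_m^T / sqrt(d_c)) v_m, where
    q_m, k_m, v_m are (N, d_c) slices of the projected features and
    d_c = C / num_heads. Head outputs are concatenated back to (N, C).
    """
    N, C = F.shape
    d_c = C // num_heads
    q, k, v = F @ Wq, F @ Wk, F @ Wv              # each (N, C)
    heads = []
    for m in range(num_heads):
        qm = q[:, m * d_c:(m + 1) * d_c]          # (N, d_c) slice for head m
        km = k[:, m * d_c:(m + 1) * d_c]
        vm = v[:, m * d_c:(m + 1) * d_c]
        attn = softmax(qm @ km.T / np.sqrt(d_c))  # (N, N) point-to-point affinity
        heads.append(attn @ vm)                   # (N, d_c) attended features
    return np.concatenate(heads, axis=-1)         # (N, C)
```

The (N, N) affinity matrix is what gives the layer its long-range context: every output point is a softmax-weighted mixture over all N input points.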
In order to emphasize the importance of the interaction between different channels of the point feature map, the invention constructs a channel-wise multi-head self-attention model following the same basic idea as the point self-attention mechanism, as shown in the lower part of fig. 2. In one embodiment, the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l), wherein MHAT_CWSA(F^l) denotes performing the channel-wise multi-head self-attention operation on the l-th layer point feature map F^l, and F_CWSA^{l+1} represents the (l+1)-th layer feature map obtained through the channel self-attention operation;
wherein A_m^{l+1} = σ((q_m^{l+1})^T·k_m^{l+1}/√d_c) is the channel feature matrix representing the mutual influence of all channels; W^O is the weight matrix of the fully connected layer applied to the concatenated head outputs, with d_c = C/M'; q_m^{l+1}, k_m^{l+1}, v_m^{l+1} respectively represent the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model; (·)^T represents transposition.
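A matching NumPy sketch of the channel-wise branch follows. As with the previous sketch, the function name and random weights are assumptions for illustration only; the key difference from the point-wise branch is that each head forms a (d_c, d_c) affinity between channels instead of an (N, N) affinity between points.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_self_attention(F, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head channel-wise self-attention over an (N, C) feature map.

    Each head m builds a (d_c, d_c) channel affinity
    A_m = softmax(q_m^T k_m / sqrt(d_c)) and applies it as v_m A_m;
    head outputs are concatenated and mixed by a fully connected layer Wo.
    """
    N, C = F.shape
    d_c = C // num_heads
    q, k, v = F @ Wq, F @ Wk, F @ Wv
    heads = []
    for m in range(num_heads):
        qm = q[:, m * d_c:(m + 1) * d_c]
        km = k[:, m * d_c:(m + 1) * d_c]
        vm = v[:, m * d_c:(m + 1) * d_c]
        attn = softmax(qm.T @ km / np.sqrt(d_c))  # (d_c, d_c) channel affinity
        heads.append(vm @ attn)                   # (N, d_c)
    return np.concatenate(heads, axis=-1) @ Wo    # (N, C) after the FC mix
```

Because the affinity matrix is only (d_c, d_c), this branch is cheap even for large point counts N, which is one design motivation for treating channels rather than points as attention tokens.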
In one embodiment, in order to build multi-scale hierarchical features, a feature downsampling layer (FDS) is added before the attention-based Transformer model. Specifically, the farthest point sampling algorithm (FPS) is performed on the input feature map F^l to generate a sub-feature map F'^l. Next, for the sub-feature map F'^l, the features of all points in each spherical neighborhood are aggregated and assigned to the corresponding sampled point, followed by a linear transformation, Batch Normalization (BN) and ReLU operations. This feature downsampling (FDS) layer can be briefly summarized as follows:
F^{l+1} = ReLU(BN(W'_l(Agg(FPS(F^l)))))
wherein Agg(·) denotes the local feature aggregation operation, W'_l denotes the learnable weight parameters of the linear transformation, and FPS(·) denotes performing the farthest point sampling operation on the l-th layer point feature map F^l for downsampling.
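The FPS(·) and Agg(·) steps can be sketched as follows. This is a simplified stand-in, not the patented layer: max-pooling inside a fixed-radius ball is assumed for Agg(·), the starting point of FPS is fixed at index 0, and the linear/BN/ReLU stage is omitted.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples):
    """Greedy farthest point sampling: pick n_samples indices from an (N, 3)
    point set so each new point is the one farthest from those already chosen."""
    chosen = [0]                                   # start from the first point
    dist = np.linalg.norm(xyz - xyz[0], axis=1)    # distance to nearest chosen point
    for _ in range(n_samples - 1):
        nxt = int(dist.argmax())                   # farthest remaining point
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def ball_query_aggregate(xyz, feats, centers_idx, radius):
    """For each sampled center, max-pool the features of all points inside its
    spherical neighborhood (a simple stand-in for the Agg(.) operation)."""
    out = []
    for i in centers_idx:
        mask = np.linalg.norm(xyz - xyz[i], axis=1) <= radius
        out.append(feats[mask].max(axis=0))        # neighborhood always contains i
    return np.stack(out)
```

Each FDS call thus halves (or more) the point count while widening the feature channels, producing the multi-scale hierarchy the text describes.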
In one embodiment, for more accurate prediction in the segmentation task, a feature upsampling layer is placed in the decoder portion to restore the resolution of the point feature map to the original point cloud size. The point set is upsampled using a Euclidean-distance-based K-nearest-neighbor interpolation algorithm.
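A minimal sketch of that upsampling step is given below, assuming standard inverse-distance-weighted K-nearest-neighbor interpolation (the exact weighting used in the patent is not specified; the function name and the `eps` guard are assumptions).

```python
import numpy as np

def knn_interpolate(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Inverse-distance weighted k-NN interpolation: lift features defined on a
    sparse point set back onto a denser point set, as the decoder's feature
    upsampling layer does."""
    out = np.empty((dense_xyz.shape[0], sparse_feats.shape[1]))
    for i, p in enumerate(dense_xyz):
        d = np.linalg.norm(sparse_xyz - p, axis=1)  # Euclidean distances
        nn = np.argsort(d)[:k]                      # k nearest sparse points
        w = 1.0 / (d[nn] + eps)                     # inverse-distance weights
        w /= w.sum()
        out[i] = w @ sparse_feats[nn]               # weighted feature average
    return out
```

When a dense point coincides with a sparse point, its weight dominates and the original feature is recovered almost exactly, so the interpolation is consistent on the retained points.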
In one embodiment, the ShapeNet Part benchmark dataset, consisting of 16881 objects from 16 different classes with a total of 50 part labels, is selected to train and test the part segmentation effect of the segmentation network. The official training/testing split of 14007/2874 given by the dataset is followed, and 2048 points are sampled from each shape as input. Furthermore, the same data augmentation as in the classification task is performed. The evaluation indicators include the mean IoU (mIoU) over all part categories and the per-class IoU.
The architecture of the part segmentation network is shown in fig. 5. The encoder (the first feature extraction units) is similar to that of the classification task, and the decoder (the second feature extraction units) adds three second feature extraction units (each including a feature upsampling layer and an attention-based Transformer model). The number of points and channels used per layer is as follows:
Input(N=2048,C=3)-FDS(N=512,C=320)-DPCT(N=512,C=320)-FDS(N=128,C=512)-DPCT(N=128,C=512)-FDS(N=1,C=1024)-DPCT(N=1,C=1024)-FUS(N=128,C=256)-DPCT(N=128)
Parameter settings: the network is trained for 80 epochs, with an initial learning rate of 0.0005 that is reduced by 50% every 20 epochs.
Performance comparison: table 2 gives a quantitative comparison of the method with current state-of-the-art models. Unlike PointNet, PointNet++ and SO-Net, which feed in normal vectors and point coordinates simultaneously, the dual Transformer model of the present invention uses only XYZ coordinates as input features. The segmentation results show that the method of the invention achieves the highest mIoU of 85.6%, exceeding PointNet++ by 0.5% and the current best method SFCNN by 0.2%. In particular, the method of the present invention performs better than these competing methods on certain categories, such as chairs, lamps, skateboards and tables.
Table 2: component segmentation results based on ShapeNet dataset
The invention breaks through the traditional restrictions of information loss and high computational complexity: it requires no preprocessing such as voxelization or projection, can directly take raw point cloud data as input, can capture long-range context information of point clouds, and has better point cloud feature description capability. It is therefore suitable for popularization and application in fields such as computer vision, computer graphics, robotics and remote sensing, and has important practical application value. For example, applications in the remote sensing field include large-scene remote sensing point cloud registration and terrain scene reconstruction; applications in the cultural heritage protection field include building digital model bases of ancient relics based on multi-view point cloud registration and reconstruction; typical applications in the computer vision field are three-dimensional face recognition, three-dimensional target classification, detection and recognition, and gesture tracking of three-dimensional moving objects; applications in the aerospace field include motion pose estimation of space non-cooperative targets; the main application in the robotics field is the estimation of object grasping and placing poses; applications in the national defense field include precise air-to-ground target striking.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may comprise any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.
Claims (10)
1. A method for classifying a three-dimensional object, comprising:
acquiring three-dimensional point cloud data of a target to be classified;
performing feature extraction on the three-dimensional point cloud data by using a feature extraction module based on a Transformer to obtain a feature map; and inputting the characteristic diagram into a full-connection module formed by connecting a plurality of full-connection layers to obtain a classified target.
2. The method of classifying a three-dimensional object according to claim 1, wherein the transform-based feature extraction module is formed by cascading a plurality of feature extraction units; the feature extraction unit includes:
the method comprises a feature downsampling layer and a Transformer model based on an attention mechanism which are connected in sequence.
3. The method for classifying a three-dimensional object according to claim 2, wherein each of the Transformer models includes a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer models obtain the feature map by aggregating a point self-attention mechanism and a channel self-attention mechanism;
wherein F^{l+1} represents the feature map of the (l+1)-th layer output by the Transformer model, obtained by aggregating F_PWSA^{l+1} and F_CWSA^{l+1}; F_PWSA^{l+1} represents the point feature map obtained by performing the point-wise multi-head self-attention operation on the l-th layer point feature map; F_CWSA^{l+1} represents the feature map obtained by performing the channel-wise self-attention operation on the l-th layer feature map.
F_PWSA^{l+1} = MHAT_PWSA(F^l), wherein MHAT_PWSA(F^l) denotes performing the point-wise multi-head self-attention operation on the l-th layer point feature map F^l;
Specifically, MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, …, A_M^{l+1}), with A_m^{l+1} = σ(q_m^{l+1}(k_m^{l+1})^T/√d_c)·v_m^{l+1} and q_m^{l+1} = F^l·W_m^q, k_m^{l+1} = F^l·W_m^k, v_m^{l+1} = F^l·W_m^v, where m represents the index of a point self-attention head, m = 1, 2, 3, …, M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ represents the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c; C represents the feature dimension; q_m^{l+1}, k_m^{l+1}, v_m^{l+1} respectively represent the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; (·)^T represents transposition.
5. The method of claim 4, wherein the channel self-attention mechanism model F_CWSA^{l+1} is expressed as:
F_CWSA^{l+1} = MHAT_CWSA(F^l), wherein MHAT_CWSA(F^l) denotes performing the channel-wise multi-head self-attention operation on the l-th layer point feature map F^l,
6. The three-dimensional object classification method according to claim 5, characterized in that the Transformer-based feature extraction module comprises 3 feature extraction units which are cascaded in sequence; the feature map passes through 3 cascaded fully connected layers to obtain the category of the target.
7. A method for segmenting a three-dimensional object, comprising:
classifying the target to be classified by using the classification method according to any one of claims 1 to 6 to obtain the classified target;
performing at least two times of feature extraction on the classified point cloud data of the target to obtain a feature map;
and inputting the feature map into a full-connection module formed by connecting a plurality of fully connected layers to obtain the segmentation target.
8. The method of claim 7, wherein the performing at least two feature extractions on the point cloud data of the classified objects comprises:
performing at least one time of feature extraction on the point cloud data of the classified target in a first feature extraction mode through a first feature extraction module to obtain a first feature map;
performing at least one time of feature extraction on the first feature map in a second feature extraction mode through a second feature extraction module to obtain a feature map;
the first feature extraction module comprises a first-stage first feature extraction unit to an Nth-stage first feature extraction unit, the second feature extraction module comprises a first-stage second feature extraction unit to an Nth-stage second feature extraction unit, and the Nth-stage first feature extraction unit is connected with the first-stage second feature extraction unit; the nth-level second feature extraction unit is connected with the nth-level first feature extraction unit;
the first feature extraction unit comprises a feature down-sampling layer and a feature extraction subunit which are connected in sequence, the second feature extraction unit comprises a feature up-sampling layer and a feature extraction subunit which are connected in sequence, and the feature extraction subunit comprises an attention-based Transformer model.
9. The method of claim 8, wherein each of the Transformer models comprises a point self-attention mechanism module and a channel self-attention mechanism module, and the Transformer models obtain the feature map by aggregating a point self-attention mechanism and a channel self-attention mechanism;
wherein F^{l+1} represents the feature map of the (l+1)-th layer output by the Transformer model, obtained by aggregating F_PWSA^{l+1} and F_CWSA^{l+1}; F_PWSA^{l+1} represents the point feature map obtained by performing the point-wise multi-head self-attention operation on the l-th layer point feature map; F_CWSA^{l+1} represents the feature map obtained by performing the channel-wise self-attention operation on the l-th layer feature map.
10. The three-dimensional object segmentation method of claim 9, wherein the point self-attention mechanism model F_PWSA^{l+1} is expressed as:
F_PWSA^{l+1} = MHAT_PWSA(F^l), wherein MHAT_PWSA(F^l) denotes performing the point-wise multi-head self-attention operation on the l-th layer point feature map F^l;
Specifically, MHAT_PWSA(F^l) = Concat(A_1^{l+1}, A_2^{l+1}, …, A_M^{l+1}), with A_m^{l+1} = σ(q_m^{l+1}(k_m^{l+1})^T/√d_c)·v_m^{l+1} and q_m^{l+1} = F^l·W_m^q, k_m^{l+1} = F^l·W_m^k, v_m^{l+1} = F^l·W_m^v, where m represents the index of a point self-attention head, m = 1, 2, 3, …, M; A_m^{l+1} is the point spatial feature matrix of the m-th point self-attention head; σ represents the softmax operation; W_m^q, W_m^k, W_m^v are the learnable weight parameters of three linear layers, with d_q = d_k = d_v = d_c; C represents the feature dimension; q_m^{l+1}, k_m^{l+1}, v_m^{l+1} respectively represent the query, key and value matrices of the m-th head of the (l+1)-th layer point multi-head attention model; (·)^T represents transposition;
F_CWSA^{l+1} = MHAT_CWSA(F^l), wherein MHAT_CWSA(F^l) denotes performing the channel-wise multi-head self-attention operation on the l-th layer point feature map F^l,
wherein A_m^{l+1} = σ((q_m^{l+1})^T·k_m^{l+1}/√d_c) is the channel feature matrix; W^O is the weight matrix of the fully connected layer applied to the concatenated head outputs, with d_c = C/M'; q_m^{l+1}, k_m^{l+1}, v_m^{l+1} respectively represent the query, key and value matrices of the m-th head of the (l+1)-th layer channel multi-head attention model; (·)^T represents transposition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110560118.XA CN113159232A (en) | 2021-05-21 | 2021-05-21 | Three-dimensional target classification and segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113159232A true CN113159232A (en) | 2021-07-23 |
Family
ID=76877650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110560118.XA Pending CN113159232A (en) | 2021-05-21 | 2021-05-21 | Three-dimensional target classification and segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113159232A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190147245A1 (en) * | 2017-11-14 | 2019-05-16 | Nuro, Inc. | Three-dimensional object detection for autonomous robotic systems using image proposals |
CN111489358A (en) * | 2020-03-18 | 2020-08-04 | 华中科技大学 | Three-dimensional point cloud semantic segmentation method based on deep learning |
CN112633330A (en) * | 2020-12-06 | 2021-04-09 | 西安电子科技大学 | Point cloud segmentation method, system, medium, computer device, terminal and application |
Non-Patent Citations (2)
Title |
---|
XIAN-FENG HAN等: "Dual Transformer for Point Cloud Analysis", 《COMPUTER VISION AND PATTERN RECOGNITION》 * |
梁铎瀚: "基于3D骨骼人体行为识别算法研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113723208A (en) * | 2021-08-05 | 2021-11-30 | 北京大学 | Three-dimensional object shape classification method based on normative equal transformation conversion sub-neural network |
CN113723208B (en) * | 2021-08-05 | 2023-10-20 | 北京大学 | Three-dimensional object shape classification method based on canonical and other transformation conversion sub-neural network |
WO2023098000A1 (en) * | 2021-11-30 | 2023-06-08 | 上海商汤智能科技有限公司 | Image processing method and apparatus, defect detection method and apparatus, electronic device and storage medium |
CN114211490A (en) * | 2021-12-17 | 2022-03-22 | 中山大学 | Robot arm gripper pose prediction method based on Transformer model |
CN114211490B (en) * | 2021-12-17 | 2024-01-05 | 中山大学 | Method for predicting pose of manipulator gripper based on transducer model |
CN114550162A (en) * | 2022-02-16 | 2022-05-27 | 北京工业大学 | Three-dimensional object identification method combining view importance network and self-attention mechanism |
CN114550162B (en) * | 2022-02-16 | 2024-04-02 | 北京工业大学 | Three-dimensional object recognition method combining view importance network and self-attention mechanism |
CN116091751A (en) * | 2022-09-09 | 2023-05-09 | 锋睿领创(珠海)科技有限公司 | Point cloud classification method and device, computer equipment and storage medium |
CN116091751B (en) * | 2022-09-09 | 2023-09-05 | 锋睿领创(珠海)科技有限公司 | Point cloud classification method and device, computer equipment and storage medium |
CN116012374A (en) * | 2023-03-15 | 2023-04-25 | 译企科技(成都)有限公司 | Three-dimensional PET-CT head and neck tumor segmentation system and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A review of deep learning-based semantic segmentation for point cloud | |
CN110458939B (en) | Indoor scene modeling method based on visual angle generation | |
Riegler et al. | Octnetfusion: Learning depth fusion from data | |
CN113159232A (en) | Three-dimensional target classification and segmentation method | |
Wu et al. | 3d shapenets: A deep representation for volumetric shapes | |
CN108921926A (en) | A kind of end-to-end three-dimensional facial reconstruction method based on single image | |
CN113177555B (en) | Target processing method and device based on cross-level, cross-scale and cross-attention mechanism | |
CN111414953B (en) | Point cloud classification method and device | |
CN112990010B (en) | Point cloud data processing method and device, computer equipment and storage medium | |
US20230206603A1 (en) | High-precision point cloud completion method based on deep learning and device thereof | |
CN111753698A (en) | Multi-mode three-dimensional point cloud segmentation system and method | |
CN111695494A (en) | Three-dimensional point cloud data classification method based on multi-view convolution pooling | |
CN113345106A (en) | Three-dimensional point cloud analysis method and system based on multi-scale multi-level converter | |
Shi et al. | Gesture recognition using spatiotemporal deformable convolutional representation | |
CN110781894A (en) | Point cloud semantic segmentation method and device and electronic equipment | |
CN111652273A (en) | Deep learning-based RGB-D image classification method | |
CN113569979A (en) | Three-dimensional object point cloud classification method based on attention mechanism | |
CN113743417A (en) | Semantic segmentation method and semantic segmentation device | |
Ahmad et al. | 3D capsule networks for object classification from 3D model data | |
CN114627290A (en) | Mechanical part image segmentation algorithm based on improved DeepLabV3+ network | |
CN113988147A (en) | Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device | |
CN114299339A (en) | Three-dimensional point cloud model classification method and system based on regional correlation modeling | |
CN116452757B (en) | Human body surface reconstruction method and system under complex scene | |
CN113011506B (en) | Texture image classification method based on deep fractal spectrum network | |
CN111414802B (en) | Protein data characteristic extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210723 |