WO2020119619A1 - Network optimization structure employing 3d target classification and scene semantic segmentation - Google Patents

Network optimization structure employing 3d target classification and scene semantic segmentation

Info

Publication number
WO2020119619A1
WO2020119619A1 (PCT/CN2019/123947)
Authority
WO
WIPO (PCT)
Prior art keywords
point
layer
features
points
semantic segmentation
Prior art date
Application number
PCT/CN2019/123947
Other languages
French (fr)
Chinese (zh)
Inventor
程俊
张锲石
王胜文
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2020119619A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Definitions

  • the invention relates to the fields of robotics and reinforcement learning, and in particular to a network optimization structure based on 3D target classification and scene semantic segmentation.
  • PointNet++ is a recently proposed network structure for 3D target classification and scene semantic segmentation. Although it achieves relatively satisfactory results, two problems remain:
  • PointNet++ uses the farthest point sampling (FPS) algorithm when selecting centroid points. Although FPS covers the entire data set better than random point selection, it ignores the fact that the features of different points contribute differently to the classification and segmentation tasks. Therefore, FPS cannot guarantee that the selected set of centroid points correctly represents the main features of the object;
  • FPS: farthest point sampling
  • MSG: multi-scale grouping
  • MRG: multi-resolution grouping
  • MSG fuses multi-scale features of the same point within the same layer;
  • MRG fuses global features across different layers. Neither fusion method exploits the features that the same point carries at different levels.
  • the present invention proposes a network optimization structure based on 3D target classification and scene semantic segmentation, which not only improves the object classification performance of PointNet++ but also improves its scene segmentation performance.
  • the technical solution of the present invention for solving the above problems is a network optimization structure based on 3D target classification and scene semantic segmentation, characterized in that it includes the following steps:
  • during center point sampling, each collected point set is a subset of the previous layer's point set; consequently, the same point carries different features at each layer, so when extracting features for the next layer, the different features of the previous layers located at the same point can be fused.
  • this fusion performs fine-grained feature fusion for the specified point.
  • during training, the output of the module is Y;
  • M is the number of categories to be predicted;
  • the PS module uses two CNN layers, each with a 1x1 convolution kernel.
  • the present invention is a network optimization structure based on 3D target classification and scene semantic segmentation. It proposes a new method for selecting centroid points, scoring each point's contribution before feature extraction, so that the selected point set reflects the main characteristics of the target;
  • the multi-level point feature (MLPF) structure is proposed.
  • the MLPF method extracts features at different levels for each center point of interest and fuses them. Although MLPF also uses features from different levels, it operates on points rather than regions, and this feature extraction method is more general and can be used in other networks;
  • FIG. 1 is a schematic structural diagram of the PS module provided by an embodiment of the present invention (different point numbers represent different importance);
  • FIG. 2 is a schematic diagram of center point selection between levels and multi-level fusion of features at the same point provided by an embodiment of the present invention (where l_i denotes the features of the i-th layer).
  • a network optimization structure based on 3D target classification and scene semantic segmentation includes the following steps:
  • the PS module selects feature points with a new point-selection method.
  • the new point-selection method is based on an attention mechanism and selects the points whose features contribute more to the task, so that the selected point set better represents the entire sampled space; the schematic diagram of the PS module structure is shown in Figure 1 (different point numbers represent different importance);
  • during center point sampling, each collected point set is a subset of the previous layer's point set; consequently, the same point carries different features at each layer, so when extracting features for the next layer, the different features of the previous layers located at the same point can be fused.
  • this fusion performs fine-grained feature fusion for the specified point. The process is shown in Figure 2:
  • Figure 2 shows the selection of center points between levels and the multi-level fusion of features at the same point, where l_i denotes the features of the i-th layer.
  • each layer's feature point set is a subset of the previous layer's, and the same point contains different feature information in different layers, so these features can be fused to obtain more powerful features.
  • layer l_{i+1} contains three points: point 1, point 2 and point 3. These points are obtained through the feature selection of the first two layers.
  • in the original PointNet++, the features of the points in a layer depend only on the previous layer; earlier features are not considered.
  • in the figure, this corresponds to having only the dotted lines 2 from l_{i-1} to l_i and from l_i to l_{i+1}, without the dotted line 1 from l_{i-1} to l_{i+1}.
  • the specific process is as follows:
  • C_i represents the set of centroid points output by the i-th layer;
  • c_{n_j}^i represents the n_j-th centroid point in C_i;
  • F_i represents the feature set of the points corresponding to C_i;
  • with C_{i+1} as an index, the features of the points of C_{i+1} in the first i layers are selected and concatenated into F_fuse;
  • the output of the module during training is Y;
  • M is the number of categories to be predicted;
  • the PS module uses two CNN layers, each with a 1x1 convolution kernel.

Abstract

A network structure optimization method employing 3D target classification and scene semantic segmentation, relating to the fields of robotics and reinforcement learning. The method comprises: after acquiring the features of points, scoring each point, where a higher score indicates a larger contribution of the point to the task; and sorting the scores and selecting the top N points. In center point sampling, every acquired point set is a subset of the point set of the previous layer, and thus the same point has different features in each layer. Therefore, when feature extraction is performed on the next layer, the different features of the same point in previous layers can be combined; this combination fuses fine-grained features of a specified point. The method improves the object classification performance of PointNet++ and also improves its scene segmentation performance.

Description

A network optimization structure based on 3D target classification and scene semantic segmentation

Technical Field
The invention relates to the fields of robotics and reinforcement learning, and in particular to a network optimization structure based on 3D target classification and scene semantic segmentation.
Background Art
PointNet++ is a recently proposed network structure for 3D target classification and scene semantic segmentation. Although it achieves relatively satisfactory results, the following two problems remain:
1) PointNet++ uses the farthest point sampling (FPS) algorithm when selecting centroid points. Although FPS covers the entire data set better than random point selection, it ignores the fact that the features of different points contribute differently to the classification and segmentation tasks. Therefore, FPS cannot guarantee that the selected set of centroid points correctly represents the main features of the object (a reference sketch of FPS follows this list);
2) PointNet++ uses multi-scale grouping (MSG) and multi-resolution grouping (MRG) to handle the uneven density of point clouds, but MSG fuses multi-scale features of the same point within the same layer, and MRG fuses global features across different layers. Neither fusion method exploits the features that the same point carries at different levels.
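For reference, the following is a minimal NumPy sketch of the FPS baseline discussed in problem 1) (prior art, not part of the invention); the (N, 3) array layout and the choice of starting point are assumptions of the sketch. It makes the limitation concrete: the selection criterion is purely geometric coverage, with no notion of a point's contribution to the task.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Return indices of n_samples centroids: each new centroid is the point
    farthest from those already selected (coverage only, no task signal)."""
    n = points.shape[0]                              # points: (N, 3) coordinates
    selected = np.zeros(n_samples, dtype=np.int64)   # selected[0] = 0: arbitrary start
    dist = np.full(n, np.inf)                        # distance to nearest selected centroid
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)                   # refresh nearest-centroid distances
        selected[i] = int(np.argmax(dist))           # pick the farthest remaining point
    return selected
```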
Summary of the Invention
To solve the above problems in the background art, the present invention proposes a network optimization structure based on 3D target classification and scene semantic segmentation, which not only improves the object classification performance of PointNet++ but also improves its scene segmentation performance.
The technical solution of the present invention for solving the above problems is a network optimization structure based on 3D target classification and scene semantic segmentation, characterized in that it includes the following steps:
1) Build the PS module
1.1) Acquire the features of the points;
1.2) Score each point; the score represents the point's contribution to the task;
1.3) Sort the scores and take the top N points, where N is the number of points to be sampled;
2) MLPF feature extraction and fusion
During center point sampling, each collected point set is a subset of the previous layer's point set. Consequently, the same point carries different features at each layer, so when extracting features for the next layer, we can fuse the different features of the previous layers located at the same point; this fusion performs fine-grained feature fusion for the specified point.
Further, in step 1.2), each point is scored with a scoring function α(f_n; θ), where f_n ∈ R^d, n = 1, 2, …, N, denotes a d-dimensional feature and θ denotes the learned parameters;
When training the PS module, the output of the module is Y:
Y = …   (1)   [the full expression of equation (1) appears only as an image in the original publication]
where W represents the weight of the last output layer and M is the number of categories to be predicted;
During training, the cross-entropy loss function is used for convergence; the loss function is:
L = -[y* ln p + (1 - y*) ln(1 - p)]   (2),
where y* denotes the label and p denotes the predicted probability [the definition of p appears only as an image in the original publication].
The PS module uses two CNN layers, and the convolution kernel size of each layer is 1x1.
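As an illustration of the PS module just described, the following is a minimal PyTorch-style sketch: two CNN layers with 1x1 kernels produce the per-point score α(f_n; θ), and the top N scored points are kept. The hidden width, the ReLU between the two layers, the (B, d, N) tensor layout, and the per-point binary training head are assumptions of this sketch, not details fixed by the text (equation (1) appears only as an image).

```python
import torch
import torch.nn as nn

class PSModule(nn.Module):
    """Minimal sketch of a point-selection (PS) module: score, then take top N."""

    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        # "2-layer CNN with 1x1 kernels": pointwise convolutions over the points.
        self.score_net = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor, n_keep: int):
        # feats: (B, d, N) per-point features f_n in R^d
        scores = self.score_net(feats).squeeze(1)            # (B, N), alpha(f_n; theta)
        top = torch.topk(scores, k=n_keep, dim=1).indices    # indices of the top-N points
        idx = top.unsqueeze(1).expand(-1, feats.size(1), -1)
        selected = torch.gather(feats, 2, idx)               # (B, d, n_keep) kept features
        return selected, scores

# During training, the per-point scores could be supervised with a cross-entropy
# objective in the spirit of equation (2), e.g. torch.nn.BCEWithLogitsLoss on
# keep/discard labels; the exact head used in the patent is not reproduced here.
```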
Advantages of the present invention:
1) The present invention, a network optimization structure based on 3D target classification and scene semantic segmentation, proposes a new method for selecting centroid points that scores each point's contribution before feature extraction, so that the selected point set reflects the main characteristics of the target;
2) The multi-level point feature (MLPF) structure is proposed. The MLPF method extracts features at different levels for each center point of interest and fuses them. Although MLPF also uses features from different levels, it operates on points rather than regions. Moreover, this feature extraction method is more general and can be used in other networks;
3) In addition, a new feature fusion method is proposed, so that finer-grained features can be extracted. Furthermore, these two structures are not only applicable to PointNet++ but can also be applied to other network structures, improving the overall performance of the network and effectively preventing overfitting. Our structures therefore have important practical and reference value for scene target classification and scene semantic segmentation.
Brief Description of the Drawings
Figure 1 is a schematic structural diagram of the PS module provided by an embodiment of the present invention (different point numbers represent different importance);
Figure 2 is a schematic diagram of center point selection between levels and multi-level fusion of features at the same point provided by an embodiment of the present invention (where l_i denotes the features of the i-th layer).
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. The following detailed description of the embodiments provided in the accompanying drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
A network optimization structure based on 3D target classification and scene semantic segmentation includes the following steps:
1) Build the PS module. The PS module selects feature points with a new point-selection method, which is based on an attention mechanism and selects the points whose features contribute more to the task, so that the selected point set better represents the entire sampled space. The PS module structure is shown in Figure 1 (different point numbers represent different importance);
1.1) Acquire the features of the points;
1.2) Score each point; the score represents the point's contribution to the task;
1.3) Sort the scores and take the top N points, where N is the number of points to be sampled. For the same number of sampled points, the point set obtained by this method is more representative, and its features are more distinctive, than the point set selected by the FPS algorithm.
2) MLPF feature extraction and fusion
During center point sampling, each collected point set is a subset of the previous layer's point set. Consequently, the same point carries different features at each layer, so when extracting features for the next layer, we can fuse the different features of the previous layers located at the same point; this fusion performs fine-grained feature fusion for the specified point. The process is shown in Figure 2:
Figure 2 shows the selection of center points between levels and the multi-level fusion of features at the same point, where l_i denotes the features of the i-th layer.
As can be seen from Figure 2, each layer's feature point set is a subset of the previous layer's, and the same point contains different feature information in different layers, so these features can be fused to obtain more powerful features. For example, layer l_{i+1} contains three points: point 1, point 2 and point 3, obtained through the feature selection of the first two layers. In the original PointNet++, the features of the points in a layer depend only on the previous layer; earlier features are not considered. In the figure, this corresponds to having only the dotted lines 2 from l_{i-1} to l_i and from l_i to l_{i+1}, without the dotted line 1 from l_{i-1} to l_{i+1}. Through such multi-level fusion of the same point's features we achieve fine-grained feature fusion, and the resulting features carry richer information. The specific process is as follows:
C_i = {c_1^i, c_2^i, …, c_{N_i}^i},  F_i = {f_1^i, f_2^i, …, f_{N_i}^i}
where C_i denotes the set of centroid points output by the i-th layer, c_{n_j}^i denotes the n_j-th centroid point in C_i, F_i denotes the feature set of the corresponding points in C_i, and f_{n_j}^i is the feature of point c_{n_j}^i.
When extracting features at layer i+1, the (i+1)-th centroid point set C_{i+1} (C_{i+1} ⊆ C_k, where k = 1, 2, …, i) is selected first. After obtaining C_{i+1}, we use C_{i+1} as an index to select the features that the points of C_{i+1} carry in the first i layers and concatenate them into F_fuse:
F_fuse = [F_{C_{i+1}}^1, F_{C_{i+1}}^2, …, F_{C_{i+1}}^i]
where F_{C_{i+1}}^i denotes the features in the i-th layer of the points in C_{i+1}. The final input to layer i+1 is therefore {C_{i+1}, F_fuse}, whereas the input in the original network is only {C_{i+1}, F_{C_{i+1}}^i}.
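The following is a minimal sketch of this fusion step, assuming each layer k stores its feature matrix F_k together with the global ids of its points; the helper name mlpf_fuse and this data layout are illustrative assumptions, not the patented implementation. For every point of C_{i+1}, which by construction belongs to each of the first i layers, the sketch gathers that point's feature from layers 1 through i and concatenates them into F_fuse, so that layer i+1 can consume {C_{i+1}, F_fuse} in place of the original {C_{i+1}, F^i_{C_{i+1}}}.

```python
import torch

def mlpf_fuse(layer_feats, layer_ids, centroid_ids):
    """
    layer_feats:  list of tensors F_k with shape (N_k, d_k), k = 1..i
    layer_ids:    list of 1-D LongTensors holding the global point id of each row of F_k
    centroid_ids: 1-D LongTensor of global ids of the points in C_{i+1}
    returns:      F_fuse with shape (len(centroid_ids), d_1 + ... + d_i)
    """
    fused = []
    for feats, ids in zip(layer_feats, layer_ids):
        # Row position of each centroid id inside this layer's id list.
        pos = torch.stack([(ids == c).nonzero(as_tuple=True)[0][0]
                           for c in centroid_ids])
        fused.append(feats[pos])        # features of the C_{i+1} points at layer k
    return torch.cat(fused, dim=1)      # channel-wise concatenation = F_fuse
```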
Further, in step 1.2), each point is scored with a scoring function α(f_n; θ), where f_n ∈ R^d, n = 1, 2, …, N, denotes a d-dimensional feature and θ denotes the learned parameters;
When training the PS module, the output of the module is Y:
Y = …   (1)   [the full expression of equation (1) appears only as an image in the original publication]
where W represents the weight of the last output layer and M is the number of categories to be predicted;
During training, the cross-entropy loss function is used for convergence; the loss function is:
L = -[y* ln p + (1 - y*) ln(1 - p)]   (2)
where y* denotes the label and p denotes the predicted probability [the definition of p appears only as an image in the original publication].
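As a quick numeric illustration of equation (2), assuming p is the predicted keep-probability of a point and y* its 0/1 label:

```python
import math

def ce_loss(y_star: float, p: float) -> float:
    # Equation (2): binary cross-entropy between label y* and probability p.
    return -(y_star * math.log(p) + (1.0 - y_star) * math.log(1.0 - p))

print(ce_loss(1.0, 0.9))  # ~0.105: confident and correct, small loss
print(ce_loss(1.0, 0.1))  # ~2.303: confident and wrong, large loss
```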
The PS module uses two CNN layers, and the convolution kernel size of each layer is 1x1.
We conducted experiments on the ModelNet40 and ScanNet data sets and compared our results with other state-of-the-art methods. The results are shown in Table 1 and Table 2 and verify that the present invention outperforms the other methods.
Table 1: Object classification results on the ModelNet40 dataset

Method              Mean loss   Accuracy (%)   Avg. Acc (%)
Subvolume           -           89.2           86.0
MVCNN               -           90.1           -
PointNet            0.491       89.2           86.2
PointNet++ (SSG)    0.445       90.2           87.9
Ours (PS)           0.386       90.6           88.1
Ours (MLPF)         0.342       91.1           87.8
Table 2: Scene semantic segmentation results on the ScanNet dataset

Method              Accuracy (%)
3DCNN               73.0
PointNet            73.9
PointNet++ (SSG)    83.3
Ours (MLPF)         85.1
The above are only embodiments of the present invention and are not intended to limit its protection scope. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related system fields, is likewise included within the protection scope of the present invention.

Claims (2)

  1. A network optimization structure based on 3D target classification and scene semantic segmentation, characterized in that it includes the following steps:
    1) Build the PS module
    1.1) Acquire the features of the points;
    1.2) Score each point; the score represents the point's contribution to the task;
    1.3) Sort the scores and take the top N points, where N is the number of points to be sampled;
    2) MLPF feature extraction and fusion
    During center point sampling, each collected point set is a subset of the previous layer's point set. Consequently, the same point carries different features at each layer, so when extracting features for the next layer, the different features of the previous layers located at the same point can be fused; this fusion performs fine-grained feature fusion for the specified point.
  2. The network optimization structure based on 3D target classification and scene semantic segmentation according to claim 1, characterized in that:
    in step 1.2), each point is scored with a scoring function α(f_n; θ), where f_n ∈ R^d, n = 1, 2, …, N, denotes a d-dimensional feature and θ denotes the learned parameters;
    when training the PS module, the output of the module is Y:
    Y = …   (1)   [the full expression of equation (1) appears only as an image in the original publication]
    where W represents the weight of the last output layer and M is the number of categories to be predicted;
    during training, the cross-entropy loss function is used for convergence; the loss function is:
    L = -[y* ln p + (1 - y*) ln(1 - p)]   (2),
    where y* denotes the label and p denotes the predicted probability [the definition of p appears only as an image in the original publication];
    the PS module uses two CNN layers, and the convolution kernel size of each layer is 1x1.
PCT/CN2019/123947 2018-12-14 2019-12-09 Network optimization structure employing 3d target classification and scene semantic segmentation WO2020119619A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811535562.0A CN109753995B (en) 2018-12-14 2018-12-14 Optimization method of 3D point cloud target classification and semantic segmentation network based on PointNet++
CN201811535562.0 2018-12-14

Publications (1)

Publication Number Publication Date
WO2020119619A1 true WO2020119619A1 (en) 2020-06-18

Family

ID=66403851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/123947 WO2020119619A1 (en) 2018-12-14 2019-12-09 Network optimization structure employing 3d target classification and scene semantic segmentation

Country Status (2)

Country Link
CN (1) CN109753995B (en)
WO (1) WO2020119619A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257597A (en) * 2020-10-22 2021-01-22 中国人民解放军战略支援部队信息工程大学 Semantic segmentation method of point cloud data
CN114241110A (en) * 2022-02-23 2022-03-25 北京邮电大学 Point cloud semantic uncertainty sensing method based on neighborhood aggregation Monte Carlo inactivation

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753995B (en) 2018-12-14 2021-01-01 中国科学院深圳先进技术研究院 Optimization method of 3D point cloud target classification and semantic segmentation network based on PointNet++
CN110210431B (en) * 2019-06-06 2021-05-11 上海黑塞智能科技有限公司 Point cloud semantic labeling and optimization-based point cloud classification method
CN110245709B (en) * 2019-06-18 2021-09-03 西安电子科技大学 3D point cloud data semantic segmentation method based on deep learning and self-attention
CN110837811B (en) * 2019-11-12 2021-01-05 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
CN112085123B (en) * 2020-09-25 2022-04-12 北方民族大学 Point cloud data classification and segmentation method based on salient point sampling
CN112818999B (en) * 2021-02-10 2022-10-28 桂林电子科技大学 Complex scene 3D point cloud semantic segmentation method based on convolutional neural network
US11295170B1 (en) 2021-08-17 2022-04-05 FPT USA Corp. Group-equivariant convolutional neural networks for 3D point clouds

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345887A (en) * 2018-01-29 2018-07-31 清华大学深圳研究生院 The training method and image, semantic dividing method of image, semantic parted pattern
CN108509949A (en) * 2018-02-05 2018-09-07 杭州电子科技大学 Object detection method based on attention map
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 Multi-scale target detection method based on deep convolutional neural networks
CN109753995A (en) * 2018-12-14 2019-05-14 中国科学院深圳先进技术研究院 Network optimization structure based on 3D target classification and scene semantic segmentation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372111B (en) * 2016-08-22 2021-10-15 中国科学院计算技术研究所 Local feature point screening method and system
CN106815604B (en) * 2017-01-16 2019-09-27 大连理工大学 Gaze point detection method based on fusion of multi-layer information
CN108596924A (en) * 2018-05-17 2018-09-28 南方医科大学 MR prostate image segmentation method based on distance field fusion and ellipsoid prior

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 Multi-scale target detection method based on deep convolutional neural networks
CN108345887A (en) * 2018-01-29 2018-07-31 清华大学深圳研究生院 The training method and image, semantic dividing method of image, semantic parted pattern
CN108509949A (en) * 2018-02-05 2018-09-07 杭州电子科技大学 Object detection method based on attention map
CN109753995A (en) * 2018-12-14 2019-05-14 中国科学院深圳先进技术研究院 Network optimization structure based on 3D target classification and scene semantic segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI, CHARLES R. ET AL.: "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space", 31st Conference on Neural Information Processing Systems, 31 December 2017 (2017-12-31), XP055713540 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257597A (en) * 2020-10-22 2021-01-22 中国人民解放军战略支援部队信息工程大学 Semantic segmentation method of point cloud data
CN112257597B (en) * 2020-10-22 2024-03-15 中国人民解放军战略支援部队信息工程大学 Semantic segmentation method for point cloud data
CN114241110A (en) * 2022-02-23 2022-03-25 北京邮电大学 Point cloud semantic uncertainty sensing method based on neighborhood aggregation Monte Carlo inactivation
CN114241110B (en) * 2022-02-23 2022-06-03 北京邮电大学 Point cloud semantic uncertainty sensing method based on neighborhood aggregation Monte Carlo inactivation

Also Published As

Publication number Publication date
CN109753995A (en) 2019-05-14
CN109753995B (en) 2021-01-01

Similar Documents

Publication Publication Date Title
WO2020119619A1 (en) Network optimization structure employing 3d target classification and scene semantic segmentation
CN108596258B (en) Image classification method based on convolutional neural network random pooling
JP6440303B2 (en) Object recognition device, object recognition method, and program
CN109583340B (en) Video target detection method based on deep learning
CN111354017A (en) Target tracking method based on twin neural network and parallel attention module
WO2022042123A1 (en) Image recognition model generation method and apparatus, computer device and storage medium
KR101443187B1 (en) medical image retrieval method based on image clustering
CN110210538B (en) Household image multi-target identification method and device
CN109492776B (en) Microblog popularity prediction method based on active learning
CN111062278B (en) Abnormal behavior identification method based on improved residual error network
WO2013053320A1 (en) Image retrieval method and device
CN108664526B (en) Retrieval method and device
CN111860587B (en) Detection method for small targets of pictures
CN111984817B (en) Fine-grained image retrieval method based on self-attention mechanism weighting
Jboor et al. Towards an inpainting framework for visual cultural heritage
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
WO2021012793A1 (en) Lawyer recommendation method based on big data analysis, and related device
CN113408605A (en) Hyperspectral image semi-supervised classification method based on small sample learning
Wei et al. Region ranking SVM for image classification
US20230089335A1 (en) Training method for robust neural network based on feature matching
CN107977670A (en) Accident classification stage division, the apparatus and system of decision tree and bayesian algorithm
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN112800982A (en) Target detection method based on remote sensing scene classification
CN111723852A (en) Robust training method for target detection network
Gao et al. SHREC’15 Track: 3D object retrieval with multimodal views

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19896158

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/11/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19896158

Country of ref document: EP

Kind code of ref document: A1