CN109063139B - Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN - Google Patents

Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN

Info

Publication number
CN109063139B
CN109063139B (application CN201810879211.5A)
Authority
CN
China
Prior art keywords
model
dimensional
scale
panorama
view
Prior art date
Legal status
Active
Application number
CN201810879211.5A
Other languages
Chinese (zh)
Other versions
CN109063139A (en)
Inventor
梁祺 (Qi Liang)
聂为之 (Weizhi Nie)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201810879211.5A priority Critical patent/CN109063139B/en
Publication of CN109063139A publication Critical patent/CN109063139A/en
Application granted granted Critical
Publication of CN109063139B publication Critical patent/CN109063139B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional model view extraction method based on a panorama and a multi-channel CNN, which comprises the following steps: projecting the 3D model onto the side surface of a cylinder satisfying a preset condition, with the origin of the 3D model as the center and the axis of the 3D model parallel to one of the X, Y, Z principal axes, to obtain an initial panorama; sampling the angle φ of the 3D model surface in three-dimensional space and the y coordinate at preset rates, respectively, to obtain two values for each point in the initial panorama, representing the position characteristics of the 3D model surface in three-dimensional space and the orientation characteristics of the 3D model surface; and constructing a multi-scale network and a multi-channel convolutional neural network, taking the position characteristics of the 3D model surface and the orientation characteristics of the 3D model surface as input, and carrying out network training and similarity measurement between two different 3D models. The invention preserves the local and global structural and visual information of the three-dimensional model and automatically computes features of the 2D panoramic view for handling classification and retrieval problems.

Description

Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN
Technical Field
The invention relates to the field of three-dimensional model classification and retrieval, in particular to a three-dimensional model classification and retrieval method based on a panorama and a multi-channel CNN.
Background
With the development of computer vision technology, 3D technology is widely used in the film and television industry, mechanical design, construction, infrastructure, entertainment, medical treatment, and other fields. More and more people are uploading self-designed 3D models to websites, and the number of 3D models keeps growing. This has made 3D model retrieval a hot topic in the field of computer vision. Unlike the conventional visual representation of two-dimensional image information, a three-dimensional model carries not only visual information but also structural information. Conventional computer vision techniques are therefore difficult to apply directly to representing 3D models. In recent years, many approaches have been proposed to address the problem of 3D model representation.
In general, 3D model retrieval methods are largely divided into two categories: model-based methods and view-based methods [1].
Early methods generally belonged to the model-based category, which requires explicit 3D model data for retrieval. Popular model-based methods typically utilize geometric moments [2], surface distributions [3], or three-dimensional shape descriptors [4] to describe the shape of a model. However, the extraction of structural information is computationally expensive, and performance is highly limited by the sampled structure points. Therefore, the practical application of model-based approaches is severely limited.
View-based approaches have attracted more attention in recent years because they represent a 3D model with a set of 2D images. Many sophisticated computer vision techniques can be used directly to process such representations of 3D models, and many classical approaches have been proposed [5][6]. However, the biggest problem with the view-based approach is that it ignores the structural and spatial information of the three-dimensional model.
In recent years, with the development of deep learning, many researchers have begun to utilize classical deep learning methods to address the three-dimensional model retrieval problem, and a number of influential approaches have been proposed. Maturana et al. [7] proposed a novel three-dimensional convolutional neural network based on the classical CNN architecture, which can extract effective feature vectors from structural information. Su et al. [8] proposed a novel CNN network (MVCNN) to process multi-view-based 3D model representations; during network processing, it fuses multi-view information to provide robust features. Kanezaki et al. [9] proposed an improved CNN network to handle the three-dimensional model classification and retrieval problem; the model is designed to use only a partial set of multi-view images for inference and feature learning. Charles et al. [10] presented a novel neural network (PointNet) that directly consumes point clouds; however, this method is only applicable to point-cloud data, which limits its range of application. Wu et al. [11] trained a deep belief network on shapes discretized into a 30³ voxel grid for object classification, shape completion and next-best-view prediction. Sedaghat et al. [12] introduced an auxiliary orientation loss, which improves classification performance compared with the original VoxNet [7]. In general, all of these methods focus on either structural or visual information while ignoring the other, which affects classification and retrieval accuracy.
Disclosure of Invention
The invention provides a three-dimensional model classification and retrieval method based on a panorama and a multi-channel CNN. The invention preserves the local and global structural and visual information of the three-dimensional model and automatically computes features of a 2D panorama for handling classification and retrieval problems, as described in detail below:
a method for extracting a three-dimensional model view based on a panorama and a multi-channel CNN, comprising the following steps:
projecting the 3D model onto the side surface of a cylinder meeting a preset condition, with the origin of the 3D model as the center and the axis of the 3D model parallel to one of the X, Y, Z principal axes, to obtain an initial panorama;
taking an angle φ in a plane formed by any two coordinate axes; sampling the angle φ of the 3D model surface in three-dimensional space and the y coordinate at preset rates, respectively, to obtain a pair of values s(φ, y) for each point in the initial panorama, to represent the position characteristics of the 3D model surface in three-dimensional space and the orientation characteristics of the 3D model surface; and constructing a multi-scale network and a multi-channel convolutional neural network, taking the position characteristics of the 3D model surface and the orientation characteristics of the 3D model surface as input, and carrying out network training and similarity measurement between two different 3D models.
Further, the preset condition is as follows: the height of the cylinder is 2 times the radius of the bottom surface. The preset rates are as follows: the angle φ and the y coordinate are sampled at rates 2B and B, respectively.
Wherein the multi-scale network comprises: extracting view descriptors of different resolutions of the same input picture respectively, wherein the size of the input picture is 256 × 256;
for the first scale, the size is 256 × 256, feature mapping is obtained through the convolution layer of VGG16, and 4096-dimensional feature mapping is obtained through normalization processing;
for the second scale, converting the scale of the input picture into 128 × 128, performing down-sampling, obtaining the feature mapping of the low-resolution picture through the convolution layer, and obtaining 3072-dimensional feature mapping through the maximum pooling layer and normalization processing;
for the third scale, converting the scale of the input picture into 64 x 64, performing down-sampling, obtaining the feature mapping of the low-resolution picture through the convolution layer, and obtaining 3072-dimensional feature mapping through the maximum pooling layer and normalization processing;
performing linear fusion on the outputs of the three scales to obtain a 4096-dimensional feature map, then obtaining a view descriptor through a fully connected layer, and obtaining a classification result vector through a dropout layer and a softmax layer;
finally, the softmax layer outputs the class probability given the input 3D model; the class with the highest probability is considered the predicted class of the 3D model, trained using a stochastic gradient descent method with momentum set to 0.9.
The multi-channel convolutional neural network comprises 6 channels,
branch channels are created and segmented according to the 3 axes of the panoramic view; for the classification task, the probability vector is calculated by taking the mean value of all three individual probability vectors;
each 3D model has 6 descriptors, three of which are spatial distribution descriptors on XYZ axes and are used for describing the position characteristics of the surface of the 3D model;
the other three are normal vector distribution descriptors on XYZ axes, namely the direction characteristics of the 3D model surface;
each 3D model descriptor is compared to the remaining 3D model descriptors using an L1 distance metric for these 6 descriptors.
The technical scheme provided by the invention has the beneficial effects that:
1. according to the invention, each 3D model is represented by using the multi-resolution panoramic view, so that the structure of the 3D model can be effectively represented;
2. the invention provides a novel multi-channel CNN network for extracting visual feature vectors of panoramic views; convolution kernels of different scales are applied in the multi-channel CNN network, so that local and global information of the panoramic views can be preserved and the robustness of the feature vectors improved.
Drawings
FIG. 1 is a flow chart of a method for classification and retrieval of three-dimensional models based on panoramas and multichannel CNNs;
FIG. 2 is a schematic representation of a 3D model and SDM (space distribution map), NDM (normal deviation map) images on three axes;
FIG. 3 is a schematic diagram of a Multi-Scale-NN (Multi-Scale neural network) architecture;
fig. 4 is a schematic diagram of a Multi-Channel-NN (Multi-Channel neural network) architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
To solve the above problem, a multi-resolution panoramic view needs to be extracted from each 3D model. Research has shown that the panoramic view of a 3D model can convert the structural information of the 3D model into 2D image information [13]. The embodiment of the invention provides a three-dimensional model view extraction method based on a panorama and a multi-channel CNN, described in detail below with reference to fig. 1 and fig. 2:
101: obtaining an initial panorama by projecting the 3D model surface in fig. 2 onto the side of a cylinder of radius R and height H = 2R, centered at the 3D model origin in fig. 2, with the axis of the 3D model parallel to one of the X, Y, Z principal axes of space;
wherein the value of R is set to 3 × d_max, and d_max is the maximum distance of the 3D model surface from the centroid.
102: assuming that the Z-axis panorama is extracted, the embodiment of the invention uses a set of points s(φ, y) to parameterize the initial panorama, where φ is the angle in the xy-plane; φ and the y coordinate are sampled at rates 2B and B, respectively. In this embodiment, B = 32, 64, 128 is set, which means that each axis needs to be sampled three times. Each point s(φ, y) in the initial panorama then takes values representing two different features of the 3D model surface, respectively:
(1) the position of the model surface in three-dimensional space (called the spatial distribution map, or SDM);
(2) the orientation of the model surface (called the normal deviation map, or NDM).
Thus, for each axis of each 3D model, panoramas of 6 different scales and different values can be obtained, as shown in fig. 2. The left side of fig. 2 depicts the spatial distribution maps, obtained centering on the three axes respectively, representing the position of the model surface in three-dimensional space; the right side depicts the normal deviation maps, obtained centering on the three axes respectively, representing the orientation of the model surface.
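The sampling in steps 101 and 102 can be illustrated with a short sketch. The following Python code is a minimal illustration rather than the patented implementation: it assumes a hypothetical helper cast_ray(origin, direction) that returns the intersection distance and surface normal of the 3D model for a ray (or (None, None) on a miss), and the exact per-cell SDM/NDM values are simplified assumptions.

```python
# Minimal sketch of building the SDM and NDM panoramas for one axis.
# `cast_ray` is a hypothetical ray-casting helper, not part of the patent;
# the value stored per cell is a simplified assumption.
import numpy as np

def extract_panoramas(cast_ray, d_max, B=64):
    R = 3.0 * d_max                 # cylinder radius: 3 x max distance from centroid
    H = 2.0 * R                     # preset condition: height = 2 x base radius
    sdm = np.zeros((B, 2 * B))      # spatial distribution map (position feature)
    ndm = np.zeros((B, 2 * B))      # normal deviation map (orientation feature)
    for j in range(2 * B):          # angle phi sampled at rate 2B
        phi = 2.0 * np.pi * j / (2 * B)
        for i in range(B):          # y coordinate sampled at rate B
            y = -R + H * i / B
            origin = np.array([R * np.cos(phi), R * np.sin(phi), y])
            direction = np.array([-np.cos(phi), -np.sin(phi), 0.0])  # toward the axis
            dist, normal = cast_ray(origin, direction)
            if dist is not None:
                sdm[i, j] = R - dist                               # position of the surface point
                ndm[i, j] = abs(float(np.dot(normal, direction)))  # deviation of the normal
    return sdm, ndm

# Repeating this for B = 32, 64, 128 and for the X, Y, Z axes yields the
# multi-resolution panoramas described above.
```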
In conclusion, the embodiment of the invention avoids the loss of the structure and space information of the three-dimensional model caused by the traditional method, thereby improving the classification and retrieval accuracy.
Example 2
The training of the multi-channel and multi-scale CNN networks and the similarity measure between two different 3D models are described in detail below with reference to the specific network structures, calculation formulas, fig. 3, and fig. 4:
Fig. 3 shows the multi-scale network, which includes three scales and extracts view descriptors at different resolutions from the same input picture. Assume the size of the input picture is 256 × 256. For the first scale, the input keeps the original size of 256 × 256 and passes directly through the VGG16 convolutional neural network to obtain a feature map, and a 4096-dimensional feature map is then obtained through normalization processing;
for the second scale, the input picture is converted to 128 × 128 by down-sampling, so that the generated picture resolution is 1/2 of the original; the feature map of the low-resolution picture is then obtained through the convolutional neural network, processed by a maximum pooling layer, and normalized to obtain a 3072-dimensional feature map;
for the third scale, the input picture is converted to 64 × 64 by down-sampling, so that the generated picture resolution is 1/4 of the original; the feature map of the low-resolution picture is then obtained through the convolutional neural network, processed by a maximum pooling layer, and normalized to obtain a 3072-dimensional feature map;
performing linear fusion on the outputs of the three scales to obtain a 4096-dimensional feature map, then obtaining a view descriptor through a fully connected layer, and obtaining a classification result vector through a dropout layer and a softmax layer;
finally, the softmax layer outputs the class probability given the input 3D model, the class with the highest probability is considered the prediction class of the 3D model, and the network is trained using a stochastic gradient descent method with momentum set to 0.9.
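As a concrete illustration, the following PyTorch sketch follows the three-scale structure described above. It is only a sketch under stated assumptions: sharing one VGG16 backbone across the scales, the pooled feature sizes, and the linear-fusion layer are choices not fixed by the text; training would use torch.optim.SGD with momentum=0.9, as stated.

```python
# Minimal PyTorch sketch of the Multi-Scale-NN; layer sizes and the shared
# VGG16 backbone are assumptions, only the three-scale layout follows the text.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiScaleNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights=None).features  # VGG16 conv layers
        self.pool = nn.AdaptiveMaxPool2d((2, 2))        # max pooling for the low-res scales
        self.fc1 = nn.Linear(512 * 8 * 8, 4096)         # scale 1: 256x256 input
        self.fc2 = nn.Linear(512 * 2 * 2, 3072)         # scale 2: 128x128 input
        self.fc3 = nn.Linear(512 * 2 * 2, 3072)         # scale 3: 64x64 input
        self.fuse = nn.Linear(4096 + 3072 + 3072, 4096) # linear fusion -> view descriptor
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(4096, num_classes)  # followed by softmax at inference

    def forward(self, x):                               # x: (N, 3, 256, 256)
        f1 = F.normalize(self.fc1(self.backbone(x).flatten(1)))
        f2 = F.normalize(self.fc2(self.pool(self.backbone(F.interpolate(x, size=128))).flatten(1)))
        f3 = F.normalize(self.fc3(self.pool(self.backbone(F.interpolate(x, size=64))).flatten(1)))
        descriptor = self.fuse(torch.cat([f1, f2, f3], dim=1))   # 4096-d view descriptor
        logits = self.classifier(self.dropout(descriptor))
        return descriptor, logits

# Training, per the text: torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```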
Fig. 4 shows a multi-channel convolutional neural network, which includes 6 channels, and extracts 6 different panoramas of a three-dimensional model as input, where the initial resolution of an input picture of each channel is 128 × 256, then the input picture of each channel passes through a multi-scale network to obtain vectors of predicted classification results, and then the vectors are weighted and averaged to obtain a final classification result vector. The network aims to create a branching channel that is segmented according to the 3-axis of the panoramic view. For the classification task, the probability vector is calculated by taking the mean of all three individual probability vectors. The descriptors of the retrieval task consist of the activation of the last fully-connected layer of the convolutional neural network.
Therefore, each 3D model has 6 descriptors, three of which are spatial distribution descriptors on the X, Y, Z axes, used to describe the position features of the 3D model surface; the other three are normal vector distribution descriptors on the X, Y, Z axes, namely the orientation features of the 3D model surface. Each 3D model descriptor is compared with the remaining 3D model descriptors using an L1 distance metric (see the equation below). The L1 distance is used due to its linearity, which emphasizes the differences between the components of the descriptor vectors.
D(Q, M) = Σ_i Σ_j || f(Q_i) − f(M_j) ||_1
Where Q and M represent each 3D model. i and j are indices of the panoramic view. f is the feature vector of the panoramic view extracted by the multi-scale convolutional network, as shown in fig. 3. According to the distance between Q and M, the similarity between two different models can be easily obtained, and the 3D model retrieval task can be processed.
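The distance computation itself reduces to a few lines. The following sketch implements the summed L1 comparison of the equation above, assuming the six per-view feature vectors of each model are stacked into an array.

```python
# Minimal sketch of the L1 similarity measure D(Q, M) between two 3D models.
import numpy as np

def model_distance(f_q, f_m):
    """f_q, f_m: arrays of shape (6, D), one feature vector per panoramic view."""
    return sum(np.abs(fq - fm).sum()      # L1 distance between view descriptors
               for fq in f_q for fm in f_m)
```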
In summary, the embodiments of the present invention avoid the influence of the loss of structural and spatial information on the classification and retrieval of the three-dimensional model caused by representing a 3D model by a group of 2D images, and improve the accuracy of the classification and retrieval.
Example 3
The schemes of embodiments 1 and 2 are verified below with reference to the specific dataset, table 1 and table 2, and described in detail below:
the dataset used to evaluate the proposed classification method is the primston model net large 3d cad model dataset. ModelNet consists of 127,915 CAD models, grouped into 662 object classes, into two subsets, ModelNet-10 and ModelNet-40, both of which contain training and testing partitions.
1) ModelNet10 consists of 4899 CAD models, classified into 10 classes. For convenience of processing, the models are adjusted by placing the center of mass of the model at the origin of the coordinates and normalizing in terms of translation and rotation.
The training and testing subsets of the ModelNet10 are composed of 3991 and 908 models, respectively.
2) ModelNet-40 contains 12,311 CAD models, divided into 40 categories. For ease of handling, these models are adjusted to place the centroid of the model at the origin of the coordinates, but are not normalized.
The training and testing subsets of ModelNet-40 are composed of 9843 and 2468 models, respectively.
The MSMC-NN proposed in the embodiment of the invention was evaluated on the classification task of the test subsets of ModelNet-10 and ModelNet-40. Performance is measured by the average binary classification accuracy (a value of 1 corresponds to the case where the class of the test 3D model is correctly predicted, otherwise 0). The comparison baselines were the original Light Field descriptor [15] (LFD, 4700 dimensions) and Spherical Harmonics [16] (SPH, 544 dimensions), which do not use machine learning and serve to set the evaluation baseline. Also included are methods that use machine learning: 3D ShapeNets [17] (V), the DeepPano descriptor [18], and the Geometry Image descriptor [19]. In addition to the above competing methods, the comparison was extended to include the following techniques: GIFT [20], ORION (V), Set-convolution [21], 3D-GAN [22] (V), VoxNet [7] (V), the PointNet method of Garcia-Garcia et al. [23] (PointNet-Garcia), and Xu and Todorovic [24] (V). The scores of the above competing methods are those reported by the authors in the respective papers. Table 1 summarizes the scores of the above methods and the corresponding experimental results.
Table 1. Classification accuracy on the ModelNet-10 and ModelNet-40 datasets. (V) indicates that the method uses a voxel representation; (NONML) indicates that machine learning is not involved.
[Table 1 data provided as an image in the original document.]
From Table 1, it can be seen that the proposed method outperforms all of the above methods on both the challenging ModelNet-40 dataset and ModelNet-10. It is clear that methods using voxel representations generally perform better than methods using image representations, which can be explained by the richer information contained in 3D volumetric data compared with 2D representations. However, despite using an image representation, the proposed method outperforms the previous methods. Meanwhile, when only the MC-NN is applied to handle the 3D classification problem, the corresponding results also indicate that MSMC-NN gives better results than MC-NN.
Another evaluation of the proposed method was performed on the task of 3D model retrieval. The performance of the method was measured on the ModelNet-10 and ModelNet-40 datasets, compared with the methods that provide retrieval results and the GIFT method. On the ModelNet datasets, retrieval accuracy is measured by the mean average precision (mAP). The comparisons were made with the original Light Field descriptor (LFD, 4700 dimensions) and Spherical Harmonics (SPH, 544 dimensions), which do not use machine learning and set the evaluation baseline. Also included are 3D ShapeNets, the DeepPano descriptor, and the Geometry Image descriptor, which use machine learning methods. The scores of the above competing methods are those reported by the authors in the respective papers. For the ModelNet datasets, table 2 shows the results of the retrieval experiment, where the proposed method outperforms the above competing methods.
Table 2. Mean average precision (mAP) on the ModelNet-10 and ModelNet-40 datasets. (NONML) indicates that machine learning is not involved.
[Table 2 data provided as an image in the original document.]
In summary, the embodiments of the present invention show that adding the multi-scale panoramic view helps to improve performance. Besides being a good shape descriptor, the panorama also bridges the gap between the initial 3D model representation and the 2D input that is generally better suited to convolutional neural networks. The related experiments also demonstrate the effectiveness of the panoramic view, improving classification and retrieval accuracy and showing the effectiveness of the design.
Reference documents:
[1] Anan Liu, Zhongyang Wang, Weizhi Nie, and Yuting Su. Graph-based characteristic view set extraction and matching for 3D model retrieval. Information Sciences, 320:429–442, 2015.
[2] Luren Yang and Fritz Albregtsen. Fast and exact computation of cartesian geometric moments using discrete Green's theorem. Pattern Recognition, 29(7):1061–1073, 1996.
[3] Ke Lu, Qian Wang, Jian Xue, and Weiguo Pan. 3D model retrieval and classification by semi-supervised learning with content-based similarity. Information Sciences, 281:703–713, 2014.
[4] Przemyslaw Polewski, W. Yao, Marco Heurich, Peter Krzystek, and U. Stilla. Detection of fallen trees in ALS point clouds of a temperate forest by combining point/primitive-level shape descriptors. Gemeinsame Tagung, 2014.
[5] Wei-Zhi Nie, An-An Liu, and Yu-Ting Su. 3D object retrieval based on sparse coding in weak supervision. Journal of Visual Communication and Image Representation, 37:40–45, 2016.
[6] Biao Leng, Xiangyang Zhang, Ming Yao, and Zhang Xiong. A 3D model recognition mechanism based on deep Boltzmann machines. Neurocomputing, 151:593–602, 2015.
[7] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[8] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. pages 945–953, 2016.
[9] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. 2016.
[10] R. Qi Charles, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 77–85, 2017.
[11] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. pages 1912–1920, 2014.
[12] Nima Sedaghat, Mohammadreza Zolfaghari, Ehsan Amiri, and Thomas Brox. Orientation-boosted voxel nets for 3D object recognition. arXiv preprint arXiv:1604.03351, 2016.
[13] Panagiotis Papadakis, Ioannis Pratikakis, Theoharis Theoharis, and Stavros Perantonis. PANORAMA: A 3D shape descriptor based on panoramic views for unsupervised 3D object retrieval. International Journal of Computer Vision, 89(2-3):177–192, 2010.
[14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[15] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3D model retrieval. In Computer Graphics Forum, volume 22, pages 223–232. Wiley Online Library, 2003.
[16] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Symposium on Geometry Processing, volume 6, pages 156–164, 2003.
[17] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[18] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
[19] Ayan Sinha, Jing Bai, and Karthik Ramani. Deep learning 3D shape surfaces using geometry images. In European Conference on Computer Vision, pages 223–240. Springer, 2016.
[20] Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, and Longin Jan Latecki. GIFT: A real-time and scalable 3D shape search engine. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5023–5032. IEEE, 2016.
[21] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
[22] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
[23] Alberto Garcia-Garcia, Francisco Gomez-Donoso, Jose Garcia-Rodriguez, Sergio Orts-Escolano, Miguel Cazorla, and J. Azorin-Lopez. PointNet: A 3D convolutional neural network for real-time object class recognition. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 1578–1584. IEEE, 2016.
[24] Xu Xu and Sinisa Todorovic. Beam search for learning a deep convolutional neural network of 3D shapes. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 3506–3511. IEEE, 2016.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. A method for extracting a three-dimensional model view based on a panorama and a multi-channel CNN, characterized by comprising the following steps:
projecting the 3D model onto the side surface of a cylinder meeting a preset condition, with the origin of the 3D model as the center and the axis of the 3D model parallel to one of the X, Y, Z principal axes, to obtain an initial panorama;
taking an angle φ in a plane formed by any two coordinate axes; sampling the angle φ of the 3D model surface in three-dimensional space and the y coordinate at preset rates, respectively, to obtain a pair of values s(φ, y) for each point in the initial panorama, to represent the position characteristics of the 3D model surface in three-dimensional space and the orientation characteristics of the 3D model surface; and constructing a multi-scale network and a multi-channel convolutional neural network, taking the position characteristics of the 3D model surface and the orientation characteristics of the 3D model surface as input, and carrying out network training and similarity measurement between two different 3D models.
2. The method for extracting the three-dimensional model view based on the panorama and the multi-channel CNN as claimed in claim 1, wherein the preset condition is as follows: the height of the cylinder is 2 times the radius of the bottom surface.
3. The method as claimed in claim 1, wherein the preset rates are: sampling the angle φ and the y coordinate at rates 2B and B, respectively.
4. The method for extracting three-dimensional model view based on panorama and multi-channel CNN as claimed in claim 1,
the multi-scale network comprises: extracting view descriptors of different resolutions of the same input picture respectively, wherein the size of the input picture is 256 × 256;
for the first scale, the size is 256 × 256, feature mapping is obtained through the convolution layer of VGG16, and 4096-dimensional feature mapping is obtained through normalization processing;
for the second scale, converting the scale of the input picture into 128 × 128, performing down-sampling, obtaining the feature mapping of the low-resolution picture through the convolution layer, and obtaining 3072-dimensional feature mapping through the maximum pooling layer and normalization processing;
for the third scale, converting the scale of the input picture into 64 x 64, performing down-sampling, obtaining the feature mapping of the low-resolution picture through the convolution layer, and obtaining 3072-dimensional feature mapping through the maximum pooling layer and normalization processing;
performing linear fusion on the outputs of the three scales to obtain a 4096-dimensional feature map, then obtaining a view descriptor through a fully connected layer, and obtaining a classification result vector through a dropout layer and a softmax layer;
finally, the softmax layer outputs the class probability given the input 3D model; the class with the highest probability is considered the predicted class of the 3D model, trained using a stochastic gradient descent method with momentum set to 0.9.
5. The panorama and multi-channel CNN-based three-dimensional model view extraction method of claim 1, wherein the multi-channel convolutional neural network comprises 6 channels,
branch channels are created and segmented according to the 3 axes of the panoramic view; for the classification task, the probability vector is calculated by taking the mean value of all three individual probability vectors;
each 3D model has 6 descriptors, three of which are spatial distribution descriptors on XYZ axes and are used for describing the position characteristics of the surface of the 3D model;
the other three are normal vector distribution descriptors on XYZ axes, namely the direction characteristics of the 3D model surface;
each 3D model descriptor is compared to the remaining 3D model descriptors using an L1 distance metric for these 6 descriptors.
CN201810879211.5A 2018-08-03 2018-08-03 Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN Active CN109063139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810879211.5A CN109063139B (en) 2018-08-03 2018-08-03 Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810879211.5A CN109063139B (en) 2018-08-03 2018-08-03 Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN

Publications (2)

Publication Number Publication Date
CN109063139A CN109063139A (en) 2018-12-21
CN109063139B (en) 2021-08-03

Family

ID=64833158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810879211.5A Active CN109063139B (en) 2018-08-03 2018-08-03 Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN

Country Status (1)

Country Link
CN (1) CN109063139B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754006A (en) * 2018-12-26 2019-05-14 清华大学 A kind of view and the stereoscopic vision content categorizing method and system of point cloud fusion
CN110163091B (en) * 2019-04-13 2023-05-26 天津大学 Three-dimensional model retrieval method based on LSTM network multi-mode information fusion
CN110570522B (en) * 2019-08-22 2023-04-07 天津大学 Multi-view three-dimensional reconstruction method
CN110910344B (en) * 2019-10-12 2022-09-13 上海交通大学 Panoramic picture no-reference quality evaluation method, system and equipment
CN111242207A (en) * 2020-01-08 2020-06-05 天津大学 Three-dimensional model classification and retrieval method based on visual saliency information sharing
CN111310670B (en) * 2020-02-19 2024-02-06 江苏理工学院 Multi-view three-dimensional shape recognition method based on predefined and random viewpoints
CN111460193B (en) * 2020-02-28 2022-06-14 天津大学 Three-dimensional model classification method based on multi-mode information fusion
CN111402217B (en) * 2020-03-10 2023-10-31 广州视源电子科技股份有限公司 Image grading method, device, equipment and storage medium
CN112270762A (en) * 2020-11-18 2021-01-26 天津大学 Three-dimensional model retrieval method based on multi-mode fusion
CN116883880B (en) * 2023-09-07 2023-11-28 江苏省特种设备安全监督检验研究院 Crane identification method and device based on AR technology and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548516B (en) * 2015-09-23 2021-05-14 清华大学 Three-dimensional roaming method and device
US10115032B2 (en) * 2015-11-04 2018-10-30 Nec Corporation Universal correspondence network
CN106951501B (en) * 2017-03-16 2020-05-12 天津大学 Three-dimensional model retrieval method based on multi-graph matching
CN107967484B (en) * 2017-11-14 2021-03-16 中国计量大学 Image classification method based on multi-resolution
CN107944390B (en) * 2017-11-24 2018-08-24 西安科技大学 Motor-driven vehicle going objects in front video ranging and direction localization method

Also Published As

Publication number Publication date
CN109063139A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109063139B (en) Three-dimensional model classification and retrieval method based on panorama and multi-channel CNN
Ahmed et al. A survey on deep learning advances on different 3D data representations
Li et al. So-net: Self-organizing network for point cloud analysis
Ahmed et al. Deep learning advances on different 3D data representations: A survey
Zhou et al. Voxelnet: End-to-end learning for point cloud based 3d object detection
Zhi et al. LightNet: A Lightweight 3D Convolutional Neural Network for Real-Time 3D Object Recognition.
CN110457515B (en) Three-dimensional model retrieval method of multi-view neural network based on global feature capture aggregation
CN111242207A (en) Three-dimensional model classification and retrieval method based on visual saliency information sharing
CN111460193B (en) Three-dimensional model classification method based on multi-mode information fusion
Hu et al. MAT-Net: Medial Axis Transform Network for 3D Object Recognition.
Zhang et al. Learning rotation-invariant representations of point clouds using aligned edge convolutional neural networks
Liang et al. MVCLN: multi-view convolutional LSTM network for cross-media 3D shape recognition
Zeng et al. Multi-feature fusion based on multi-view feature and 3D shape feature for non-rigid 3D model retrieval
Xuan et al. MV-C3D: A spatial correlated multi-view 3d convolutional neural networks
Wang et al. Multi-view attention-convolution pooling network for 3D point cloud classification
Ding et al. An efficient 3D model retrieval method based on convolutional neural network
Li et al. Deep residual neural network based PointNet for 3D object part segmentation
CN114299339A (en) Three-dimensional point cloud model classification method and system based on regional correlation modeling
Nie et al. The assessment of 3D model representation for retrieval with CNN-RNN networks
Nie et al. Panorama based on multi-channel-attention CNN for 3D model recognition
Zhu et al. Training convolutional neural network from multi-domain contour images for 3D shape retrieval
Chen et al. 3D object classification with point convolution network
Zou et al. A 3D model feature extraction method using curvature-based shape distribution
Ramasinghe et al. Blended convolution and synthesis for efficient discrimination of 3D shapes
Xu et al. Learning discriminative and generative shape embeddings for three-dimensional shape retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant