AU2020204549A1 - Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network - Google Patents


Info

Publication number
AU2020204549A1
Authority
AU
Australia
Prior art keywords
model
convolutional network
local graph
retrieval
graph convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
AU2020204549A
Inventor
Da Chen
Zhiyong CHENG
Zan GAO
Yinmin LI
Liqiang NIE
Minglei SHU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Shandong Institute of Artificial Intelligence
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Shandong Institute of Artificial Intelligence
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan, Shandong Institute of Artificial Intelligence filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Publication of AU2020204549A1 publication Critical patent/AU2020204549A1/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The present invention relates to the field of computer vision and deep learning. In order to address the drawback of existing view-based deep learning methods that they cannot capture comprehensive spatial information of a 3D model, the present invention provides a multi-view three-dimensional model retrieval method based on a non-local graph convolutional network, which explores and fuses high-response features of multiple views and thus obtains a single, compact, highly discriminative model descriptor. Its excellent performance has been verified in 3D model retrieval. The present invention includes the following steps: (1) acquiring multi-perspective images of a model; (2) preprocessing the multi-perspective images; (3) designing a non-local graph convolutional network; (4) training the non-local graph convolutional network; (5) extracting model depth features; (6) retrieval and matching of the three-dimensional model. FIG. 1

Description

FIG. 1
MULTI-VIEW THREE-DIMENSIONAL MODEL RETRIEVAL METHOD BASED ON NON-LOCAL GRAPH CONVOLUTIONAL NETWORK

TECHNICAL FIELD
[01] The present invention relates to the field of computer vision and deep learning. The multi-view three-dimensional model retrieval method based on non-local graph convolutional network according to the present invention explores and fuses high response features of multiple views, and thus obtains a single, compact, highly discriminative model descriptor. The present invention can significantly improve the performance of 3D model retrieval.
BACKGROUND
[02] Image retrieval has been an urgent need since the invention of graphic computers. Early researchers developed a large number of image retrieval algorithms to meet this need, such as scale-invariant feature transformation, nearest neighbor estimation, weakly supervised deep metric learning, bipartite graphs and feature learning. In recent years, with the advancements in computer hardware performance and the rapid development of various 3D sensors and 3D modeling software, 3D models have become basic components in many fields, and retrieval of 3D models has become an important task. Therefore, designing and developing model retrieval algorithms is a hot topic in the field of computer vision. 3D model retrieval research can be divided into two stages: (1) early 3D model retrieval based on conventional methods; (2) 3D model retrieval based on deep neural networks.
[03] Early 3D model retrieval based on conventional methods includes model-based 3D model retrieval and multi-view-based 3D model retrieval. The process of 3D model retrieval includes two key steps: (1) feature extraction, and (2) model retrieval. The feature extraction of early model-based algorithms includes grid-based representation and point cloud-based representation, and the features are designed mostly based on their own geometric properties and shapes. With respect to model retrieval, researchers have developed various methods. Some researchers convert 3D shape information into histograms and use Euclidean distance to compare the histograms of two models to obtain their similarity.
[04] The 3D model retrieval based on deep neural networks, similarly, includes model-based 3D model retrieval and multi-view-based 3D model retrieval. The key steps of retrieval are similar to the conventional methods. Deep neural networks have achieved excellent performance in image classification, image segmentation and other fields, and many 3D model retrieval methods based on deep neural networks have been proposed. Model-based methods mainly use 3D convolution or 2D convolution to capture the point cloud and mesh feature information of the model. Multi-view-based algorithms use neural networks such as AlexNet, GoogLeNet, VGG and ResNet to extract deep learning features, and then use conventional retrieval algorithms to retrieve the 3D model. However, view features extracted by a neural network alone cannot capture the comprehensive information of the 3D model.
[05] In 3D model retrieval based on multiple perspectives, each 3D model is presented in multiple perspectives; but existing deep neural networks mainly identify a single image and the identifying performance is limited by the incompleteness of information. How to aggregate multi-perspective image information and how to capture the spatial characteristics of the model are the keys to improving the performance of the 3D model retrieval.
SUMMARY OF PARTICULAR EMBODIMENTS
[06] In order to address the drawback of existing view-based deep learning methods that they cannot capture comprehensive spatial information of a 3D model, the present invention provides a multi-view three-dimensional model retrieval method based on a non-local graph convolutional network, which explores and fuses high-response features of multiple views and thus obtains a single, compact, highly discriminative model descriptor. Its excellent performance has been verified in 3D model retrieval.
[07] A multi-view three-dimensional model retrieval method based on non-local graph convolutional network is provided, which has excellent performance in 3D model retrieval, the method including the following steps.
Step 1 - acquiring multi-perspective images of a model:
The present invention is applicable both to real-world objects and to computer-made three-dimensional models. When acquiring multi-view images of a model, for real-world objects, multiple angle views of an object can be captured by arranging multiple multi-angle cameras; for computer three-dimensional models, multiple angle views of the model can be captured by setting various angles of a virtual camera in the software.
Step 2 - preprocessing the multi-perspective images:
In order to better train the network and meet the requirements of the non-local graph convolutional network-based retrieval according to the present invention, the multi-perspective images are preprocessed, which includes image cropping, image resizing, image flipping and image normalization.
Step 3 - designing a non-local graph convolutional network:
In order to address the problem that existing neural networks cannot fully capture the spatial information of the model itself, the present invention designs a non-local graph convolutional network to explore and fuse high-response features of multiple views. The non-local graph convolutional network includes the following modules: convolution module 1, a graph convolution module, convolution module 2, a second graph convolution module, convolution module 3, a feature fusion module, and a model classification module.
Step 4 - training the non-local graph convolutional network:
Through the previous three steps, the data required for training the non-local graph convolutional network and its network architecture are obtained. The present invention uses the PyTorch deep learning framework to train the network model, and the language is Python 3.6. The network can take multiple images as input simultaneously. As the number of iterations increases, the value of the loss function decreases until convergence, the convergence condition being that the value of the loss function is stable at about 1 x 10^-3.
Step 5 - extracting model depth features:
The PyTorch deep learning framework is also used in extracting the deep features of the model. Once the non-local graph convolutional network according to the present invention is trained, the trained non-local graph convolutional network model parameters are obtained. Then all three-dimensional models involved in the retrieval and matching are input into the pre-trained non-local graph convolutional network. High-response features of the multiple views are obtained through the previous convolution modules and graph convolution modules. Max pooling is used to fuse the multiple views, to obtain a single, compact, highly discriminative model descriptor.
Step 6 - retrieval and matching of three-dimensional model:
Using a correlation-based measurement for the three-dimensional model retrieval, where an L2-norm-based Euclidean distance measurement is used to calculate the distance between two models and the magnitude of the distance represents the correlation between the three-dimensional models, the calculation being:
L(a, b) = sqrt( Σ_{i=1}^{n} (a_i - b_i)^2 )
where a and b represent two different models; L(a, b) is the calculated distance between the two models; a_i and b_i represent an ith dimension feature of a and an ith dimension feature of b, respectively.
[08] The present invention has the following advantages and positive effects:
1) For multi-perspective images, non-local graph convolution layers are used to capture advanced spatial features between multiple views. 2) Max pooling is used to fuse the multiple views and obtain their high-response characteristics, so as to obtain a compact and highly discriminative model descriptor. 3) The multi-view 3D model retrieval method based on a non-local graph convolutional network according to the present invention achieves excellent performance in 3D model retrieval.
BRIEF DESCRIPTION OF THE DRAWINGS
[09] FIG. 1 is a flowchart of the present invention.
[010] FIG. 2 illustrates a non-local graph convolution network according to the present invention.
[011] FIG. 3 shows an example of a multi-view representation of a three-dimensional model.
[012] FIG. 4 shows an example of multi-view acquisition of a three-dimensional model.
[013] FIG. 5 shows a performance comparison between the present invention and a current advanced method on an MVRED dataset.
[014] FIG. 6 shows a performance comparison between the present invention and a current advanced method on an NTU dataset. The corresponding literatures of the current methods in FIG. 5 and FIG. 6 are shown below:
[1] Krizhevsky A, Sutskever I, Hinton G E, et al. ImageNet Classification with Deep Convolutional Neural Networks[J]. Neural Information Processing Systems, 2012, 141(5): 1097-1105.
[2] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[J]. Computer Vision and Pattern Recognition, 2016: 770-778.
[3] Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition[J]. International Conference on Computer Vision, 2015: 945-953.
DETAILED DESCRIPTION OF PARTICULAR EMBODIMENTS
[015] The present invention is further described below with reference to the drawings.
[016] Embodiment 1
FIG. 1 shows a flowchart of a multi-view three-dimensional model retrieval method based on non-local graph convolutional network according to the present invention. Implementation steps of the method include the following.
Step 1 - acquiring multi-perspective images of a model
The present invention is applicable both to real-world objects and to computer-made three-dimensional models. When acquiring multi-view images of a model, for real-world objects, multiple angle views of an object can be captured by arranging multiple multi-angle cameras, and preferably the multiple groups of cameras are evenly distributed, to obtain multi-view images with rich information. In an embodiment of the present invention, twelve cameras are placed around the object at intervals of 30 degrees, to capture twelve multi-perspective pictures of the object. For computer three-dimensional models, multiple angle views of the model can be captured by setting various angles of a virtual camera in the software, where the virtual camera renders each mesh. Similar to the camera setup above, twelve virtual cameras are placed around the mesh at intervals of 30 degrees, to generate twelve rendered views. If there are n classes and each class has m models, the three-dimensional model dataset may contain a total of n x m x 12 multi-perspective images.
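The acquisition layout above can be sketched as follows; the function names are illustrative assumptions, and only the 30-degree interval and the n x m x 12 count come from the embodiment.

```python
# Sketch of the multi-view acquisition layout of Step 1: twelve cameras
# (real or virtual) spaced at 30-degree intervals around the object,
# giving n * m * 12 views for a dataset with n classes of m models each.

def camera_azimuths(num_views=12):
    """Return the evenly spaced azimuth angles, in degrees."""
    step = 360 // num_views          # 30 degrees for twelve cameras
    return [i * step for i in range(num_views)]

def dataset_view_count(num_classes, models_per_class, num_views=12):
    """Total number of multi-perspective images for the whole dataset."""
    return num_classes * models_per_class * num_views

angles = camera_azimuths()           # [0, 30, 60, ..., 330]
total = dataset_view_count(10, 50)   # 10 classes x 50 models x 12 views
```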
Step 2 - preprocessing the multi-perspective images
In order to better train the network and meet the requirements of the non-local graph convolutional network-based retrieval according to the present invention, the multi-perspective images are preprocessed, which includes image cropping, image resizing, image flipping and image normalization. The data normalization is used to normalize the original data to a statistical distribution on a fixed interval, to ensure that the program converges faster. The original images are resized to a unified size because the size of the network model is fixed once the design is completed, and the size of the input images has to match the size of the network model. Random cropping, horizontal flipping and vertical flipping increase the amount of data, to generalize the trained model parameters and prevent the network model from overfitting.
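The normalization step can be sketched as follows; min-max scaling to [0, 1] and zero-padding are illustrative assumptions, since the patent does not fix the exact interval or resizing scheme.

```python
# Sketch of the per-image normalization of Step 2: map raw pixel values
# onto a fixed interval so training converges faster. Min-max scaling to
# [0, 1] is assumed here; the patent does not specify the interval.

def normalize(pixels):
    """Linearly rescale a flat list of pixel values into [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # constant image: map to 0.0
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

def resize_to(pixels, size):
    """Crop or zero-pad so the input matches the fixed network input size."""
    return pixels[:size] + [0] * max(0, size - len(pixels))
```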
Step 3 - designing a non-local graph convolutional network
In order to address the problem that existing neural networks cannot fully capture the spatial information of the model itself, the present invention designs a non-local graph convolutional network to explore and fuse high-response features of multiple views. The non-local graph convolutional network includes the following modules or blocks: convolution module or block 1, a graph convolution module, convolution module or block 2, a second graph convolution module, convolution module or block 3, a feature fusion module, and a model classification module. The convolution module 1 includes a convolution layer with a stride of 2 and a convolution kernel of 7 x 7; following the convolution is a max pooling layer with a stride of 2 and a kernel of 3 x 3. The graph convolution module connected to the convolution module 1 captures the spatial features contained in the multiple views. It is inspired by the idea of non-local means, whose definition is expressed below:
NL[v](i) = Σ_{j∈I} w(i, j) v(j)    (2)
where I is the search area; v(j) denotes other points in the area; w(i, j) is the weight, which is determined by the correlation between matching blocks; NL[v](i) is the output corresponding to point i. In the present invention, graph convolution is used to capture the long-range dependency between views, and the convolution form is designed from how to establish the relationship between multiple views of the model. The form of the non-local graph convolution is shown below:
y_i = (1 / C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (3)
where i is an index of the output position, which can be a corresponding spatial position, and its response is calculated from x_i and the corresponding points x_j in the other views; x is the input signal, which is the feature of the views, and y is the corresponding output signal with the same size as x. The matching function f is used to calculate the correlation between the input signals; a one-variable function g is used to scale the input; and a function C(x) is used to normalize the summation of the output signals.
[017] The matching function f is calculated according to the equation below:
f(x_i, x_j) = e^(θ(x_i)^T φ(x_j))    (4)
where θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j; W_θ and W_φ are the corresponding convolution layer parameters. The normalization factor is C(x) = Σ_{∀j} f(x_i, x_j).
[018] The one-variable function g is a linear function:
g(x_j) = W_g x_j    (5)
In order to enable non-local operations to be incorporated into convolutional networks of many existing architectures, the non-local module is defined as follows:
z_i = W_z y_i + x_i    (6)
where W_z y_i + x_i represents a residual connection; y_i is calculated according to equation (3); x_i is the original input; and z_i is the final output of the graph convolution module. This residual connection allows the non-local module to be inserted into any pre-trained network model without damaging the initialization of the original model.
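Equations (3) to (6) can be sketched in pure Python as follows; using identity matrices for W_θ, W_φ, W_g and W_z is an illustrative assumption (in the network these are learned convolution parameters), and the two-dimensional toy features stand in for real view features.

```python
import math

# Pure-Python sketch of the non-local operation of equations (3)-(6):
# y_i = (1/C(x)) * sum_j f(x_i, x_j) * g(x_j), with the matching function
# f(x_i, x_j) = exp(theta(x_i) . phi(x_j)) of equation (4), and the
# residual output z_i = W_z y_i + x_i of equation (6).

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def non_local(views):
    """views: list of per-view feature vectors; returns z, same shape."""
    n = len(views)
    z = []
    for i in range(n):
        # f(x_i, x_j) with theta = phi = identity (illustrative assumption)
        weights = [math.exp(dot(views[i], views[j])) for j in range(n)]
        c = sum(weights)                       # normalization factor C(x)
        d = len(views[i])
        # y_i = (1/C) * sum_j f(x_i, x_j) * g(x_j), with g = identity here
        y = [sum(weights[j] * views[j][k] for j in range(n)) / c
             for k in range(d)]
        # residual connection: z_i = W_z y_i + x_i, with W_z = identity here
        z.append([y[k] + views[i][k] for k in range(d)])
    return z

out = non_local([[1.0, 0.0], [0.0, 1.0]])
```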
[019] Following the non-local module is convolution module 2, which includes eight convolution layers with a stride of 1 and a kernel of 3 x 3. Following module 2, a non-local module is added again to capture advanced model spatial features. Then comes convolution module 3, which includes eighteen convolution layers with a stride of 1 and a kernel of 3 x 3, for extracting high-dimensional abstract features of each multi-view image. After the high-dimensional abstract features of the multi-view images are extracted, max pooling is used to aggregate them into a single, compact, highly discriminative model descriptor. A classifier is subsequently added to classify and adjust the model; the SoftMax classification loss function is used for classification adjustment.
[020] Step 4 - training the non-local graph convolutional network
Through the previous three steps, the data required for training the non-local graph convolutional network and its network architecture are obtained. The present invention uses the PyTorch deep learning framework to train the network model, and the language is Python 3.6. The network can take multiple images as input simultaneously. In an embodiment of the present invention, there are twelve views of a model, and the number of input images is a multiple of 12. In the initial parameter setting, the number of iterations is set to 40, with 32 samples for each iteration, the initial learning rate is 0.001, and the pre-trained network model parameters are parameters pre-trained on the large dataset ImageNet. The present invention uses an adaptive gradient optimizer, which can adaptively adjust the learning rate for different parameters. As the number of iterations increases, the value of the loss function decreases until convergence, the convergence condition being that the value of the loss function is stable at about 1 x 10^-3.
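The multiple-of-twelve input constraint described above can be sketched as a simple grouping step; the helper below is illustrative, and the config dictionary only echoes the hyperparameter values quoted in the embodiment.

```python
# Sketch of batching for Step 4: every model contributes twelve views,
# so a flat image list is grouped into per-model chunks of twelve
# before being fed to the network.

VIEWS_PER_MODEL = 12

def group_views(images):
    """Split a flat list of view images into per-model groups of twelve."""
    if len(images) % VIEWS_PER_MODEL != 0:
        raise ValueError("number of input images must be a multiple of 12")
    return [images[i:i + VIEWS_PER_MODEL]
            for i in range(0, len(images), VIEWS_PER_MODEL)]

# Hyperparameters quoted from the embodiment: 40 iterations,
# 32 samples per iteration, initial learning rate 0.001.
config = {"iterations": 40, "batch_size": 32, "learning_rate": 1e-3}
```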
Step 5 - extracting model depth features
The PyTorch deep learning framework is also used in extracting the deep features of the model. Once the non-local graph convolutional network according to the present invention is trained, the trained non-local graph convolutional network model parameters are obtained. Then all three-dimensional models involved in the retrieval and matching are input into the pre-trained non-local graph convolutional network. The model input is the multi-view images representing a single model. High-response features of the multiple views are obtained through the convolution modules and graph convolution modules previously discussed. Max pooling is used to fuse the multiple views, to obtain a single, compact, highly discriminative model descriptor. The present invention uses the output of the max-pooling operation as the model feature, with a feature dimension of 512.
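The max-pooling fusion of view features can be sketched as follows; the 4-dimensional toy vectors stand in for the 512-dimensional features of the embodiment.

```python
# Sketch of the view-fusion step of Step 5: element-wise max pooling
# over the per-view feature vectors yields one compact model descriptor
# (512-dimensional in the embodiment; 4-dimensional here for brevity).

def max_pool_views(view_features):
    """Fuse per-view feature vectors into a single descriptor."""
    return [max(col) for col in zip(*view_features)]

views = [[0.1, 0.9, 0.0, 0.3],
         [0.8, 0.2, 0.5, 0.1],
         [0.4, 0.4, 0.7, 0.6]]
descriptor = max_pool_views(views)
```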
Step 6 - retrieval and matching of three-dimensional model
Given a model, find a model in a target dataset that belongs to the same class as the given model, i.e., a related model. Assuming a retrieval model set Q and a dataset to be searched G, the target is to find the models in G that are related to each model in Q. The implementation is to calculate the correlations between a model Q_i and the models in dataset G, and sort them according to the magnitude of the correlation, so as to obtain the models related to the model Q_i. A specific implementation form is shown below.
[021] Both the retrieval model set and the dataset to be searched are represented by feature vectors. In the present invention, features can be extracted according to step 5. Upon obtaining the feature representation of each model in the retrieval dataset and in the dataset to be searched, the distance between the model Q_i and each model in the dataset G to be searched is calculated according to the equation below:
L_ij = f(Q_i, G_j)    (7)
where L_ij is the distance between the models Q_i and G_j; f(Q_i, G_j) denotes the distance measurement between the two models. In the present invention, Euclidean distance is used as the distance measurement, and the calculation is shown below:
d(x, y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )    (8)
where x and y represent two different models; d(x, y) is the distance between them; x_i and y_i represent an ith dimension feature of x and an ith dimension feature of y, respectively. Upon obtaining the distances between the model Q_i and the models in G, the distances are sorted, and the first k models can be taken as the models related to Q_i. Sequentially, the models in G that are related to each model in Q are obtained.
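The distance computation of equation (8) and the sort-and-take-first-k retrieval can be sketched as follows; the helper names and the toy descriptors are illustrative assumptions.

```python
import math

# Sketch of Step 6: Euclidean distance (equation (8)) between a query
# descriptor and every gallery descriptor, sorted so the first k
# gallery models are returned as the related models.

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def retrieve(query, gallery, k=5):
    """Return indices of the k gallery descriptors closest to the query."""
    order = sorted(range(len(gallery)),
                   key=lambda j: euclidean(query, gallery[j]))
    return order[:k]

gallery = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
nearest = retrieve([0.0, 0.1], gallery, k=2)
```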
[022] In order to verify the effectiveness of the present invention, evaluations have been performed on the public 3D model datasets MVRED and NTU. FIG. 5 and FIG. 6 show performance comparisons between different algorithms and the method of the present invention. As can be seen from the figures, the proposed multi-view 3D model retrieval method based on a non-local graph convolutional network according to the present invention has excellent performance.
[023] The appended claims are to be considered as incorporated into the above description.
[024] Throughout this specification, reference to any advantages, promises, objects or the like should not be regarded as cumulative, composite and/or collective and should be regarded as preferable or desirable rather than stated as a warranty.
[025] Throughout this specification, unless otherwise indicated, "comprise," "comprises," and "comprising" (and variants thereof), or related terms such as "includes" (and variants thereof), are used inclusively rather than exclusively, so that a stated integer or group of integers may include one or more other non-stated integers or groups of integers.
[026] When any number or range is described herein, unless clearly stated otherwise, that number or range is approximate. Recitation of ranges of values herein are intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value and each separate subrange defined by such separate values is incorporated into the specification as if it were individually recited herein.
[027] Words indicating direction or orientation, such as "front", "rear", "back", etc., are used for convenience. The inventor(s) envisage that various embodiments can be used in a non-operative configuration, such as when presented for sale. Thus, such words are to be regarded as illustrative in nature, and not as restrictive.
[028] The term "and/or", e.g., "A and/or B" shall be understood to mean either "A and B" or "A or B" and shall be taken to provide explicit support for both meanings or for either meaning.
[029] It should be noted that the embodiments described herein are for illustrative purposes only and shall not be construed as limiting the scope of the invention. The present invention has been described in detail with reference to the embodiments; however, it should be understood by those skilled in the art that modifications or equivalents can be made to the technical solutions of the present invention without deviation from the spirit and scope of the present invention. All those modifications or equivalents shall fall within the scope of the invention.

Claims (1)

CLAIMS
What is claimed is:
1. A multi-view three-dimensional model retrieval method based on non-local graph convolutional network, comprising:
Step 1 - acquiring multi-perspective images of a model:
capturing multiple angle views of an object by setting multiple multi-angle cameras, and capturing multiple angle views of a computer three-dimensional model by setting multiple angles of a virtual camera in software;
Step 2 - preprocessing the multi-perspective images:
preprocessing the multi-perspective images, which comprises image cropping, image resizing, image flipping and image normalization;
Step 3 - designing a non-local graph convolutional network:
designing a non-local graph convolution network to explore and fuse high-response features of multiple views, where the non-local graph convolutional network comprises convolution modules, including graph convolution modules, a feature fusion module and a model classification module;
Step 4 - training the non-local graph convolutional network:
with data for training the non-local graph convolutional network and its network architecture obtained through the previous three steps, using the PyTorch deep learning framework to train a network model, Python 3.6 being the language, where multiple images are input simultaneously to the network; as the number of iterations increases, the value of a loss function decreases until convergence, the convergence condition being that the value of the loss function is stable at about 1 x 10^-3;
Step 5 - extracting model depth features:
using the PyTorch deep learning framework to extract deep features of the model; obtaining trained non-local graph convolutional network model parameters upon completing the training of the non-local graph convolutional network; inputting all three-dimensional models involved in the retrieval and matching into the pre-trained non-local graph convolutional network; performing max pooling to fuse the multiple views with the high-response features of the multiple views obtained through the previous convolution modules and graph convolution modules, to obtain a single, compact, highly discriminative model descriptor;
Step 6 - retrieval and matching of three-dimensional model:
using a correlation-based measurement for the three-dimensional model retrieval, where an L2-norm-based Euclidean distance measurement is used to calculate the distance between two models and the magnitude of the distance represents the correlation between the three-dimensional models, the calculation being:
L(a, b) = sqrt( Σ_{i=1}^{n} (a_i - b_i)^2 )    (1)
where a and b represent two different models; L(a, b) is the calculated distance between the two models; a_i and b_i represent an ith dimension feature of a and an ith dimension feature of b, respectively.
AU2020204549A 2019-09-09 2020-07-08 Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network Pending AU2020204549A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910848660.8 2019-09-09
CN201910848660.8A CN110543581B (en) 2019-09-09 2019-09-09 Multi-view three-dimensional model retrieval method based on non-local graph convolution network

Publications (1)

Publication Number Publication Date
AU2020204549A1 true AU2020204549A1 (en) 2021-03-25

Family

ID=68713110

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2020204549A Pending AU2020204549A1 (en) 2019-09-09 2020-07-08 Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network
AU2020104423A Ceased AU2020104423A4 (en) 2019-09-09 2020-07-08 Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network

Family Applications After (1)

Application Number Title Priority Date Filing Date
AU2020104423A Ceased AU2020104423A4 (en) 2019-09-09 2020-07-08 Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network

Country Status (2)

Country Link
CN (1) CN110543581B (en)
AU (2) AU2020204549A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096234B (en) * 2019-12-23 2022-09-06 复旦大学 Method and device for generating three-dimensional grid model by using multiple color pictures
CN111310821B (en) * 2020-02-11 2023-11-21 佛山科学技术学院 Multi-view feature fusion method, system, computer equipment and storage medium
CN111340866B (en) * 2020-02-26 2024-03-01 腾讯科技(深圳)有限公司 Depth image generation method, device and storage medium
CN111694976A (en) * 2020-06-10 2020-09-22 上海理工大学 Three-dimensional point cloud data retrieval method based on multi-view convolution pooling
CN111797269A (en) * 2020-07-21 2020-10-20 天津理工大学 Multi-view three-dimensional model retrieval method based on multi-level view associated convolutional network
CN114036969B (en) * 2021-03-16 2023-07-25 上海大学 3D human body action recognition algorithm under multi-view condition
CN113393582A (en) * 2021-05-24 2021-09-14 电子科技大学 Three-dimensional object reconstruction algorithm based on deep learning
CN115240093B (en) * 2022-09-22 2022-12-23 山东大学 Automatic power transmission channel inspection method based on visible light and laser radar point cloud fusion

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528826A (en) * 2016-11-18 2017-03-22 广东技术师范学院 Deep learning-based multi-view appearance patent image retrieval method
CN106570192A (en) * 2016-11-18 2017-04-19 广东技术师范学院 Deep learning-based multi-view image retrieval method
CN106951923B (en) * 2017-03-21 2020-06-16 西北工业大学 Robot three-dimensional shape recognition method based on multi-view information fusion
CN107066559B (en) * 2017-03-30 2019-12-27 天津大学 Three-dimensional model retrieval method based on deep learning
CN107133284A (en) * 2017-04-18 2017-09-05 天津大学 A view-based three-dimensional model retrieval method based on manifold learning
CN109635843B (en) * 2018-11-14 2021-06-18 浙江工业大学 Three-dimensional object model classification method based on multi-view images
CN109902590B (en) * 2019-01-30 2022-09-16 西安理工大学 Pedestrian re-identification method for deep multi-view characteristic distance learning
CN110147460B (en) * 2019-04-23 2021-08-06 湖北大学 Three-dimensional model retrieval method and device based on convolutional neural network and multi-view map

Also Published As

Publication number Publication date
CN110543581A (en) 2019-12-06
AU2020104423A4 (en) 2021-05-27
CN110543581B (en) 2023-04-04

Similar Documents

Publication Publication Date Title
AU2020104423A4 (en) Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network
Georgiou et al. A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
Zhong Intrinsic shape signatures: A shape descriptor for 3D object recognition
Gavrila A Bayesian, exemplar-based approach to hierarchical shape matching
Su et al. Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories
Pang et al. 3D point cloud object detection with multi-view convolutional neural network
CN110457515B (en) Three-dimensional model retrieval method of multi-view neural network based on global feature capture aggregation
CN109766873B (en) Pedestrian re-identification method based on hybrid deformable convolution
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN107085731B (en) Image classification method based on RGB-D fusion features and sparse coding
CN112633350A (en) Multi-scale point cloud classification implementation method based on graph convolution
Carneiro et al. Flexible spatial configuration of local image features
Ward et al. RGB-D image-based object detection: from traditional methods to deep learning techniques
Verma et al. Wild animal detection from highly cluttered images using deep convolutional neural network
Ahmad et al. 3D capsule networks for object classification from 3D model data
Ravi et al. Sign language recognition with multi feature fusion and ANN classifier
CN111797269A (en) Multi-view three-dimensional model retrieval method based on multi-level view associated convolutional network
Bickel et al. A novel shape retrieval method for 3D mechanical components based on object projection, pre-trained deep learning models and autoencoder
Gao et al. Evaluation of regularized multi-task learning algorithms for single/multi-view human action recognition
CN115170859A (en) Point cloud shape analysis method based on space geometric perception convolutional neural network
CN115063526A (en) Three-dimensional reconstruction method and system of two-dimensional image, terminal device and storage medium
Parsa et al. Coarse-grained correspondence-based ancient Sasanian coin classification by fusion of local features and sparse representation-based classifier

Legal Events

Date Code Title Description
DA3 Amendments made section 104

Free format text: THE NATURE OF THE AMENDMENT IS: APPLICATION IS TO PROCEED UNDER THE NUMBER 2020104423