Three-dimensional model retrieval method based on a multi-view neural network with global feature capture and aggregation
Technical Field
The invention belongs to the fields of computer vision and deep learning. It discloses a three-dimensional model retrieval method based on a multi-view neural network with global feature capture and aggregation, which mines the internal relations between the multi-view images representing a three-dimensional model and thereby improves the performance of three-dimensional model retrieval.
Background
With the great improvement of computer performance, a large number of three-dimensional models have been generated in fields such as three-dimensional medicine, virtual reality and three-dimensional games, and how to identify and retrieve these models has become a research direction of wide concern in computer vision. Model representation methods in three-dimensional model retrieval can be divided into two types. 1) Model-based representation methods, such as discrete mesh- or voxel-based representations, as well as point-cloud-based representations. Features designed on top of the model representation are mostly based on the shape of the model itself and its geometric properties, such as hand-designed three-dimensional histograms or bags of features constructed from surface curvatures and normals. 2) Multi-view-based representation methods, in which a three-dimensional model is represented by two-dimensional images acquired from different viewpoints. This two-dimensional-image-based representation also admits various hand-designed features, such as histograms of oriented gradients, Zernike moments and SIFT features.
However, conventional hand-designed features do not perform well in retrieval: because each design algorithm emphasizes different aspects, such features cannot comprehensively represent a model. Deep learning techniques, such as the classical AlexNet and GoogLeNet deep convolutional neural networks, have since been widely applied in computer vision. A neural network automatically learns and fits image features from data and, compared with hand-designed features, learns more comprehensive features, so the image recognition effect is greatly improved. In multi-view-based three-dimensional model retrieval, each three-dimensional model is represented by a plurality of view images, but existing deep neural networks are mainly used to identify a single image, and their recognition effect is limited by this incompleteness of information. How to aggregate multi-view image information and how to capture the spatial characteristics of the model are therefore the keys to improving three-dimensional model retrieval performance.
To aggregate multi-view information and capture model spatial features, the multi-view image features cannot simply be concatenated directly. Such concatenation has several defects: the feature dimension increases multiplicatively, which increases retrieval time, and simple concatenation cannot effectively capture spatial features, so the retrieval performance is not significantly improved.
Disclosure of Invention
The invention aims to solve the problems that existing methods cannot effectively aggregate multi-view features and lose model spatial information, and provides a three-dimensional model retrieval method based on a multi-view neural network with global feature capture and aggregation.
The method mines the internal relations among the multi-view images of a three-dimensional model, captures the spatial information of the three-dimensional model, and improves retrieval speed by fusing the multi-view features. The invention is specifically verified on three-dimensional model retrieval.
The invention provides a three-dimensional model retrieval method based on a multi-view neural network with global feature capture and aggregation, which comprises the following steps:
1st, multi-view representation of three-dimensional models
The invention retrieves three-dimensional models on the basis of their multi-view representation: after the three-dimensional model data are obtained, viewing angles are set in processing software, and view images of the three-dimensional model at the corresponding viewing angles are captured.
2nd, designing a network model
A special double-chain deep neural network model is designed according to the characteristics of three-dimensional model retrieval and is used to train and learn feature representations suitable for three-dimensional models. The double-chain deep neural network model comprises five parts: a low-dimensional convolution module, a non-local module, a high-dimensional convolution module, a weighted local aggregation layer and a classification layer. Meanwhile, a fusion loss function based on the center loss and the pairwise boundary loss is designed to increase the distinguishability between different classes of three-dimensional models.
3rd, generating the most difficult sample pairs
The double-chain deep neural network model requires its input in the form of sample pairs, and if all samples were paired, the number of generated sample pairs would be extremely large. The most difficult sample pairs are therefore generated according to the principle of pairing each sample with the farthest samples of its own class and the nearest samples of different classes.
4th, training the network model
The double-chain deep neural network is trained with a three-dimensional model training set; guided by the objective function, it learns network model parameters that comprehensively represent the training data.
5th, extracting depth features
In the retrieval process, each three-dimensional model is represented by features, and the invention extracts these features with the network model parameters trained in step 4. The input of the network model is the plurality of view images representing a single three-dimensional model; through the feature extraction and aggregation of the double-chain deep neural network, the features of the plurality of view images are aggregated into a highly discriminative three-dimensional model feature descriptor.
6th, performing three-dimensional model retrieval
Given a three-dimensional model, we want to find the three-dimensional models of the same kind, i.e. the related three-dimensional models, in the target dataset. The feature description and the distance measurement method are very important in three-dimensional model retrieval. The feature description uses the depth features extracted in step 5, and the distance measurement uses the Euclidean distance formula; the calculation process is as follows.
d(x, y) = √( Σ_i (x_i − y_i)² )    (1)

where x and y respectively represent three-dimensional models, d(x, y) represents the distance between the two three-dimensional models, and x_i, y_i respectively represent the i-th dimensional features of x and y.
The advantages and beneficial effects of the invention:
1) For the multi-view images, a non-local module is used to mine the potential relevance between the various view images.
2) A weighted local aggregation layer is used to aggregate the captured high-dimensional features of each view into a highly discriminative three-dimensional model feature descriptor.
3) Through the above two improvements, the invention achieves state-of-the-art performance in three-dimensional model retrieval; the retrieval results are shown in FIG. 5.
Drawings
FIG. 1 is a double-chain deep neural network structure designed by the present invention.
FIG. 2 is a retrieval flow diagram of the present invention.
FIG. 3 is an example of a three-dimensional model data set.
FIG. 4 is a multi-perspective image of a three-dimensional model.
FIG. 5 is a comparison of the retrieval performance of the method of the present invention with current state-of-the-art methods on the ModelNet40 dataset. The references corresponding to FIG. 5 are as follows.
[1] You H, Feng Y, Ji R, et al. PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition[C]. ACM Multimedia, 2018: 1310-1318.
[2] He X, Zhou Y, Zhou Z, et al. Triplet-Center Loss for Multi-view 3D Object Retrieval[C]. Computer Vision and Pattern Recognition (CVPR), 2018: 1945-1954.
[3] Yavartanoo M, Kim E Y, Lee K M. SPNet: Deep 3D Object Classification and Retrieval using Stereographic Projection[J]. arXiv: Computer Vision and Pattern Recognition, 2018.
[4] Feng Y, Zhang Z, Zhao X, et al. GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition[C]. Computer Vision and Pattern Recognition (CVPR), 2018: 264-272.
[5] Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition[C]. International Conference on Computer Vision (ICCV), 2015: 945-953.
[6] Bai S, Bai X, Zhou Z, et al. GIFT: A Real-Time and Scalable 3D Shape Search Engine[C]. Computer Vision and Pattern Recognition (CVPR), 2016: 5023-5032.
[7] Shi B, Bai S, Zhou Z, et al. DeepPano: Deep Panoramic Representation for 3-D Shape Recognition[J]. IEEE Signal Processing Letters, 2015, 22(12): 2339-2343.
[8] Sinha A, Bai J, Ramani K. Deep Learning 3D Shape Surfaces Using Geometry Images[C]. European Conference on Computer Vision (ECCV), 2016: 223-240.
[9] Wu Z, Song S, Khosla A, et al. 3D ShapeNets: A Deep Representation for Volumetric Shapes[C]. Computer Vision and Pattern Recognition (CVPR), 2015: 1912-1920.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Example 1:
FIG. 2 is a flowchart illustrating the steps of the three-dimensional model retrieval method based on a multi-view neural network with global feature capture and aggregation according to the present invention; the specific operation steps are as follows.
Step one, multi-view representation of the three-dimensional model
Three-dimensional models in the fields of medicine, games, industrial design and the like are usually stored as polygonal meshes, which are collections of points connected by edges forming surfaces. In the method proposed by the present invention, the three-dimensional model therefore needs to be represented by multi-view images. To create this multi-view shape representation, viewpoints (virtual cameras) are set up to render each mesh: we create 12 rendered view images by placing 12 virtual cameras around the mesh at 30-degree intervals, as shown in FIG. 4.
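For illustration, the following is a minimal sketch of how such camera positions can be computed; the rendering itself depends on the processing software used. The camera distance (radius) and elevation angle are assumptions for the example and are not specified above.

    import numpy as np

    def camera_positions(num_views=12, radius=2.0, elevation_deg=30.0):
        # Place virtual cameras around the mesh at equal azimuth intervals
        # (30-degree steps for 12 views); radius and elevation are assumed values.
        elev = np.radians(elevation_deg)
        positions = []
        for k in range(num_views):
            azim = np.radians(k * 360.0 / num_views)
            positions.append((radius * np.cos(elev) * np.cos(azim),
                              radius * np.cos(elev) * np.sin(azim),
                              radius * np.sin(elev)))
        return np.array(positions)  # shape (12, 3): one (x, y, z) per view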
Step two, designing a network model
According to the characteristics of three-dimensional model retrieval, a special double-chain deep neural network model is designed to train and learn feature representations suitable for three-dimensional models. The double-chain deep neural network model comprises five parts: a low-dimensional convolution module, a non-local module, a high-dimensional convolution module, a weighted local aggregation layer and a classification layer. Meanwhile, a fusion loss function based on the center loss and the pairwise boundary loss is designed to increase the distinguishability between different classes of three-dimensional models.
The low-dimensional convolution module contains a convolution layer with a 7×7 kernel and a stride of 2, followed by a max-pooling layer with a 3×3 kernel and a stride of 2. This module is used to capture and extract the low-dimensional features of each view. The non-local module is used to mine the relations between the views: through the non-local idea, a graph structure is constructed to connect the different view images. The formula of the non-local module is as follows.

Let V be a view set, v_a the a-th view in V, v_b the b-th view in V, and y_a the output corresponding to v_a:

y_a = (1 / U(V)) Σ_b g(v_a, v_b) h(v_b)    (2)

The pairwise function g is used to calculate the correlation between two views, the unary function h is used to scale the input, and the function U is used for normalization.
For the convenience of the convolution operation, the pairwise function g takes the following form:

g(v_a, v_b) = α(v_a)^T β(v_b)    (3)

where α(v_a) = W_α v_a, β(v_b) = W_β v_b, and W_α, W_β are learnable weight matrices. The normalization factor U(V) is N, where N is the number of views contained in the view set V.
The unary function h is a linear function:

h(v_b) = W_h v_b    (4)

where W_h is a convolutional-layer network parameter; a 1×1 convolution operation is used in the implementation. The output of the non-local module is as follows:

z_a = W_z y_a + v_a    (5)

where W_z y_a + v_a denotes a residual connection, y_a is calculated by formula (2), v_a is the original input, and z_a is the final output of the non-local module. Through this implementation form, the module can be conveniently inserted into an existing network model without adjusting the original model to accommodate it.
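For concreteness, the following is a minimal PyTorch sketch of such a non-local block operating across the views of one model. For brevity it uses linear layers on per-view feature vectors (equivalent to 1×1 convolutions on 1×1 feature maps); the layer sizes are assumptions for the example.

    import torch
    import torch.nn as nn

    class NonLocalViewBlock(nn.Module):
        # A sketch of a non-local block relating the N views of one model.
        def __init__(self, channels, inter_channels=None):
            super().__init__()
            inter = inter_channels or channels // 2
            self.alpha = nn.Linear(channels, inter, bias=False)  # W_α in formula (3)
            self.beta = nn.Linear(channels, inter, bias=False)   # W_β in formula (3)
            self.h = nn.Linear(channels, inter, bias=False)      # W_h in formula (4)
            self.w_z = nn.Linear(inter, channels, bias=False)    # W_z in formula (5)

        def forward(self, v):
            # v: (B, N, C) -- one C-dimensional feature vector per view.
            g = self.alpha(v) @ self.beta(v).transpose(1, 2)     # g(v_a, v_b), shape (B, N, N)
            y = (g @ self.h(v)) / v.size(1)                      # formula (2), with U(V) = N
            return self.w_z(y) + v                               # formula (5): residual connection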
A high-dimensional convolution module (four residual convolution modules) is added after the non-local module to capture the high-level abstract features of each view. The first module comprises 6 convolutional layers with 3×3 kernels, the second 8, the third 12 and the fourth 6, where a residual operation is performed over every two convolutional layers; the residual operation effectively alleviates vanishing-gradient problems during training. After the high-level abstract features of the multiple views are extracted, a weighted local aggregation layer self-learns differentiated weights for the multiple views. It assigns the multi-view pictures to class centers using a virtual-class-center idea: the virtual classes participate in the assignment process but are discarded directly when the output is formed, so the contribution of pictures with low distinctiveness is reduced. The input of this layer is the vector feature corresponding to each view, and its output is the class-center residual vector features with the virtual classes removed. Finally, these are aggregated into a compact, highly discriminative model descriptor through a max-pooling operation. The classification layer uses Softmax classification.
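A minimal sketch of one possible weighted local aggregation layer following the description above is given below; the numbers of real and virtual centers are assumptions for the example, not values specified in the invention.

    import torch
    import torch.nn as nn

    class WeightedLocalAggregation(nn.Module):
        # Sketch: soft assignment to class centers including virtual centers,
        # whose assignments are discarded from the output, then max-pooling.
        def __init__(self, dim, num_centers=8, num_virtual=1):
            super().__init__()
            self.assign = nn.Linear(dim, num_centers + num_virtual)
            self.centers = nn.Parameter(torch.randn(num_centers + num_virtual, dim))
            self.num_centers = num_centers

        def forward(self, x):
            # x: (B, N, D) -- one D-dimensional high-level feature per view.
            a = torch.softmax(self.assign(x), dim=-1)                 # assignment incl. virtual classes
            a = a[..., :self.num_centers]                             # discard virtual-class assignments
            resid = x.unsqueeze(2) - self.centers[:self.num_centers]  # (B, N, K, D) class-center residuals
            per_view = (a.unsqueeze(-1) * resid).flatten(2)           # (B, N, K*D) per-view residual vectors
            return per_view.max(dim=1).values                         # max-pooling into one compact descriptor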
On the basis of this single chain, a double-chain network is designed, and multiple loss functions, namely a pairwise boundary loss function and a center loss function, are fused to learn more discriminative feature descriptors.
The pairwise boundary loss function is shown below:

L_b(x_i, x_j) = (1 − y_ij)(α − d_ij) + y_ij [d_ij − (α − m)]    (6)

where x_i, x_j are a pair of samples, d_ij is the Euclidean distance between them, and y_ij is the pair label (1 for a same-class pair, 0 otherwise). α is the boundary, m is the margin between the farthest positive sample and the nearest negative sample, and L_b(x_i, x_j) is the resulting loss value.
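A minimal PyTorch sketch of this loss follows. A hinge (clamp at zero) is added here, as is conventional for margin-based losses, although formula (6) writes the two terms without it; the defaults α = 1.2 and m = 0.4 follow the parameter settings in step four.

    import torch

    def pairwise_boundary_loss(d_ij, y_ij, alpha=1.2, m=0.4):
        # d_ij: Euclidean distances of sample pairs; y_ij: 1 for same-class pairs, 0 otherwise.
        neg_term = (1 - y_ij) * torch.clamp(alpha - d_ij, min=0)  # push different classes beyond alpha
        pos_term = y_ij * torch.clamp(d_ij - (alpha - m), min=0)  # pull same class within alpha - m
        return (neg_term + pos_term).mean()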
The center loss function is as follows:

L_c = (1/2) Σ_i ‖x_i − c_i‖²    (7)

where x_i is a sample feature, c_i is the center of the class to which x_i belongs, and L_c is the corresponding center loss value. The distance between each sample feature and its corresponding class center is calculated; the smaller the distance, the smaller the loss value, which has the effect of reducing the intra-class distance and yields more discriminative feature descriptors.
By fusing these loss functions, the intra-class distance is reduced and the inter-class distance is increased, so more discriminative feature descriptors can be learned automatically.
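A minimal sketch of the center loss of formula (7), and of one way to fuse the two losses, follows; the fusion weight lambda_c is an assumption for the example, since the exact weighting is not specified above.

    import torch
    import torch.nn as nn

    class CenterLoss(nn.Module):
        # Sketch of formula (7): squared distance between features and their class centers.
        def __init__(self, num_classes, dim):
            super().__init__()
            self.centers = nn.Parameter(torch.randn(num_classes, dim))

        def forward(self, x, labels):
            # x: (B, D) sample features; labels: (B,) class indices.
            return 0.5 * ((x - self.centers[labels]) ** 2).sum(dim=1).mean()

    # Fused objective (lambda_c is an assumed weighting):
    # loss = pairwise_boundary_loss(d_ij, y_ij) + lambda_c * center_loss(x, labels)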
Step three most difficult sample pair generation
During the process of generating sample pairs, suppose the training set contains c classes with p objects in each class. To generate the most difficult positive sample pairs, for each object the k farthest objects within its own class are selected, giving a total of c·k·p positive sample pairs. To generate the most difficult negative sample pairs, each object is matched with the nearest object from the other classes as its nearest-neighbor different-class sample, giving a total of c·p negative sample pairs. A sketch of this mining procedure is given below.
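The following PyTorch sketch mines such pairs from a matrix of features; it assumes every class has more than k members.

    import torch

    def mine_hardest_pairs(features, labels, k=5):
        # features: (M, D); labels: (M,). Returns, per object, the indices of its
        # k farthest same-class samples and its nearest different-class sample.
        d = torch.cdist(features, features)                      # pairwise Euclidean distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
        pos_d = d.masked_fill(~same | self_mask, float('-inf'))  # keep same-class, non-self entries
        neg_d = d.masked_fill(same, float('inf'))                # keep different-class entries
        hardest_pos = pos_d.topk(k, dim=1).indices               # k farthest positives per object
        hardest_neg = neg_d.argmin(dim=1)                        # nearest negative per object
        return hardest_pos, hardest_neg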
Step four, training the network model
The double-chain deep neural network is trained with a three-dimensional model training set; guided by the objective function, it learns network model parameters that comprehensively represent the training data.
The invention can train the network model using the PyTorch deep learning framework. First, data preprocessing operations must be applied to the input data, including data normalization, unification of the original image size, random cropping, and random horizontal and vertical flipping. Data normalization normalizes the raw data to a statistical distribution on a fixed interval so as to accelerate program convergence. The original image size is unified because, once the network model is designed, its input size is fixed, so the size of the input images must be consistent with the size required by the network model. Random cropping and horizontal and vertical flipping increase the amount of data to prevent the network model from overfitting. In the initial parameter settings, we set the number of iteration rounds to 8 with 20 sample pairs per iteration, the initial learning rate to 0.0001, and the pre-trained network model parameters to those pre-trained on the large dataset ImageNet. The parameter α is set to 1.2 and m is set to 0.4. The invention uses an adaptive gradient optimizer that can adaptively adjust the learning rate for different parameters; the step update range is calculated by comprehensively considering the first-moment and second-moment estimates of the gradient.
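A minimal sketch of this setup follows. Adam is used here as the adaptive first-/second-moment optimizer described above, and an ImageNet-pre-trained ResNet-34 backbone (whose four residual stages contain 6, 8, 12 and 6 convolutional layers, matching step two) stands in for the convolution modules; both choices, the crop size, and the recent-torchvision weights API are assumptions for the example.

    import torch
    from torchvision import models, transforms

    # Data preprocessing: unify size, augment, normalize.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    # ImageNet-pre-trained backbone and adaptive gradient optimizer.
    model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)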
Step five, extracting depth features
In the retrieval process, each three-dimensional model is represented by features. When training the network, the invention uses the double-chain structure to obtain a trained neural network model; the network model parameters trained in step 4 are then used as the parameters for feature extraction, for which a single-chain network model suffices. The input of the network model is the plurality of view images representing a single three-dimensional model; through the feature extraction and aggregation of the network structure provided by the invention, the features of the plurality of view images are aggregated into the features of a single three-dimensional model. The dimension of the features extracted by the method is 512.
Step six, three-dimensional model retrieval
Given a three-dimensional model, the three-dimensional models of the same kind, i.e. the related three-dimensional models, are found in a target dataset. Let the retrieval three-dimensional model set be Q and the dataset to be queried be G; the target is to find, in G, the three-dimensional models related to each three-dimensional model in Q. This is realized by calculating the relevance between a three-dimensional model Q_i and each three-dimensional model in the dataset G, and sorting by relevance to obtain the three-dimensional models related to Q_i. The specific implementation form is shown below.
Both the retrieval three-dimensional model set and the dataset to be queried are represented by feature vectors, which the invention extracts as in step 5. After the feature representation of each three-dimensional model in the retrieval set and the dataset to be queried is obtained, the distance between the three-dimensional model Q_i and each three-dimensional model in the dataset G is calculated in the following form:

L_ij = f(Q_i, G_j)    (8)

where L_ij is the distance between the three-dimensional models Q_i and G_j, and f(Q_i, G_j) is the distance measurement between the two three-dimensional models. The invention uses the Euclidean distance; the calculation process is as follows:

d(x, y) = √( Σ_i (x_i − y_i)² )    (9)

where x and y respectively represent two different three-dimensional models, d(x, y) represents the distance between the two three-dimensional models, and x_i, y_i respectively represent the i-th dimensional features of x and y. After the distances between Q_i and each three-dimensional model in G are calculated, they are sorted, and the first k models can be taken as the three-dimensional models related to Q_i. The results of sequentially calculating, for each model in Q, the related three-dimensional models in G on the ModelNet40 dataset are shown in FIG. 5.
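A minimal sketch of this ranking step follows, using the 512-dimensional depth features from step five.

    import torch

    def retrieve(query_feats, gallery_feats, k=10):
        # query_feats: (Nq, 512) features of Q; gallery_feats: (Ng, 512) features of G.
        d = torch.cdist(query_feats, gallery_feats)   # L_ij = f(Q_i, G_j), Euclidean distance
        return d.argsort(dim=1)[:, :k]                # indices of the k most related models in G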
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.