CN111461166A - Multi-modal feature fusion method based on LSTM network - Google Patents

Multi-modal feature fusion method based on LSTM network

Info

Publication number
CN111461166A
CN111461166A (application CN202010128604.XA)
Authority
CN
China
Prior art keywords
dimensional model
network
features
skeleton
lstm network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010128604.XA
Other languages
Chinese (zh)
Inventor
张静
陈闯
聂为之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010128604.XA priority Critical patent/CN111461166A/en
Publication of CN111461166A publication Critical patent/CN111461166A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention discloses a multi-modal feature fusion method based on the LSTM network, which comprises the steps of: 1) placing a plurality of cameras around a three-dimensional model and obtaining a group of multi-view views representing the three-dimensional model through continuous shooting; 2) obtaining the visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and obtaining the skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network; and 3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model. The multi-modal fusion model is constructed by utilizing the information-persistence property of the LSTM, realizing the fusion of the visual features and the structural features of the model.

Description

Multi-modal feature fusion method based on LSTM network
Technical Field
The invention relates to the field of three-dimensional models, and in particular to a multi-modal information fusion method (MIF) based on the LSTM network[1].
Background
With the increasingly common application of three-dimensional models in daily life, the spread of text information alone cannot meet people's requirements for information acquisition, and two-dimensional images have become a main carrier of information. Compared with images, a three-dimensional model can describe the overall topological structure of an object and has a stronger sense of realism. At present, three-dimensional models are applied ever more widely and occupy important positions in fields such as medicine, entertainment and manufacturing, so research on three-dimensional model retrieval methods[2][3][4] has become particularly important. Unlike the representation of two-dimensional images, however, the representation of three-dimensional models is more susceptible to occlusion and illumination, and three-dimensional model retrieval remains a significant challenge. Manually designed features are a common approach to feature representation, for example HOG[5], which reflects the texture distribution of the three-dimensional model, and SIFT[6] features; but manual methods mainly focus on representing local features and cannot describe the global information of the three-dimensional model. In view of this deficiency, view-based and deep learning methods have been introduced into the field of three-dimensional model retrieval. For example, the MVCNN framework[7] characterizes a three-dimensional model with two-dimensional views acquired from multiple perspectives and proposes a multi-view CNN to obtain feature descriptors of the three-dimensional model, and the RotationNet proposed by Asako Kanezaki[8] also applies a multi-view neural network to the three-dimensional model retrieval problem.
The above methods focus on the multi-angle visual information of a three-dimensional model and therefore achieve remarkable results. They share a drawback, however: each focuses only on the information in individual two-dimensional views and ignores the correlation between views. Although this resembles how humans perceive a three-dimensional model visually, human perception in fact involves not only the acquisition of visual information but also its fusion, for example the continuity of visual information and the fusion of information from different sources. Standing on the perspective of human perception, a feature representation method that contains both visual information and structural information is therefore proposed, so as to better characterize a three-dimensional model.
Disclosure of Invention
The invention provides a multi-modal feature fusion method based on the LSTM network. Two LSTM networks are connected in series in the forward direction to construct a three-dimensional model feature extraction model: the input of the first LSTM network is the multi-view views of the three-dimensional model; the input of the second LSTM network is the output of the first LSTM network together with the skeleton features extracted from the multi-view views; and the output of the second LSTM network is the feature obtained after fusing the visual information and the skeleton information of the three-dimensional model. The invention utilizes the state-memory property of the LSTM network to enhance the correlation between two-dimensional views and can better characterize the three-dimensional model. The method is described in detail below:
a multi-modal feature fusion method based on the LSTM network, the method comprising:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein, the step 2) is specifically: the extraction of the three-dimensional model multi-view feature vectors is realized by loading a VGG-Net16 model: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096; the feature vector set F is input into the first LSTM network, which outputs the visual features h_t^1 of the three-dimensional model;
the three-dimensional model skeleton information contained in the multi-view views is located based on the DeepSkeleton network to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12};
wherein, the step 3) takes the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} as the input of the second LSTM network, and takes the output z_t of the second LSTM network as the fused feature; finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*.
The step 1) is specifically as follows:
adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}.
The technical scheme provided by the invention has the beneficial effects that:
1. the invention can represent the global characteristics of the three-dimensional model based on the multi-view characteristics of the three-dimensional model extracted by the neural network;
2. the invention constructs the multi-modal fusion model by utilizing the information-persistence property of the LSTM, and realizes the fusion of the visual features and the structural features of the model.
Drawings
FIG. 1 is a framework diagram of the multi-modal feature fusion method based on the LSTM network;
FIG. 2 is a schematic diagram of a multi-view acquisition of a three-dimensional model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
With the development of three-dimensional model reconstruction tools in recent years, the number of three-dimensional models has kept increasing. Thanks to sustained research in the field of three-dimensional model retrieval, various three-dimensional model descriptors have been proposed. In general, they fall into two types: view-based descriptors[11], which characterize a three-dimensional model using its multi-perspective views, and model-based descriptors[12], which characterize the three-dimensional model using its shape features or topological structure features.
Model-based shape descriptors are further divided into low-level and high-level descriptors. Low-level descriptors mainly include the surface similarity of the model[12], geometric moments[13], voxel distributions[14] and the like. High-level descriptors mainly include spherical harmonic moments[15], polygonal meshes (e.g. triangular meshes)[16] and Reeb graphs[17], and mainly describe the internal structural relationships between different components of the three-dimensional model. Although model-based descriptors represent the structural features of the three-dimensional model, such feature extraction methods suffer from high complexity and low processing speed because the three-dimensional model needs to be reconstructed; these problems are especially prominent when the structure of the three-dimensional model is relatively complex.
View-based descriptors are widely used in the current three-dimensional model retrieval field. This kind of method converts operations on the three-dimensional model into operations on its multi-view views and makes up for the inability of manually designed features to describe the global information of the three-dimensional model; its drawback is that it focuses only on individual two-dimensional view information and ignores the correlation between the two-dimensional views.
Example 1
In order to realize accurate retrieval of a three-dimensional model, the embodiment of the invention provides a multi-modal feature fusion method based on the LSTM network, which is described in detail below with reference to fig. 1:
101: placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
102: acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
103: inputting the acquired visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model.
The specific steps of acquiring the multi-view in step 101 are as follows:
1) adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
2) acquiring a group of 12 multi-perspective views characterizing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}. A multi-view acquisition diagram is shown in fig. 2, and a camera-pose computation sketch is given below.
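The camera placement described in step 101 can be reproduced with a short pose computation. The following is a minimal sketch, assuming a NumPy-only setup; the viewing distance, image resolution and choice of renderer are not specified by the invention and are illustrative here.

```python
# Minimal sketch: compute the 12 virtual-camera poses (every 30 degrees of azimuth,
# each looking down at the model centroid with a 30-degree depression angle).
# The distance value is an assumption, not taken from the patent.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world pose whose -z axis points from eye to target."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = true_up
    pose[:3, 2] = -forward      # OpenGL-style camera looks along -z
    pose[:3, 3] = eye
    return pose

def camera_poses(centroid, distance=2.0, n_views=12, depression_deg=30.0):
    """Return n_views camera-to-world matrices circling the centroid."""
    poses = []
    elev = np.deg2rad(depression_deg)
    for k in range(n_views):
        azim = 2.0 * np.pi * k / n_views
        eye = centroid + distance * np.array([np.cos(azim) * np.cos(elev),
                                              np.sin(azim) * np.cos(elev),
                                              np.sin(elev)])
        poses.append(look_at(eye, centroid))
    return poses

# Example: poses for a model centred at the origin; feed each pose to any
# offscreen renderer (pyrender, Open3D, Blender, ...) to obtain views v1..v12.
poses = camera_poses(np.array([0.0, 0.0, 0.0]))
print(len(poses), poses[0].shape)   # 12 (4, 4)
```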
The specific steps of extracting the visual features and skeleton features in step 102 are as follows:
1) A pre-trained VGG-Net16 model is loaded (its feature extraction function maps each view to a feature vector) to extract the feature vectors of the three-dimensional model multi-view views: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096 and R denotes the real numbers. The feature vector set F is input into the first LSTM network, whose output is the visual feature h_t^1 of the three-dimensional model.
2) The three-dimensional model skeleton information contained in the multi-view views is located by the DeepSkeleton network proposed by Wei Shen et al.[10] to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12}. A per-view feature extraction sketch is given below.
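For concreteness, a minimal sketch of the per-view feature extraction is given below, assuming PyTorch/torchvision. The visual branch takes the 4096-dimensional activation of VGG-16's penultimate fully connected layer; because the published DeepSkeleton weights are not reproduced here, a plain morphological skeleton from scikit-image stands in for the skeleton branch, and the 64-dimensional grid descriptor is purely illustrative.

```python
# Minimal sketch of the two per-view feature streams (visual f_i, skeleton g_i).
# The skeleton branch below is a stand-in for the DeepSkeleton network, whose
# trained weights are not bundled with this sketch.
import numpy as np
import torch
import torchvision
from torchvision import transforms
from skimage.morphology import skeletonize

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d output

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_feature(pil_view):
    """f_i in R^4096 for one rendered view (assumed to be an RGB PIL image)."""
    with torch.no_grad():
        return vgg(preprocess(pil_view).unsqueeze(0)).squeeze(0)   # shape (4096,)

def skeleton_feature(gray_view):
    """Illustrative 64-d skeleton descriptor g_i: an 8x8 grid histogram of skeleton pixels
    (assumes a grayscale render on a near-white background)."""
    mask = np.asarray(gray_view) < 250
    skel = skeletonize(mask)
    h, w = (d - d % 8 for d in skel.shape)              # crop to a multiple of 8
    grid = skel[:h, :w].reshape(8, h // 8, 8, w // 8).sum(axis=(1, 3)).astype(np.float32)
    return torch.from_numpy(grid.flatten() / (grid.sum() + 1e-6))
```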
The specific step of fusing the visual features and the skeleton features in step 103 is as follows: the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} are taken as the input of the second LSTM network, and the output z_t of the second LSTM network is taken as the fused feature; finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*. A sketch of this two-LSTM wiring follows.
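A minimal PyTorch sketch of the two forward-cascaded LSTM networks and the final max pooling is given below; the hidden size of 512 and the skeleton-feature dimension are illustrative choices, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Two forward-cascaded LSTMs: view features -> visual states -> fusion with skeleton features."""
    def __init__(self, view_dim=4096, skel_dim=64, hidden=512):
        super().__init__()
        self.visual_lstm = nn.LSTM(view_dim, hidden, batch_first=True)           # first LSTM
        self.fusion_lstm = nn.LSTM(hidden + skel_dim, hidden, batch_first=True)  # second LSTM

    def forward(self, view_feats, skel_feats):
        # view_feats: (B, 12, 4096) VGG features F; skel_feats: (B, 12, skel_dim) features G
        h1, _ = self.visual_lstm(view_feats)             # (B, 12, hidden): visual states h_t^1
        fused_in = torch.cat([h1, skel_feats], dim=-1)   # per-step input of the second LSTM
        z, _ = self.fusion_lstm(fused_in)                # (B, 12, hidden): fused states z_t
        z_star, _ = z.max(dim=1)                         # max pool over the 12 views -> z*
        return z_star

# Usage on dummy tensors for two models:
model = MultiModalFusion()
z_star = model(torch.randn(2, 12, 4096), torch.randn(2, 12, 64))
print(z_star.shape)   # torch.Size([2, 512])
```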
In summary, in the embodiment of the present invention, the view information and the structure information of the three-dimensional model are fused in the steps 101 to 103, so that the description of the three-dimensional model is more comprehensive, and the accuracy of the retrieval can be improved by applying the method to the three-dimensional model retrieval.
Example 2
The scheme of Example 1 is described in detail below with reference to the drawings and calculation formulas:
Firstly, cameras are placed uniformly around the three-dimensional model at a fixed distance, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees, and a group of 12 multi-perspective views representing the three-dimensional model is acquired by continuous shooting: V = {v_1, v_2, ..., v_12}. In order to fuse the view information and the structure information of the three-dimensional model, the method constructs a model composed of two LSTM networks connected in forward series.
In the LSTM network, the dependency relationship between multi-view features is realized by updating the cell state. The LSTM model protects and controls information by means of three gates, which by function are divided into an input gate, a forgetting gate and an output gate: the input gate receives the information input into the LSTM network, the forgetting gate determines how much of the cell state at the previous time is kept at the current time, and the output gate generates the final fusion information. In the constructed model of two LSTM networks combined in forward series, the input and output of the model are divided into 12 cell states because the input consists of 12 two-dimensional view features.
Wherein, the gate function of the forgetting gate is as follows:
m_t = σ(W_m · [h_{t-1}^1, x_t] + b_m)
where m_t is the screening parameter for the previous unit state information, σ denotes the sigmoid function, h_{t-1}^1 denotes the output of the previous state of the first LSTM network, x_t denotes the input of the current state, W_m denotes the weight matrix of the forgetting gate, and b_m denotes the bias parameter of the forgetting gate.
The gate functions of the input gate are as follows:
i_t = σ(W_i · [h_{t-1}^1, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}^1, x_t] + b_c)
where c̃_t denotes the candidate cell state information to be updated, i_t denotes the screening parameter for the unit state information to be updated, W_i and W_c are weight matrices, and b_c and b_i are bias parameters.
The gate function of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}^1, x_t] + b_o)
where o_t denotes the screening parameter for the current cell state information, and W_o and b_o denote the weight matrix and bias parameter, respectively.
The output of the current state is as follows:
c_t = m_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
f_t^2 = [h_t^1, g_k]
where c_{t-1} and c_t denote the previous and current cell state information respectively, h_t denotes the output of the current cell state, f_t^2 denotes the current input of the second LSTM network, and g_k denotes the extracted skeleton feature information.
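Written out as code, the gate updates above correspond to one step of a standard LSTM cell whose input x_t is, for the second network, the concatenation [h_t^1, g_k]. A minimal NumPy sketch follows; the weight shapes and random initialization are illustrative, not prescribed by the invention.

```python
# One step of the LSTM cell described by the formulas above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W and b hold the gate parameters keyed by 'm' (forget), 'i' (input), 'c' (candidate), 'o' (output)."""
    z = np.concatenate([h_prev, x_t])
    m_t = sigmoid(W["m"] @ z + b["m"])       # forgetting gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = m_t * c_prev + i_t * c_hat         # cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # output of the current state
    return h_t, c_t

# Usage with illustrative sizes and random parameters:
hidden, inp = 8, 12
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, hidden + inp)) * 0.1 for k in "mico"}
b = {k: np.zeros(hidden) for k in "mico"}
h, c = lstm_step(rng.standard_normal(inp), np.zeros(hidden), np.zeros(hidden), W, b)
```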
In order to acquire the visual features of the three-dimensional model, the method loads a pre-trained VGG-Net16, whose feature extraction function realizes the extraction of the two-dimensional image feature vectors; the 4096-dimensional vector output by the penultimate fully connected layer of VGG-Net16 is taken as the picture feature: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096. The feature set F is used as the input of the first LSTM network, and the output h_t^1 of the first LSTM network is the visual feature of the three-dimensional model.
In order to fuse the skeleton features of the three-dimensional model, the output h_t^1 of the first-layer LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} are used as the input of the second-layer LSTM network, and the output z_t of the second LSTM network is taken as the updated feature. After 12 such operations, the method obtains the final fusion feature z* by means of the maximum pooling layer.
In summary, the embodiment of the invention integrates the structural features and the visual features of the three-dimensional model to better represent the three-dimensional model, improves the accuracy of the three-dimensional model retrieval based on the integrated features, reduces the calculated amount, and improves the retrieval efficiency.
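Once the fused descriptor z* is available, retrieval reduces to nearest-neighbour ranking of gallery descriptors against the query descriptor. A minimal sketch using cosine similarity is shown below; the function name and dimensions are illustrative and not fixed by the invention.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_z, gallery_z):
    """Return gallery indices sorted by cosine similarity to the query descriptor z*."""
    q = F.normalize(query_z.unsqueeze(0), dim=-1)   # (1, D)
    g = F.normalize(gallery_z, dim=-1)              # (N, D)
    sims = (g @ q.t()).squeeze(1)                   # (N,) cosine similarities
    return torch.argsort(sims, descending=True)

# Usage on dummy descriptors: rank 100 gallery models against one query.
order = rank_gallery(torch.randn(512), torch.randn(100, 512))
```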
Example 3
The following experiments are presented to verify the feasibility of the schemes in Examples 1 and 2, and are described in detail below:
Experimental validation of the embodiments of the invention was carried out on the ModelNet40 database[18] and the NTU database[19]. ModelNet40 is a subset of the ModelNet dataset and contains 12311 CAD models in 40 classes; the NTU database, constructed by National Taiwan University, contains 549 three-dimensional models in 46 classes.
In order to evaluate the application of the method proposed by the embodiment of the invention in the field of three-dimensional model retrieval, evaluation criteria widely used in information retrieval are selected: the P-R curve, NN, FT, ST, F-measure, DCG, ANMRR and MAP are used as evaluation parameters for retrieval performance.
NN: and in the retrieval result, the accuracy of the most similar model retrieved.
FT and ST: FT represents the recall rate when N-P-1, and ST represents the recall rate when K-2 (P-1).
F-Measure: F-Measure is a weighted harmonic mean of precision and recall defined as:
Figure RE-GDA0002482995370000065
PR is an index used to characterize the relationship between accuracy and recall.
MAP is used to represent the average accuracy of the retrieval.
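For reference, the following is a minimal sketch of how the NN, FT, ST, F-measure and average-precision values can be computed for a single query from the ranked gallery labels; the F-measure cut-off of 20 is a common choice, not a value fixed by the text, and the variable names are illustrative.

```python
import numpy as np

def retrieval_metrics(ranked_labels, query_label, n_relevant, f_cutoff=20):
    """Single-query metrics from a ranked list of gallery class labels.
    n_relevant: number of gallery models in the query's class;
    f_cutoff: list length at which F-measure is evaluated (assumed, not from the patent)."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    nn = rel[0]                                           # nearest-neighbour hit
    ft = rel[:n_relevant].sum() / n_relevant              # first-tier recall
    st = rel[:2 * n_relevant].sum() / n_relevant          # second-tier recall
    hits_k = rel[:f_cutoff].sum()
    p, r = hits_k / f_cutoff, hits_k / n_relevant
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0         # F = 2PR / (P + R)
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    ap = (precisions * rel).sum() / max(rel.sum(), 1.0)   # average precision; its mean over queries is MAP
    return dict(NN=nn, FT=ft, ST=st, F=f, AP=ap)
```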
Based on the above evaluation parameters, the performance of the proposed method is compared with CCFV[20], AVC[21], Liu[22], MCG[23] and DLAN[24] on the NTU database; analysis of the experimental results leads to the following conclusions:
both CCFV and AVC can be regarded as retrieval methods based on mathematical statistical models, and they use gaussian function models and bayesian models to describe feature distributions of three-dimensional models, but there are some differences between the two methods, AVC treats each view as an independent individual, and CCFV mainly uses gaussian models, which take into account view variations of feature spaces in data distribution. The two methods only consider the visual attributes of the two-dimensional images and ignore the structural information of the 3D model, so the final retrieval result is poorer than the effect of the MIF method.
In the Liu and MCG methods, the query model is composed of multi-view views and the associations between views; similarity is measured by constructing a graph structure of the three-dimensional model, which then drives the retrieval. However, the skeleton information of the three-dimensional model is ignored, so the retrieval results are less effective than those of the MIF method.
A CNN (convolutional neural network) is applied in the NN method, but the network model used is relatively simple, so its performance is poor compared with the MIF method.
The DLAN method proposes a new deep neural network for 3D model retrieval (3DMR), namely a deep local feature aggregation network. The features generated by this network are rotation invariant, but the method focuses only on local features of the model and ignores the global skeleton information of the three-dimensional model, so it cannot characterize the three-dimensional model well.
In addition, the MIF method is compared with current state-of-the-art methods (PANORAMA-NN[25], GIFT[26], 3D ShapeNets[27], MVCNN[7], LPD[28], SPH[29]) on the ModelNet database.
From the above experiments it can be seen that, compared with the comparison methods, the MIF method obtains improvements of 3-31%, 7.2-20.1%, 2.2-16.4% and 1.3-15.8% on the NN, FT, ST and F-measure indexes respectively, and improvements of 0.04-0.48 (ModelNet10) and 0.03-0.54 (ModelNet40) on MAP.
References
[1] Shi X, Chen Z, Wang H, et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting[J]. 2015.
[2] Liu A, Nie W, Gao Y, et al. View-Based 3-D Model Retrieval: A Benchmark[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wei-Zhi N, An-An L, Yue G, et al. Hyper-Clique Graph Matching and Applications[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018: 1-1.
[4] Zhu L, Shen J, Xie L, et al. Unsupervised Visual Hashing with Semantic Assistant for Content-Based Image Retrieval[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(2): 472-486.
[5] Dalal N, Triggs B. Histograms of Oriented Gradients for Human Detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE, 2005.
[6] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[7] Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition[J]. 2015.
[8] Kanezaki A, Matsushita Y, Nishida Y. RotationNet: Joint Learning of Object Classification and Viewpoint Estimation using Unaligned 3D Object Dataset[J]. 2016.
[9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[10] Shen W, Zhao K, Jiang Y, et al. DeepSkeleton: Learning Multi-task Scale-associated Deep Side Outputs for Object Skeleton Extraction in Natural Images[J]. IEEE Transactions on Image Processing, 2017: 1-1.
[11] Wang D, Wang B, Zhao S, et al. View-based 3D object retrieval with discriminative views[J]. Neurocomputing, 2017, 252: 58-66.
[12] Feature-based similarity search in 3D object databases[J]. ACM Computing Surveys, 2005, 37(4): 345-387.
[13] Paquet E, Rioux M, Murching A, et al. Description of shape information for 2-D and 3-D objects[J]. Signal Processing: Image Communication, 2000, 16(1-2): 103-122.
[14] Papoiu A D P, Emerson N M, Patel T S, et al. Voxel-based morphometry and arterial spin labeling fMRI reveal neuropathic and neuroplastic features of brain processing of itch in end-stage renal disease[J]. Journal of Neurophysiology, 2014, 112(7): 1729-1738.
[15] Liu Q. A survey of recent view-based 3d model retrieval methods[J]. arXiv preprint arXiv:1208.3670, 2012.
[16] Tangelder J W H, Veltkamp R C. Polyhedral model retrieval using weighted point sets[J]. International Journal of Image and Graphics, 2003, 3(01): 209-229.
[17] Shinagawa Y, Kunii T L. Constructing a Reeb graph automatically from cross sections[J]. IEEE Computer Graphics and Applications, 1991(6): 44-51.
[18] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[19] Chen D Y, Tian X P, Shen Y T, et al. On Visual Similarity Based 3D Model Retrieval[J]. Computer Graphics Forum, 2003, 22(3): 223-232.
[20] Gao Y, Tang J, Hong R, et al. Camera constraint-free view-based 3-D object retrieval[J]. IEEE Transactions on Image Processing, 2011, 21(4): 2269-2281.
[21] Ansary T F, Daoudi M, Vandeborre J P. A bayesian 3-d search engine using adaptive views clustering[J]. IEEE Transactions on Multimedia, 2006, 9(1): 78-88.
[22] Liu A, Wang Z, Nie W, et al. Graph-based characteristic view set extraction and matching for 3D model retrieval[J]. Information Sciences, 2015, 320: 429-442.
[23] Liu A A, Nie W Z, Gao Y, et al. Multi-modal clique-graph matching for view-based 3d model retrieval[J]. IEEE Transactions on Image Processing, 2016, 25(5): 2103-2116.
[24] Furuya T, Ohbuchi R. Deep Aggregation of Local 3D Geometric Features for 3D Model Retrieval[C]//BMVC. 2016, 7: 8.
[25] Sfikas K, Theoharis T, Pratikakis I. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval[J]. 3DOR, 2017, 6: 7.
[26] Bai S, Bai X, Zhou Z, et al. Gift: Towards scalable 3d shape retrieval[J]. IEEE Transactions on Multimedia, 2017, 19(6): 1257-1271.
[27] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[28] Chen D Y, Tian X P, Shen Y T, et al. On visual similarity based 3D model retrieval[C]//Computer Graphics Forum. Oxford, UK: Blackwell Publishing, Inc, 2003, 22(3): 223-232.
[29] Kazhdan M, Funkhouser T, Rusinkiewicz S. Rotation invariant spherical harmonic representation of 3D shape descriptors[C]//Symposium on Geometry Processing. 2003, 6: 156-164.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A multi-modal feature fusion method based on the LSTM network, characterized in that the method comprises:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein, the step 2) is specifically: the extraction of the three-dimensional model multi-view feature vectors is realized by loading a VGG-Net16 model: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096; the feature vector set F is input into the first LSTM network, which outputs the visual features h_t^1 of the three-dimensional model;
the three-dimensional model skeleton information contained in the multi-view views is located based on a DeepSkeleton network to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12};
wherein, the step 3) takes the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} as the input of the second LSTM network, and takes the output z_t of the second LSTM network as the fused feature;
and finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*.
2. The multi-modal feature fusion method based on the LSTM network according to claim 1, wherein the step 1) is specifically as follows:
adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}.
CN202010128604.XA 2020-02-28 2020-02-28 Multi-modal feature fusion method based on LSTM network Pending CN111461166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010128604.XA CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010128604.XA CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Publications (1)

Publication Number Publication Date
CN111461166A true CN111461166A (en) 2020-07-28

Family

ID=71682464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010128604.XA Pending CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Country Status (1)

Country Link
CN (1) CN111461166A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154229A1 (en) * 2013-11-29 2015-06-04 Canon Kabushiki Kaisha Scalable attribute-driven image retrieval and re-ranking
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110163091A (en) * 2019-04-13 2019-08-23 天津大学 Method for searching three-dimension model based on LSTM network multimodal information fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154229A1 (en) * 2013-11-29 2015-06-04 Canon Kabushiki Kaisha Scalable attribute-driven image retrieval and re-ranking
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110163091A (en) * 2019-04-13 2019-08-23 天津大学 Method for searching three-dimension model based on LSTM network multimodal information fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou, H. Y., et al.: "Dual-level Embedding Alignment Network for 2D Image-based 3D Object Retrieval", Proceedings of the 27th ACM International Conference on Multimedia, pages 1667-1675 *
裴晓敏; 范慧杰; 唐延东: "Human action recognition method based on a spatio-temporal feature fusion deep learning network", Infrared and Laser Engineering (红外与激光工程), no. 002, pages 55-60 *

Similar Documents

Publication Publication Date Title
Qi et al. Review of multi-view 3D object recognition methods based on deep learning
Georgiou et al. A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision
Cong et al. Going from RGB to RGBD saliency: A depth-guided transformation model
Zhang et al. End-to-end photo-sketch generation via fully convolutional representation learning
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
Li et al. A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries
Bashir et al. Vr-proud: Vehicle re-identification using progressive unsupervised deep architecture
Trigeorgis et al. Face normals" in-the-wild" using fully convolutional networks
Lu et al. Learning view-model joint relevance for 3D object retrieval
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
Li et al. Sketch-based 3D model retrieval utilizing adaptive view clustering and semantic information
Hu et al. RGB-D semantic segmentation: a review
Mosella-Montoro et al. 2d–3d geometric fusion network using multi-neighbourhood graph convolution for rgb-d indoor scene classification
Peng et al. Evaluation of segmentation quality via adaptive composition of reference segmentations
Bazazian et al. DCG-net: Dynamic capsule graph convolutional network for point clouds
Liu et al. 3D model retrieval based on multi-view attentional convolutional neural network
Liang et al. MVCLN: multi-view convolutional LSTM network for cross-media 3D shape recognition
Lu et al. Memory efficient large-scale image-based localization
Li et al. Multi-view-based siamese convolutional neural network for 3D object retrieval
Cinaroglu et al. Long-term image-based vehicle localization improved with learnt semantic descriptors
Nie et al. The assessment of 3D model representation for retrieval with CNN-RNN networks
CN114120095A (en) Mobile robot autonomous positioning system and method based on aerial three-dimensional model
Richard et al. KAPLAN: A 3D point descriptor for shape completion
Benhabiles et al. Convolutional neural network for pottery retrieval
CN111461166A (en) Multi-modal feature fusion method based on LSTM network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200728