CN111461166A - Multi-modal feature fusion method based on LSTM network - Google Patents
- Publication number: CN111461166A
- Application number: CN202010128604.XA
- Authority
- CN
- China
- Prior art keywords
- dimensional model
- network
- features
- skeleton
- LSTM network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a multi-modal feature fusion method based on an LSTM network, which comprises the following steps: 1) placing a plurality of cameras around a three-dimensional model and obtaining a group of multi-view views representing the three-dimensional model through continuous shooting; 2) obtaining the visual features of the three-dimensional model contained in the multiple views based on a first LSTM network, and obtaining the skeleton features of the three-dimensional model contained in the multiple views based on a DeepSkeleton network; and 3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model. The multi-modal fusion model is constructed by utilizing the information-persistence characteristic of the LSTM, realizing the fusion of the visual features and the structural features of the model.
Description
Technical Field
The invention relates to the field of three-dimensional models, in particular to a multi-modal feature fusion method (MIF) based on an LSTM network[1].
Background
With the increasingly common application of three-dimensional models in daily life, plain text can no longer satisfy people's demand for information acquisition, and two-dimensional images have become a main carrier of information. Compared with images, a three-dimensional model can describe the overall topological structure of an object and has a stronger sense of realism. At present, three-dimensional models are applied more and more widely and occupy an important position in fields such as medicine, entertainment and manufacturing, so research on three-dimensional model retrieval methods[2][3][4] has become particularly important. Unlike the representation of two-dimensional images, however, the representation of three-dimensional models is more susceptible to occlusion and lighting, and three-dimensional model retrieval remains a significant challenge. Manually designed features are a common approach to feature representation, for example the HOG features reflecting texture distribution[5] and SIFT features[6]; but such handcrafted features mainly capture local information and cannot describe the global information of a three-dimensional model. In view of this deficiency, view-based and deep-learning methods have been introduced into the field of three-dimensional model retrieval. For example, the MVCNN framework[7] characterizes a three-dimensional model by two-dimensional views acquired from multiple perspectives and proposes a multi-view CNN to obtain feature descriptors of the three-dimensional model. The rotation network proposed by Asako Kanezaki[8] likewise applies a multi-view neural network to the three-dimensional model retrieval problem.
The above methods focus on the multi-angle visual information of a three-dimensional model and have therefore achieved remarkable results. However, they have a drawback: they attend only to single two-dimensional view information and ignore the correlation between two-dimensional views. Although this resembles the visual aspect of human perception of a three-dimensional model, human perception in fact involves not only the acquisition of visual information but also its fusion, for example the continuity of visual information and the fusion of information from different sources. Since these methods ignore such issues, the invention, standing on the perspective of human perception, proposes a feature representation method that includes both visual information and structural information so as to better characterize a three-dimensional model.
Disclosure of Invention
The invention provides a multi-modal feature fusion method based on an LSTM network, in which two LSTM networks are connected in series in the forward direction to construct a three-dimensional model feature extraction model. The input of the first LSTM network is the multi-view views of the three-dimensional model; the input of the second LSTM network is the output of the first LSTM network together with the skeleton features extracted from the multiple views; and the output of the second LSTM network is the feature obtained after fusing the visual information and the skeleton information of the three-dimensional model. The invention utilizes the state-memory characteristic of the LSTM network to enhance the correlation between the two-dimensional views, so that the three-dimensional model can be better characterized. The method is described in detail below:
A multi-modal feature fusion method based on an LSTM network, the method comprising:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring the visual features of the three-dimensional model contained in the multiple views based on a first LSTM network, and acquiring the skeleton features of the three-dimensional model contained in the multiple views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein step 2) is specifically: the feature vectors of the multi-view views of the three-dimensional model are extracted by loading a VGG-Net16 model, F = {f1, f2, f3, ..., f12}, where fi ∈ R^4096; the feature vector set F is input into the first LSTM network, whose output h_t^1 is the visual feature of the three-dimensional model; the skeleton information of the three-dimensional model contained in the multiple views is located by the DeepSkeleton network to judge whether each pixel is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g1, g2, ..., g12};
wherein step 3) is specifically: the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g1, g2, ..., g12} are taken as the input of a second LSTM network, and the output h_t^2 of the second LSTM network is taken as the fused feature; finally, the output results of the 12 pictures are processed by a max-pooling layer to obtain the final fused feature z*.
Step 1) is specifically:
adjusting the rotation direction and rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v1, v2, ..., v12}.
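As an illustrative sketch (assuming evenly spaced azimuths, which the patent does not state explicitly), the 12 virtual camera positions described above can be computed as follows:

```python
import numpy as np

def camera_positions(centroid, distance, n_views=12, depression_deg=30.0):
    """Place n_views virtual cameras on a circle around the model centroid,
    each at the given distance, looking down at a fixed depression angle."""
    elev = np.radians(depression_deg)
    positions = []
    for k in range(n_views):
        azim = 2.0 * np.pi * k / n_views
        # horizontal radius and height follow from the depression angle
        offset = distance * np.array([
            np.cos(elev) * np.cos(azim),
            np.cos(elev) * np.sin(azim),
            np.sin(elev),
        ])
        positions.append(np.asarray(centroid) + offset)
    return np.stack(positions)

cams = camera_positions(centroid=[0.0, 0.0, 0.0], distance=2.0)
print(cams.shape)  # (12, 3)
```

Each camera then renders one view v_k of the model, giving the set V = {v1, ..., v12}.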
The technical scheme provided by the invention has the beneficial effects that:
1. the invention can represent the global characteristics of the three-dimensional model based on the multi-view characteristics of the three-dimensional model extracted by the neural network;
2. the invention constructs the multi-modal fusion model by utilizing the information-persistence characteristic of the LSTM, and realizes the fusion of the visual features and structural features of the model.
Drawings
FIG. 1 is a framework diagram of the multi-modal feature fusion method based on an LSTM network;
FIG. 2 is a schematic diagram of a multi-view acquisition of a three-dimensional model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
With the development of three-dimensional model reconstruction tools in recent years, the number of three-dimensional models has kept increasing. Thanks to the diligent work of researchers in the field of three-dimensional model retrieval, various three-dimensional model descriptors have been proposed. In general, they fall into two types: view-based descriptors[11], which characterize a three-dimensional model using its multi-perspective views, and model-based descriptors[12], which characterize a three-dimensional model using its shape features or topological structure features.
Model-based shape descriptors are further divided into low-level and high-level descriptors. Low-level descriptors mainly include the surface similarity of the model[12], geometric moments[13], voxel distributions[14], and the like. High-level descriptors mainly include spherical harmonic moments[15], polygonal meshes (e.g. triangular meshes)[16] and Reeb graphs[17]; they mainly describe the internal structural relationships among the different components of a three-dimensional model. Although model-based descriptors represent the structural features of a three-dimensional model, model-based feature extraction requires reconstructing the three-dimensional model, so it suffers from high extraction complexity and low processing speed; these problems become more prominent when the structure of the three-dimensional model is relatively complex.
View-based descriptors are widely applied in the current field of three-dimensional model retrieval. This approach converts operations on the three-dimensional model into operations on its multi-view views and makes up for the inability of manually designed features to describe the global information of the three-dimensional model. Its drawback, however, is that it attends only to single two-dimensional view information and ignores the correlation between the two-dimensional views.
Example 1
In order to realize accurate retrieval of a three-dimensional model, the embodiment of the invention provides a multi-modal feature fusion method based on an LSTM network, which is described in detail below with reference to fig. 1:
101: placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
102: acquiring the visual features of the three-dimensional model contained in the multiple views based on a first LSTM network, and acquiring the skeleton features of the three-dimensional model contained in the multiple views based on a DeepSkeleton network;
103: inputting the acquired visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model.
The specific steps of acquiring the multi-view in step 101 are as follows:
1) adjusting the rotation direction and rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
2) acquiring a group of 12 multi-perspective views characterizing the three-dimensional model by continuous shooting: V = {v1, v2, ..., v12}. A schematic diagram of multi-view acquisition is shown in fig. 2.
The specific steps of extracting the visual features and skeleton features in step 102 are as follows:
1) By loading a pre-trained VGG-Net16 model (whose feature extraction function we denote φ), the feature vectors of the multi-view views of the three-dimensional model are extracted: F = {f1, f2, f3, ..., f12}, where fi ∈ R^4096 and R denotes the real numbers. The feature vector set F is input into the first LSTM network, whose output is the visual feature h_t^1 of the three-dimensional model.
2) The DeepSkeleton network proposed by Wei Shen et al.[10] is used to locate the skeleton information of the three-dimensional model contained in the multiple views, judging whether each pixel is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g1, g2, ..., g12}.
In step 103, the concrete steps of fusing the visual features and skeleton features are: the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g1, g2, ..., g12} are taken as the input of the second LSTM network, and the output h_t^2 of the second LSTM network is taken as the fused feature; finally, the output results of the 12 pictures are processed by a max-pooling layer to obtain the final fused feature z*.
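For illustration only (not part of the patent text), the data flow of steps 101-103 can be sketched with random stand-in features and a plain recurrence in place of the trained LSTM cells and the VGG/DeepSkeleton extractors; the dimensions other than the 4096-d VGG features are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the 12 per-view features (VGG-Net16: 4096-d; skeleton: 256-d assumed)
F = rng.normal(size=(12, 4096))   # visual feature vectors f1..f12
G = rng.normal(size=(12, 256))    # skeleton feature vectors g1..g12

def recurrent_encode(inputs, hidden_dim, rng):
    """Toy recurrent pass standing in for an LSTM: h_t = tanh(W x_t + U h_{t-1})."""
    W = rng.normal(scale=0.01, size=(hidden_dim, inputs.shape[1]))
    U = rng.normal(scale=0.01, size=(hidden_dim, hidden_dim))
    h = np.zeros(hidden_dim)
    outputs = []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        outputs.append(h)
    return np.stack(outputs)          # one hidden state per view

H1 = recurrent_encode(F, 512, rng)                   # first network: visual features
H2 = recurrent_encode(np.hstack([H1, G]), 512, rng)  # second network: fuse with skeleton
z_star = H2.max(axis=0)                              # max pooling over the 12 views
print(z_star.shape)  # (512,)
```

The element-wise max over the 12 per-view outputs plays the role of the max-pooling layer that produces the final fused feature z*.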
In summary, in the embodiment of the present invention, the view information and the structure information of the three-dimensional model are fused in the steps 101 to 103, so that the description of the three-dimensional model is more comprehensive, and the accuracy of the retrieval can be improved by applying the method to the three-dimensional model retrieval.
Example 2
The scheme of Example 1 is described in detail below with reference to the drawings and calculation formulas:
Firstly, cameras are uniformly placed around the three-dimensional model at a fixed distance, with each lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees, and a group of 12 multi-view views representing the three-dimensional model is acquired by continuous shooting: V = {v1, v2, ..., v12}. In order to fuse the view information and the structure information of the three-dimensional model, the method constructs a model composed of two LSTM networks connected in series in the forward direction.
In an LSTM network, the dependency relationship between the multi-view features is realized by updating the cell state. The LSTM model protects and controls information by means of three gates, divided by function into an input gate, a forgetting gate and an output gate: the input gate receives the information fed into the LSTM network, the forgetting gate determines how much of the cell state at the previous time is kept at the current time, and the output gate generates the final fused information. In the constructed model of two forward serially combined LSTM networks, since the input consists of the features of 12 two-dimensional views, the input and output of the model are divided into 12 cell states.
Wherein, the gate function of the forgetting gate is as follows:

m_t = σ(W_m · [h_(t-1), x_t] + b_m)

wherein m_t represents the screening parameter for the state information of the previous unit, σ represents the sigmoid function, h_(t-1) represents the output of the previous state of the first LSTM network, x_t represents the input of the current state, W_m represents the weight matrix of the forgetting gate, and b_m represents the bias parameter of the forgetting gate.

The gate function of the input gate is as follows:

i_t = σ(W_i · [h_(t-1), x_t] + b_i)
c̃_t = tanh(W_c · [h_(t-1), x_t] + b_c)

wherein c̃_t indicates the candidate cell state information to be updated, i_t represents the screening parameter for the unit state information to be updated, W_c and W_i represent weight matrices, and b_c and b_i are both bias parameters.

The gate function of the output gate is as follows:

o_t = σ(W_o · [h_(t-1), x_t] + b_o)

wherein o_t represents the screening parameter for the current cell state information, and W_o and b_o represent the weight matrix and the bias parameter, respectively.

The cell state is updated and the output of the current state is obtained as follows:

c_t = m_t ⊙ c_(t-1) + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

wherein c_(t-1) and c_t represent the previous and current cell state information respectively, h_t represents the output of the current cell state, f_t^2 denotes the current input of the second LSTM network (formed from the output h_t^1 of the first network and the skeleton feature), and g_k represents the extracted skeleton feature information.
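As a check on these gate functions, a minimal single-step LSTM cell can be written directly from them (a sketch with toy dimensions and random weights, not the trained networks of the method):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above:
    forgetting gate m_t, input gate i_t with candidate state, output gate o_t."""
    xh = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    m_t = sigmoid(params["Wm"] @ xh + params["bm"])      # forgetting gate
    i_t = sigmoid(params["Wi"] @ xh + params["bi"])      # input gate
    c_hat = np.tanh(params["Wc"] @ xh + params["bc"])    # candidate cell state
    o_t = sigmoid(params["Wo"] @ xh + params["bo"])      # output gate
    c_t = m_t * c_prev + i_t * c_hat                     # updated cell state
    h_t = o_t * np.tanh(c_t)                             # current output
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 8, 4
params = {k: rng.normal(scale=0.1, size=(d_hid, d_hid + d_in))
          for k in ("Wm", "Wi", "Wc", "Wo")}
params.update({b: np.zeros(d_hid) for b in ("bm", "bi", "bc", "bo")})

h, c = np.zeros(d_hid), np.zeros(d_hid)
for t in range(12):                                      # 12 views -> 12 cell states
    h, c = lstm_step(rng.normal(size=d_in), h, c, params)
print(h.shape)  # (4,)
```

Because h_t = o_t ⊙ tanh(c_t) with both factors bounded by 1 in magnitude, every component of the output stays in (-1, 1).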
In order to acquire the visual features of the three-dimensional model, the method loads a pre-trained VGG-Net16 (with feature extraction function φ) to extract the feature vectors of the two-dimensional images, where the 4096-dimensional vector output by the 8th fully connected layer of VGG-Net16 is taken as the picture feature: F = {f1, f2, f3, ..., f12}, where fi ∈ R^4096. The feature set F serves as the input of the first LSTM network, whose output h_t^1 is the visual feature of the three-dimensional model.
In order to incorporate the skeleton features of the three-dimensional model, the output h_t^1 of the first-layer LSTM network and the extracted skeleton features G = {g1, g2, ..., g12} are taken as the input of the second-layer LSTM network, and the output h_t^2 of the second LSTM network is taken as the updated feature. After 12 such operations, the method obtains the final fused feature z* by means of the max-pooling layer.
In summary, the embodiment of the invention integrates the structural features and the visual features of the three-dimensional model to better represent it; retrieval based on the fused features improves accuracy, reduces the amount of calculation, and improves retrieval efficiency.
Example 3
The following experiments are presented to demonstrate the feasibility of the schemes in Examples 1 and 2, as detailed below:
Experimental validation of the embodiment of the invention is based on the ModelNet40 database[18] and the NTU database[19]. ModelNet40 is a subset of the ModelNet dataset and contains a total of 12311 CAD models in 40 classes; the NTU database, constructed by National Taiwan University, contains a total of 549 three-dimensional models in 46 classes.
In order to evaluate the application of the method proposed by the embodiment of the invention in the field of three-dimensional model retrieval, evaluation criteria widely applied in the field of information retrieval are selected as evaluation parameters for the retrieval performance: the P-R curve, NN, FT, ST, F-measure, DCG, ANMRR and MAP.
NN: the accuracy of the most similar model retrieved in the retrieval result.
FT and ST: FT represents the recall when the number of retrieved results is K = P - 1, and ST represents the recall when K = 2(P - 1), where P is the number of models in the query's class.
F-Measure: the weighted harmonic mean of precision and recall, defined as F = 2 · P · R / (P + R), where P is precision and R is recall.
P-R: the P-R curve is an index used to characterize the relationship between precision and recall.
MAP: the mean average precision, used to represent the average accuracy of the retrieval.
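These retrieval metrics can be sketched in a few lines (a simplified illustration over a single ranked result list; `ranked` and the counts are toy values, not the experimental data of the invention):

```python
def precision_recall(ranked_relevant, num_relevant, k):
    """Precision and recall over the top-k of a ranked list of booleans."""
    hits = sum(ranked_relevant[:k])
    return hits / k, hits / num_relevant

def f_measure(p, r):
    """Weighted harmonic mean of precision and recall (equal weights)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def average_precision(ranked_relevant, num_relevant):
    """Average precision for one query; MAP is the mean over all queries."""
    score, hits = 0.0, 0
    for i, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / num_relevant

ranked = [True, False, True, True, False]   # toy relevance judgements
p, r = precision_recall(ranked, num_relevant=3, k=5)
print(f_measure(p, r))  # harmonic mean of 0.6 and 1.0 -> 0.75
print(average_precision(ranked, num_relevant=3))
```

FT and ST then correspond to calling `precision_recall` with k = P - 1 and k = 2(P - 1), respectively.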
Based on the above evaluation parameters, the performance of the method of the invention is compared with the CCFV[20], AVC[21], Liu[22], MCG[23] and DLAN[24] methods on the NTU database, and the following conclusions are obtained by analyzing the experimental results:
both CCFV and AVC can be regarded as retrieval methods based on mathematical statistical models, and they use gaussian function models and bayesian models to describe feature distributions of three-dimensional models, but there are some differences between the two methods, AVC treats each view as an independent individual, and CCFV mainly uses gaussian models, which take into account view variations of feature spaces in data distribution. The two methods only consider the visual attributes of the two-dimensional images and ignore the structural information of the 3D model, so the final retrieval result is poorer than the effect of the MIF method.
In the Liu and MCG methods, the model used for querying is composed of the multi-view views and the associations between views; similarity measurement is performed by constructing a graph structure of the three-dimensional model, and retrieval is then realized on that basis. However, the skeleton information of the three-dimensional model is ignored, so the retrieval results are less effective than those of the MIF method.
A CNN (convolutional neural network) is applied in the NN method, but the network model used is relatively simple, and its performance is therefore poor compared with the MIF method.
The DLAN method proposes a new deep neural network for 3D model retrieval (3DMR) called DLAN, i.e. a deep local feature aggregation network. The features generated by this network are rotation-invariant, but the method focuses only on the local features of the model and ignores the global skeleton information of the three-dimensional model, so it cannot characterize the three-dimensional model well.
In addition, the invention compares the performance of the MIF method with the current newest methods on the ModelNet database (PANORAMA-NN[25], GIFT[26], 3D ShapeNets[27], MVCNN[7], LPD[28], SPH[29]).
From the above experiments, it can be seen that, compared with the comparison methods, the MIF method obtains improvements of 3-31%, 7.2-20.1%, 2.2-16.4% and 1.3-15.8% on the NN, FT, ST and F-measure indexes respectively, and improvements of 0.04-0.48 (ModelNet10) and 0.03-0.54 (ModelNet40) on MAP.
References
[1] Shi X, Chen Z, Wang H, et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting[J]. 2015.
[2] Liu A, Nie W, Gao Y, et al. View-Based 3-D Model Retrieval: A Benchmark[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wei-Zhi N, An-An L, Yue G, et al. Hyper-Clique Graph Matching and Applications[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018: 1-1.
[4] Zhu L, Shen J, Xie L, et al. Unsupervised Visual Hashing with Semantic Assistant for Content-Based Image Retrieval[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(2): 472-486.
[5] Dalal N, Triggs B. Histograms of Oriented Gradients for Human Detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE, 2005.
[6] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[7] Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition[J]. 2015.
[8] Kanezaki A, Matsushita Y, Nishida Y. RotationNet: Joint Learning of Object Classification and Viewpoint Estimation using Unaligned 3D Object Dataset[J]. 2016.
[9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[10] Shen W, Zhao K, Jiang Y, et al. DeepSkeleton: Learning Multi-task Scale-associated Deep Side Outputs for Object Skeleton Extraction in Natural Images[J]. IEEE Transactions on Image Processing, 2017: 1-1.
[11] Wang D, Wang B, Zhao S, et al. View-based 3D object retrieval with discriminative views[J]. Neurocomputing, 2017, 252: 58-66.
[12] Feature-based similarity search in 3D object databases[J]. ACM Computing Surveys, 2005, 37(4): 345-387.
[13] Paquet E, Rioux M, Murching A, et al. Description of shape information for 2-D and 3-D objects[J]. Signal Processing: Image Communication, 2000, 16(1-2): 103-122.
[14] Papoiu A D P, Emerson N M, Patel T S, et al. Voxel-based morphometry and arterial spin labeling fMRI reveal neuropathic and neuroplastic features of brain processing of itch in end-stage renal disease[J]. Journal of Neurophysiology, 2014, 112(7): 1729-1738.
[15] Liu Q. A survey of recent view-based 3d model retrieval methods[J]. arXiv preprint arXiv:1208.3670, 2012.
[16] Tangelder J W H, Veltkamp R C. Polyhedral model retrieval using weighted point sets[J]. International Journal of Image and Graphics, 2003, 3(01): 209-229.
[17] Shinagawa Y, Kunii T L. Constructing a Reeb graph automatically from cross sections[J]. IEEE Computer Graphics and Applications, 1991(6): 44-51.
[18] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[19] Chen D Y, Tian X P, Shen Y T, et al. On Visual Similarity Based 3D Model Retrieval[J]. Computer Graphics Forum, 2003, 22(3): 223-232.
[20] Gao Y, Tang J, Hong R, et al. Camera constraint-free view-based 3-D object retrieval[J]. IEEE Transactions on Image Processing, 2011, 21(4): 2269-2281.
[21] Ansary T F, Daoudi M, Vandeborre J P. A bayesian 3-d search engine using adaptive views clustering[J]. IEEE Transactions on Multimedia, 2006, 9(1): 78-88.
[22] Liu A, Wang Z, Nie W, et al. Graph-based characteristic view set extraction and matching for 3D model retrieval[J]. Information Sciences, 2015, 320: 429-442.
[23] Liu A A, Nie W Z, Gao Y, et al. Multi-modal clique-graph matching for view-based 3d model retrieval[J]. IEEE Transactions on Image Processing, 2016, 25(5): 2103-2116.
[24] Furuya T, Ohbuchi R. Deep Aggregation of Local 3D Geometric Features for 3D Model Retrieval[C]//BMVC. 2016, 7: 8.
[25] Sfikas K, Theoharis T, Pratikakis I. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval[J]. 3DOR, 2017, 6: 7.
[26] Bai S, Bai X, Zhou Z, et al. Gift: Towards scalable 3d shape retrieval[J]. IEEE Transactions on Multimedia, 2017, 19(6): 1257-1271.
[27] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[28] Chen D Y, Tian X P, Shen Y T, et al. On visual similarity based 3D model retrieval[C]//Computer Graphics Forum. Oxford, UK: Blackwell Publishing, Inc, 2003, 22(3): 223-232.
[29] Kazhdan M, Funkhouser T, Rusinkiewicz S. Rotation invariant spherical harmonic representation of 3d shape descriptors[C]//Symposium on Geometry Processing. 2003, 6: 156-164.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (2)
1. A multi-modal feature fusion method based on L STM network, characterized in that the method comprises:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring visual features of a three-dimensional model contained in the multiple views based on a first L STM network, and acquiring skeleton features of the three-dimensional model contained in the multiple views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second L STM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein, the step 2) is specifically as follows: the extraction of the three-dimensional model multi-view feature vector is realized by loading a VGG-Net16 model: f ═ F1,f2,f3,...,f12Therein offi∈R4096Inputting the feature vector set F into the first L STM network, and outputting the visual features of the three-dimensional modelPositioning three-dimensional model skeleton information contained in the multiple views based on a DeepSkeleton network to judge whether pixel points are skeleton pixels, and extracting skeleton characteristics of the three-dimensional model through regression prediction of the skeleton elements: g ═ G1,g2,...,g12};
Wherein, the step 3) is to output the first L STM networkAnd the extracted skeleton feature G ═ { G ═ G1,g2,...,g12As input to a second L STM network, the output of the second L STM network is taken as a fused feature:and finally, processing the output results of the 12 pictures by utilizing the maximum pooling layer to obtain the final fusion characteristic z*。
2. The method for fusing the multi-modal features based on L STM network according to claim 1, wherein the step 1) is specifically as follows:
adjusting the rotation direction and rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v1, v2, ..., v12}.
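The camera placement in claim 2 can be illustrated with a short geometric sketch. The function name, the coordinate convention (z up), and the example distance are my assumptions; the patent fixes only 12 cameras at a fixed distance with a 30-degree depression angle toward the centroid.

```python
import math

def camera_positions(centroid, distance, depression_deg=30.0, n_views=12):
    """Place n_views virtual cameras on a ring around the model centroid.

    Each camera sits at `distance` from the centroid, looking down at it
    with the given depression angle, and the cameras are spaced evenly
    (360 / n_views degrees apart) around the vertical axis.
    """
    cx, cy, cz = centroid
    phi = math.radians(depression_deg)
    height = distance * math.sin(phi)   # elevation above the centroid
    radius = distance * math.cos(phi)   # horizontal ring radius
    positions = []
    for k in range(n_views):
        theta = 2.0 * math.pi * k / n_views  # azimuth of the k-th camera
        positions.append((cx + radius * math.cos(theta),
                          cy + radius * math.sin(theta),
                          cz + height))
    return positions

ring = camera_positions(centroid=(0.0, 0.0, 0.0), distance=2.0)
```

Rendering the model from each of the 12 positions, with every lens aimed at the centroid, produces the view set V = {v1, ..., v12}.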
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010128604.XA CN111461166A (en) | 2020-02-28 | 2020-02-28 | Multi-modal feature fusion method based on LSTM network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111461166A true CN111461166A (en) | 2020-07-28 |
Family
ID=71682464
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010128604.XA Pending CN111461166A (en) | 2020-02-28 | 2020-02-28 | Multi-modal feature fusion method based on LSTM network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461166A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150154229A1 (en) * | 2013-11-29 | 2015-06-04 | Canon Kabushiki Kaisha | Scalable attribute-driven image retrieval and re-ranking |
CN109992686A (en) * | 2019-02-24 | 2019-07-09 | 复旦大学 | Based on multi-angle from the image-text retrieval system and method for attention mechanism |
CN110163091A (en) * | 2019-04-13 | 2019-08-23 | 天津大学 | Method for searching three-dimension model based on LSTM network multimodal information fusion |
2020-02-28: Application CN202010128604.XA filed in China (CN); published as CN111461166A, status Pending.
Non-Patent Citations (2)
Title |
---|
Zhou, H.Y., et al.: "Dual-level Embedding Alignment Network for 2D Image-based 3D Object Retrieval", Proceedings of the 27th ACM International Conference on Multimedia, pages 1667-1675 * |
Pei, Xiaomin; Fan, Huijie; Tang, Yandong: "Human behavior recognition method based on a deep learning network with spatio-temporal feature fusion", Infrared and Laser Engineering, no. 002, pages 55-60 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qi et al. | Review of multi-view 3D object recognition methods based on deep learning | |
Georgiou et al. | A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision | |
Cong et al. | Going from RGB to RGBD saliency: A depth-guided transformation model | |
Zhang et al. | End-to-end photo-sketch generation via fully convolutional representation learning | |
CN107066559B (en) | Three-dimensional model retrieval method based on deep learning | |
Li et al. | A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries | |
Bashir et al. | Vr-proud: Vehicle re-identification using progressive unsupervised deep architecture | |
Trigeorgis et al. | Face normals" in-the-wild" using fully convolutional networks | |
Lu et al. | Learning view-model joint relevance for 3D object retrieval | |
CN111625667A (en) | Three-dimensional model cross-domain retrieval method and system based on complex background image | |
Li et al. | Sketch-based 3D model retrieval utilizing adaptive view clustering and semantic information | |
Hu et al. | RGB-D semantic segmentation: a review | |
Mosella-Montoro et al. | 2d–3d geometric fusion network using multi-neighbourhood graph convolution for rgb-d indoor scene classification | |
Peng et al. | Evaluation of segmentation quality via adaptive composition of reference segmentations | |
Bazazian et al. | DCG-net: Dynamic capsule graph convolutional network for point clouds | |
Liu et al. | 3D model retrieval based on multi-view attentional convolutional neural network | |
Liang et al. | MVCLN: multi-view convolutional LSTM network for cross-media 3D shape recognition | |
Lu et al. | Memory efficient large-scale image-based localization | |
Li et al. | Multi-view-based siamese convolutional neural network for 3D object retrieval | |
Cinaroglu et al. | Long-term image-based vehicle localization improved with learnt semantic descriptors | |
Nie et al. | The assessment of 3D model representation for retrieval with CNN-RNN networks | |
CN114120095A (en) | Mobile robot autonomous positioning system and method based on aerial three-dimensional model | |
Richard et al. | KAPLAN: A 3D point descriptor for shape completion | |
Benhabiles et al. | Convolutional neural network for pottery retrieval | |
CN111461166A (en) | Multi-modal feature fusion method based on LSTM network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
WD01 | Invention patent application deemed withdrawn after publication ||
Application publication date: 2020-07-28