CN111461166A - Multi-modal feature fusion method based on LSTM network - Google Patents

Multi-modal feature fusion method based on LSTM network

Info

Publication number
CN111461166A
CN111461166A (application CN202010128604.XA)
Authority
CN
China
Prior art keywords
dimensional model
network
features
skeleton
lstm network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010128604.XA
Other languages
Chinese (zh)
Inventor
张静
陈闯
聂为之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010128604.XA priority Critical patent/CN111461166A/en
Publication of CN111461166A publication Critical patent/CN111461166A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention discloses a multi-modal feature fusion method based on the LSTM network, which comprises the steps of: 1) placing a plurality of cameras around a three-dimensional model and obtaining a group of multi-view views representing the three-dimensional model through continuous shooting; 2) obtaining the visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and obtaining the skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network; and 3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model. The multi-modal fusion model is constructed by utilizing the information-persistence property of the LSTM, realizing the fusion of the visual features and the structural features of the model.

Description

Multi-modal feature fusion method based on LSTM network
Technical Field
The invention relates to the field of three-dimensional models, and in particular to a multi-modal information fusion method (MIF) based on the LSTM network[1].
Background
With the increasingly common application of three-dimensional models in daily life, the spread of text information alone cannot meet people's requirements for information acquisition, and two-dimensional images have become a main carrier of information. Compared with images, a three-dimensional model can describe the overall topological structure of an object and has a stronger sense of realism. At present, three-dimensional models are applied ever more widely and occupy important positions in fields such as medicine, entertainment and manufacturing, so research on three-dimensional model retrieval methods[2][3][4] has become particularly important. Unlike the representation of two-dimensional images, however, the representation of three-dimensional models is more susceptible to occlusion and illumination, and three-dimensional model retrieval remains a significant challenge. Manually designed features are a common approach to feature representation, for example HOG[5], which reflects the texture distribution of the three-dimensional model, and SIFT[6] features; but manual methods mainly focus on representing local features and cannot describe the global information of the three-dimensional model. In view of this deficiency, view-based and deep learning methods have been introduced into the field of three-dimensional model retrieval. For example, the MVCNN framework[7] characterizes a three-dimensional model with two-dimensional views acquired from multiple perspectives and proposes a multi-view CNN to obtain feature descriptors of the three-dimensional model, and the RotationNet proposed by Asako Kanezaki[8] also applies a multi-view neural network to the three-dimensional model retrieval problem.
The above methods focus on the multi-angle visual information of a three-dimensional model and therefore achieve remarkable results. They share a drawback, however: each focuses only on the information in individual two-dimensional views and ignores the correlation between views. Although this resembles how humans perceive a three-dimensional model visually, human perception in fact involves not only the acquisition of visual information but also its fusion, for example the continuity of visual information and the fusion of information from different sources. Standing on the perspective of human perception, a feature representation method that contains both visual information and structural information is therefore proposed, so as to better characterize a three-dimensional model.
Disclosure of Invention
The invention provides a multi-modal feature fusion method based on the LSTM network. Two LSTM networks are connected in series in the forward direction to construct a three-dimensional model feature extraction model: the input of the first LSTM network is the multi-view views of the three-dimensional model; the input of the second LSTM network is the output of the first LSTM network together with the skeleton features extracted from the multi-view views; and the output of the second LSTM network is the feature obtained after fusing the visual information and the skeleton information of the three-dimensional model. The invention utilizes the state-memory property of the LSTM network to enhance the correlation between two-dimensional views and can better characterize the three-dimensional model. The method is described in detail below:
a multi-modal feature fusion method based on the LSTM network, the method comprising:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein, the step 2) is specifically: the extraction of the three-dimensional model multi-view feature vectors is realized by loading a VGG-Net16 model: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096; the feature vector set F is input into the first LSTM network, which outputs the visual features h_t^1 of the three-dimensional model;
the three-dimensional model skeleton information contained in the multi-view views is located based on the DeepSkeleton network to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12};
wherein, the step 3) takes the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} as the input of the second LSTM network, and takes the output z_t of the second LSTM network as the fused feature; finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*.
The step 1) is specifically as follows:
adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}.
The technical scheme provided by the invention has the beneficial effects that:
1. the invention can represent the global characteristics of the three-dimensional model based on the multi-view characteristics of the three-dimensional model extracted by the neural network;
2. the invention constructs the multi-modal fusion model by utilizing the information-persistence property of the LSTM, and realizes the fusion of the visual features and the structural features of the model.
Drawings
FIG. 1 is a framework diagram of the multi-modal feature fusion method based on the LSTM network;
FIG. 2 is a schematic diagram of a multi-view acquisition of a three-dimensional model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
With the development of three-dimensional model reconstruction tools in recent years, the number of three-dimensional models has kept increasing. Thanks to sustained research in the field of three-dimensional model retrieval, various three-dimensional model descriptors have been proposed. In general, they fall into two types: view-based descriptors[11], which characterize a three-dimensional model using its multi-perspective views, and model-based descriptors[12], which characterize the three-dimensional model using its shape features or topological structure features.
Model-based shape descriptors are further divided into low-level and high-level descriptors. Low-level descriptors mainly include the surface similarity of the model[12], geometric moments[13], voxel distributions[14] and the like. High-level descriptors mainly include spherical harmonic moments[15], polygonal meshes (e.g. triangular meshes)[16] and Reeb graphs[17], and mainly describe the internal structural relationships between different components of the three-dimensional model. Although model-based descriptors represent the structural features of the three-dimensional model, such feature extraction methods suffer from high complexity and low processing speed because the three-dimensional model needs to be reconstructed; these problems are especially prominent when the structure of the three-dimensional model is relatively complex.
View-based descriptors are widely used in the current three-dimensional model retrieval field. This kind of method converts operations on the three-dimensional model into operations on its multi-view views and makes up for the inability of manually designed features to describe the global information of the three-dimensional model; its drawback is that it focuses only on individual two-dimensional view information and ignores the correlation between the two-dimensional views.
Example 1
In order to realize accurate retrieval of a three-dimensional model, the embodiment of the invention provides a multi-modal feature fusion method based on the LSTM network, which is described in detail below with reference to fig. 1:
101: placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
102: acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
103: inputting the acquired visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model.
The specific steps of acquiring the multi-view in step 101 are as follows:
1) adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
2) acquiring a group of 12 multi-perspective views characterizing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}. A multi-view acquisition diagram is shown in fig. 2, and a camera-pose computation sketch is given below.
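The camera placement described in step 101 can be reproduced with a short pose computation. The following is a minimal sketch, assuming a NumPy-only setup; the viewing distance, image resolution and choice of renderer are not specified by the invention and are illustrative here.

```python
# Minimal sketch: compute the 12 virtual-camera poses (every 30 degrees of azimuth,
# each looking down at the model centroid with a 30-degree depression angle).
# The distance value is an assumption, not taken from the patent.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world pose whose -z axis points from eye to target."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = true_up
    pose[:3, 2] = -forward      # OpenGL-style camera looks along -z
    pose[:3, 3] = eye
    return pose

def camera_poses(centroid, distance=2.0, n_views=12, depression_deg=30.0):
    """Return n_views camera-to-world matrices circling the centroid."""
    poses = []
    elev = np.deg2rad(depression_deg)
    for k in range(n_views):
        azim = 2.0 * np.pi * k / n_views
        eye = centroid + distance * np.array([np.cos(azim) * np.cos(elev),
                                              np.sin(azim) * np.cos(elev),
                                              np.sin(elev)])
        poses.append(look_at(eye, centroid))
    return poses

# Example: poses for a model centred at the origin; feed each pose to any
# offscreen renderer (pyrender, Open3D, Blender, ...) to obtain views v1..v12.
poses = camera_poses(np.array([0.0, 0.0, 0.0]))
print(len(poses), poses[0].shape)   # 12 (4, 4)
```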
The specific steps of extracting the visual features and skeleton features in step 102 are as follows:
1) A pre-trained VGG-Net16 model is loaded (its feature extraction function maps each view to a feature vector) to extract the feature vectors of the three-dimensional model multi-view views: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096 and R denotes the real numbers. The feature vector set F is input into the first LSTM network, whose output is the visual feature h_t^1 of the three-dimensional model.
2) The three-dimensional model skeleton information contained in the multi-view views is located by the DeepSkeleton network proposed by Wei Shen et al.[10] to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12}. A per-view feature extraction sketch is given below.
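For concreteness, a minimal sketch of the per-view feature extraction is given below, assuming PyTorch/torchvision. The visual branch takes the 4096-dimensional activation of VGG-16's penultimate fully connected layer; because the published DeepSkeleton weights are not reproduced here, a plain morphological skeleton from scikit-image stands in for the skeleton branch, and the 64-dimensional grid descriptor is purely illustrative.

```python
# Minimal sketch of the two per-view feature streams (visual f_i, skeleton g_i).
# The skeleton branch below is a stand-in for the DeepSkeleton network, whose
# trained weights are not bundled with this sketch.
import numpy as np
import torch
import torchvision
from torchvision import transforms
from skimage.morphology import skeletonize

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d output

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_feature(pil_view):
    """f_i in R^4096 for one rendered view (assumed to be an RGB PIL image)."""
    with torch.no_grad():
        return vgg(preprocess(pil_view).unsqueeze(0)).squeeze(0)   # shape (4096,)

def skeleton_feature(gray_view):
    """Illustrative 64-d skeleton descriptor g_i: an 8x8 grid histogram of skeleton pixels
    (assumes a grayscale render on a near-white background)."""
    mask = np.asarray(gray_view) < 250
    skel = skeletonize(mask)
    h, w = (d - d % 8 for d in skel.shape)              # crop to a multiple of 8
    grid = skel[:h, :w].reshape(8, h // 8, 8, w // 8).sum(axis=(1, 3)).astype(np.float32)
    return torch.from_numpy(grid.flatten() / (grid.sum() + 1e-6))
```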
The specific step of fusing the visual features and the skeleton features in step 103 is as follows: the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} are taken as the input of the second LSTM network, and the output z_t of the second LSTM network is taken as the fused feature; finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*. A sketch of this two-LSTM wiring follows.
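A minimal PyTorch sketch of the two forward-cascaded LSTM networks and the final max pooling is given below; the hidden size of 512 and the skeleton-feature dimension are illustrative choices, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Two forward-cascaded LSTMs: view features -> visual states -> fusion with skeleton features."""
    def __init__(self, view_dim=4096, skel_dim=64, hidden=512):
        super().__init__()
        self.visual_lstm = nn.LSTM(view_dim, hidden, batch_first=True)           # first LSTM
        self.fusion_lstm = nn.LSTM(hidden + skel_dim, hidden, batch_first=True)  # second LSTM

    def forward(self, view_feats, skel_feats):
        # view_feats: (B, 12, 4096) VGG features F; skel_feats: (B, 12, skel_dim) features G
        h1, _ = self.visual_lstm(view_feats)             # (B, 12, hidden): visual states h_t^1
        fused_in = torch.cat([h1, skel_feats], dim=-1)   # per-step input of the second LSTM
        z, _ = self.fusion_lstm(fused_in)                # (B, 12, hidden): fused states z_t
        z_star, _ = z.max(dim=1)                         # max pool over the 12 views -> z*
        return z_star

# Usage on dummy tensors for two models:
model = MultiModalFusion()
z_star = model(torch.randn(2, 12, 4096), torch.randn(2, 12, 64))
print(z_star.shape)   # torch.Size([2, 512])
```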
In summary, in the embodiment of the present invention, the view information and the structure information of the three-dimensional model are fused in the steps 101 to 103, so that the description of the three-dimensional model is more comprehensive, and the accuracy of the retrieval can be improved by applying the method to the three-dimensional model retrieval.
Example 2
The scheme of Example 1 is described in detail below with reference to the drawings and calculation formulas:
Firstly, cameras are placed uniformly around the three-dimensional model at a fixed distance, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees, and a group of 12 multi-perspective views representing the three-dimensional model is acquired by continuous shooting: V = {v_1, v_2, ..., v_12}. In order to fuse the view information and the structure information of the three-dimensional model, the method constructs a model composed of two LSTM networks connected in forward series.
In the LSTM network, the dependency relationship between multi-view features is realized by updating the cell state. The LSTM model protects and controls information by means of three gates, which by function are divided into an input gate, a forgetting gate and an output gate: the input gate receives the information input into the LSTM network, the forgetting gate determines how much of the cell state at the previous time is kept at the current time, and the output gate generates the final fusion information. In the constructed model of two LSTM networks combined in forward series, the input and output of the model are divided into 12 cell states because the input consists of 12 two-dimensional view features.
Wherein, the gate function of the forgetting gate is as follows:
m_t = σ(W_m · [h_{t-1}^1, x_t] + b_m)
where m_t is the screening parameter for the previous unit state information, σ denotes the sigmoid function, h_{t-1}^1 denotes the output of the previous state of the first LSTM network, x_t denotes the input of the current state, W_m denotes the weight matrix of the forgetting gate, and b_m denotes the bias parameter of the forgetting gate.
The gate functions of the input gate are as follows:
i_t = σ(W_i · [h_{t-1}^1, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}^1, x_t] + b_c)
where c̃_t denotes the candidate cell state information to be updated, i_t denotes the screening parameter for the unit state information to be updated, W_i and W_c are weight matrices, and b_c and b_i are bias parameters.
The gate function of the output gate is as follows:
o_t = σ(W_o · [h_{t-1}^1, x_t] + b_o)
where o_t denotes the screening parameter for the current cell state information, and W_o and b_o denote the weight matrix and bias parameter, respectively.
The output of the current state is as follows:
c_t = m_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
f_t^2 = [h_t^1, g_k]
where c_{t-1} and c_t denote the previous and current cell state information respectively, h_t denotes the output of the current cell state, f_t^2 denotes the current input of the second LSTM network, and g_k denotes the extracted skeleton feature information.
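Written out as code, the gate updates above correspond to one step of a standard LSTM cell whose input x_t is, for the second network, the concatenation [h_t^1, g_k]. A minimal NumPy sketch follows; the weight shapes and random initialization are illustrative, not prescribed by the invention.

```python
# One step of the LSTM cell described by the formulas above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W and b hold the gate parameters keyed by 'm' (forget), 'i' (input), 'c' (candidate), 'o' (output)."""
    z = np.concatenate([h_prev, x_t])
    m_t = sigmoid(W["m"] @ z + b["m"])       # forgetting gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = m_t * c_prev + i_t * c_hat         # cell-state update
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # output of the current state
    return h_t, c_t

# Usage with illustrative sizes and random parameters:
hidden, inp = 8, 12
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, hidden + inp)) * 0.1 for k in "mico"}
b = {k: np.zeros(hidden) for k in "mico"}
h, c = lstm_step(rng.standard_normal(inp), np.zeros(hidden), np.zeros(hidden), W, b)
```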
In order to acquire the visual features of the three-dimensional model, the method loads a pre-trained VGG-Net16, whose feature extraction function realizes the extraction of the two-dimensional image feature vectors; the 4096-dimensional vector output by the penultimate fully connected layer of VGG-Net16 is taken as the picture feature: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096. The feature set F is used as the input of the first LSTM network, and the output h_t^1 of the first LSTM network is the visual feature of the three-dimensional model.
In order to fuse the skeleton features of the three-dimensional model, the output h_t^1 of the first-layer LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} are used as the input of the second-layer LSTM network, and the output z_t of the second LSTM network is taken as the updated feature. After 12 such operations, the method obtains the final fusion feature z* by means of the maximum pooling layer.
In summary, the embodiment of the invention integrates the structural features and the visual features of the three-dimensional model to better represent the three-dimensional model, improves the accuracy of the three-dimensional model retrieval based on the integrated features, reduces the calculated amount, and improves the retrieval efficiency.
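Once the fused descriptor z* is available, retrieval reduces to nearest-neighbour ranking of gallery descriptors against the query descriptor. A minimal sketch using cosine similarity is shown below; the function name and dimensions are illustrative and not fixed by the invention.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_z, gallery_z):
    """Return gallery indices sorted by cosine similarity to the query descriptor z*."""
    q = F.normalize(query_z.unsqueeze(0), dim=-1)   # (1, D)
    g = F.normalize(gallery_z, dim=-1)              # (N, D)
    sims = (g @ q.t()).squeeze(1)                   # (N,) cosine similarities
    return torch.argsort(sims, descending=True)

# Usage on dummy descriptors: rank 100 gallery models against one query.
order = rank_gallery(torch.randn(512), torch.randn(100, 512))
```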
Example 3
The following experiments are presented to verify the feasibility of the schemes in Examples 1 and 2, and are described in detail below:
Experimental validation of the embodiments of the invention was carried out on the ModelNet40 database[18] and the NTU database[19]. ModelNet40 is a subset of the ModelNet dataset and contains 12311 CAD models in 40 classes; the NTU database, constructed by National Taiwan University, contains 549 three-dimensional models in 46 classes.
In order to evaluate the application of the method proposed by the embodiment of the invention in the field of three-dimensional model retrieval, evaluation criteria widely used in information retrieval are selected: the P-R curve, NN, FT, ST, F-measure, DCG, ANMRR and MAP are used as evaluation parameters for retrieval performance.
NN: and in the retrieval result, the accuracy of the most similar model retrieved.
FT and ST: FT represents the recall rate when N-P-1, and ST represents the recall rate when K-2 (P-1).
F-Measure: F-Measure is a weighted harmonic mean of precision and recall defined as:
Figure RE-GDA0002482995370000065
PR is an index used to characterize the relationship between accuracy and recall.
MAP is used to represent the average accuracy of the retrieval.
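For reference, the following is a minimal sketch of how the NN, FT, ST, F-measure and average-precision values can be computed for a single query from the ranked gallery labels; the F-measure cut-off of 20 is a common choice, not a value fixed by the text, and the variable names are illustrative.

```python
import numpy as np

def retrieval_metrics(ranked_labels, query_label, n_relevant, f_cutoff=20):
    """Single-query metrics from a ranked list of gallery class labels.
    n_relevant: number of gallery models in the query's class;
    f_cutoff: list length at which F-measure is evaluated (assumed, not from the patent)."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    nn = rel[0]                                           # nearest-neighbour hit
    ft = rel[:n_relevant].sum() / n_relevant              # first-tier recall
    st = rel[:2 * n_relevant].sum() / n_relevant          # second-tier recall
    hits_k = rel[:f_cutoff].sum()
    p, r = hits_k / f_cutoff, hits_k / n_relevant
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0         # F = 2PR / (P + R)
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    ap = (precisions * rel).sum() / max(rel.sum(), 1.0)   # average precision; its mean over queries is MAP
    return dict(NN=nn, FT=ft, ST=st, F=f, AP=ap)
```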
Based on the above evaluation parameters, the performance of the proposed method is compared with CCFV[20], AVC[21], Liu[22], MCG[23] and DLAN[24] on the NTU database; analysis of the experimental results leads to the following conclusions:
both CCFV and AVC can be regarded as retrieval methods based on mathematical statistical models, and they use gaussian function models and bayesian models to describe feature distributions of three-dimensional models, but there are some differences between the two methods, AVC treats each view as an independent individual, and CCFV mainly uses gaussian models, which take into account view variations of feature spaces in data distribution. The two methods only consider the visual attributes of the two-dimensional images and ignore the structural information of the 3D model, so the final retrieval result is poorer than the effect of the MIF method.
In the Liu and MCG methods, the query model is composed of multi-view views and the associations between views; similarity is measured by constructing a graph structure of the three-dimensional model, which then drives the retrieval. However, the skeleton information of the three-dimensional model is ignored, so the retrieval results are less effective than those of the MIF method.
A CNN (convolutional neural network) is applied in the NN method, but the network model used is relatively simple, so its performance is poor compared with the MIF method.
The DLAN method proposes a new deep neural network for 3D model retrieval (3DMR), namely a deep local feature aggregation network. The features generated by this network are rotation invariant, but the method focuses only on local features of the model and ignores the global skeleton information of the three-dimensional model, so it cannot characterize the three-dimensional model well.
In addition, the MIF method is compared with current state-of-the-art methods (PANORAMA-NN[25], GIFT[26], 3D ShapeNets[27], MVCNN[7], LPD[28], SPH[29]) on the ModelNet database.
From the above experiments it can be seen that, compared with the comparison methods, the MIF method obtains improvements of 3-31%, 7.2-20.1%, 2.2-16.4% and 1.3-15.8% on the NN, FT, ST and F-measure indexes respectively, and improvements of 0.04-0.48 (ModelNet10) and 0.03-0.54 (ModelNet40) on MAP.
References
[1] Shi X, Chen Z, Wang H, et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting[J]. 2015.
[2] Liu A, Nie W, Gao Y, et al. View-Based 3-D Model Retrieval: A Benchmark[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wei-Zhi N, An-An L, Yue G, et al. Hyper-Clique Graph Matching and Applications[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018: 1-1.
[4] Zhu L, Shen J, Xie L, et al. Unsupervised Visual Hashing with Semantic Assistant for Content-Based Image Retrieval[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(2): 472-486.
[5] Dalal N, Triggs B. Histograms of Oriented Gradients for Human Detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE, 2005.
[6] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[7] Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition[J]. 2015.
[8] Kanezaki A, Matsushita Y, Nishida Y. RotationNet: Joint Learning of Object Classification and Viewpoint Estimation using Unaligned 3D Object Dataset[J]. 2016.
[9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[10] Shen W, Zhao K, Jiang Y, et al. DeepSkeleton: Learning Multi-task Scale-associated Deep Side Outputs for Object Skeleton Extraction in Natural Images[J]. IEEE Transactions on Image Processing, 2017: 1-1.
[11] Wang D, Wang B, Zhao S, et al. View-based 3D object retrieval with discriminative views[J]. Neurocomputing, 2017, 252: 58-66.
[12] Feature-based similarity search in 3D object databases[J]. ACM Computing Surveys, 2005, 37(4): 345-387.
[13] Paquet E, Rioux M, Murching A, et al. Description of shape information for 2-D and 3-D objects[J]. Signal Processing: Image Communication, 2000, 16(1-2): 103-122.
[14] Papoiu A D P, Emerson N M, Patel T S, et al. Voxel-based morphometry and arterial spin labeling fMRI reveal neuropathic and neuroplastic features of brain processing of itch in end-stage renal disease[J]. Journal of Neurophysiology, 2014, 112(7): 1729-1738.
[15] Liu Q. A survey of recent view-based 3d model retrieval methods[J]. arXiv preprint arXiv:1208.3670, 2012.
[16] Tangelder J W H, Veltkamp R C. Polyhedral model retrieval using weighted point sets[J]. International Journal of Image and Graphics, 2003, 3(01): 209-229.
[17] Shinagawa Y, Kunii T L. Constructing a Reeb graph automatically from cross sections[J]. IEEE Computer Graphics and Applications, 1991(6): 44-51.
[18] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[19] Chen D Y, Tian X P, Shen Y T, et al. On Visual Similarity Based 3D Model Retrieval[J]. Computer Graphics Forum, 2003, 22(3): 223-232.
[20] Gao Y, Tang J, Hong R, et al. Camera constraint-free view-based 3-D object retrieval[J]. IEEE Transactions on Image Processing, 2011, 21(4): 2269-2281.
[21] Ansary T F, Daoudi M, Vandeborre J P. A bayesian 3-d search engine using adaptive views clustering[J]. IEEE Transactions on Multimedia, 2006, 9(1): 78-88.
[22] Liu A, Wang Z, Nie W, et al. Graph-based characteristic view set extraction and matching for 3D model retrieval[J]. Information Sciences, 2015, 320: 429-442.
[23] Liu A A, Nie W Z, Gao Y, et al. Multi-modal clique-graph matching for view-based 3d model retrieval[J]. IEEE Transactions on Image Processing, 2016, 25(5): 2103-2116.
[24] Furuya T, Ohbuchi R. Deep Aggregation of Local 3D Geometric Features for 3D Model Retrieval[C]//BMVC. 2016, 7: 8.
[25] Sfikas K, Theoharis T, Pratikakis I. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval[J]. 3DOR, 2017, 6: 7.
[26] Bai S, Bai X, Zhou Z, et al. Gift: Towards scalable 3d shape retrieval[J]. IEEE Transactions on Multimedia, 2017, 19(6): 1257-1271.
[27] Wu Z, Song S, Khosla A, et al. 3d shapenets: A deep representation for volumetric shapes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1912-1920.
[28] Chen D Y, Tian X P, Shen Y T, et al. On visual similarity based 3D model retrieval[C]//Computer Graphics Forum. Oxford, UK: Blackwell Publishing, Inc, 2003, 22(3): 223-232.
[29] Kazhdan M, Funkhouser T, Rusinkiewicz S. Rotation invariant spherical harmonic representation of 3D shape descriptors[C]//Symposium on Geometry Processing. 2003, 6: 156-164.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A multi-modal feature fusion method based on the LSTM network, characterized in that the method comprises:
1) placing a plurality of cameras around the three-dimensional model, and acquiring a group of multi-view views representing the three-dimensional model through continuous shooting;
2) acquiring visual features of the three-dimensional model contained in the multi-view views based on a first LSTM network, and acquiring skeleton features of the three-dimensional model contained in the multi-view views based on a DeepSkeleton network;
3) inputting the obtained visual features and skeleton features into a second LSTM network to realize the fusion of the multi-modal features of the three-dimensional model;
wherein, the step 2) is specifically: the extraction of the three-dimensional model multi-view feature vectors is realized by loading a VGG-Net16 model: F = {f_1, f_2, f_3, ..., f_12}, where f_i ∈ R^4096; the feature vector set F is input into the first LSTM network, which outputs the visual features h_t^1 of the three-dimensional model;
the three-dimensional model skeleton information contained in the multi-view views is located based on a DeepSkeleton network to judge whether a pixel point is a skeleton pixel, and the skeleton features of the three-dimensional model are extracted through regression prediction of the skeleton elements: G = {g_1, g_2, ..., g_12};
wherein, the step 3) takes the output h_t^1 of the first LSTM network and the extracted skeleton features G = {g_1, g_2, ..., g_12} as the input of the second LSTM network, and takes the output z_t of the second LSTM network as the fused feature;
and finally, the output results of the 12 views are processed by the maximum pooling layer to obtain the final fusion feature z*.
2. The multi-modal feature fusion method based on the LSTM network according to claim 1, wherein the step 1) is specifically as follows:
adjusting the rotation direction and the rotation angle of the three-dimensional model, and placing 12 virtual cameras at a fixed distance around the three-dimensional model, with each camera lens aimed at the centroid of the three-dimensional model at a depression angle of 30 degrees;
acquiring 12 multi-view views representing the three-dimensional model by continuous shooting: V = {v_1, v_2, ..., v_12}.
CN202010128604.XA 2020-02-28 2020-02-28 Multi-modal feature fusion method based on LSTM network Pending CN111461166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010128604.XA CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010128604.XA CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Publications (1)

Publication Number Publication Date
CN111461166A true CN111461166A (en) 2020-07-28

Family

ID=71682464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010128604.XA Pending CN111461166A (en) 2020-02-28 2020-02-28 Multi-modal feature fusion method based on L STM network

Country Status (1)

Country Link
CN (1) CN111461166A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154229A1 (en) * 2013-11-29 2015-06-04 Canon Kabushiki Kaisha Scalable attribute-driven image retrieval and re-ranking
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110163091A (en) * 2019-04-13 2019-08-23 天津大学 Method for searching three-dimension model based on LSTM network multimodal information fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154229A1 (en) * 2013-11-29 2015-06-04 Canon Kabushiki Kaisha Scalable attribute-driven image retrieval and re-ranking
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110163091A (en) * 2019-04-13 2019-08-23 天津大学 Method for searching three-dimension model based on LSTM network multimodal information fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou, H. Y., et al.: "Dual-level Embedding Alignment Network for 2D Image-based 3D Object Retrieval", Proceedings of the 27th ACM International Conference on Multimedia, pages 1667-1675 *
裴晓敏; 范慧杰; 唐延东: "Human action recognition method based on a spatio-temporal feature fusion deep learning network", Infrared and Laser Engineering (红外与激光工程), no. 002, pages 55-60 *

Similar Documents

Publication Publication Date Title
Qi et al. Review of multi-view 3D object recognition methods based on deep learning
Georgiou et al. A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision
Cong et al. Going from RGB to RGBD saliency: A depth-guided transformation model
Zhang et al. End-to-end photo-sketch generation via fully convolutional representation learning
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
Li et al. A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries
Bashir et al. Vr-proud: Vehicle re-identification using progressive unsupervised deep architecture
Trigeorgis et al. Face normals" in-the-wild" using fully convolutional networks
Lu et al. Learning view-model joint relevance for 3D object retrieval
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
Li et al. Sketch-based 3D model retrieval utilizing adaptive view clustering and semantic information
Hu et al. RGB-D semantic segmentation: a review
Mosella-Montoro et al. 2d–3d geometric fusion network using multi-neighbourhood graph convolution for rgb-d indoor scene classification
Peng et al. Evaluation of segmentation quality via adaptive composition of reference segmentations
Bazazian et al. DCG-net: Dynamic capsule graph convolutional network for point clouds
Liu et al. 3D model retrieval based on multi-view attentional convolutional neural network
Liang et al. MVCLN: multi-view convolutional LSTM network for cross-media 3D shape recognition
Lu et al. Memory efficient large-scale image-based localization
Li et al. Multi-view-based siamese convolutional neural network for 3D object retrieval
Cinaroglu et al. Long-term image-based vehicle localization improved with learnt semantic descriptors
Nie et al. The assessment of 3D model representation for retrieval with CNN-RNN networks
CN114120095A (en) Mobile robot autonomous positioning system and method based on aerial three-dimensional model
Richard et al. KAPLAN: A 3D point descriptor for shape completion
Benhabiles et al. Convolutional neural network for pottery retrieval
CN111461166A (en) Multi-modal feature fusion method based on LSTM network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200728