CN113191401A - Method and device for three-dimensional model recognition based on visual saliency sharing
- Publication number: CN113191401A (application number CN202110402748.4A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/2135: Pattern recognition; feature extraction, e.g. by transforming the feature space, based on approximation criteria such as principal component analysis
- G06F18/253: Pattern recognition; fusion techniques of extracted features
- G06N3/045: Neural networks; combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Neural networks; learning methods
Abstract
The application discloses a method and a device for three-dimensional model recognition based on visual saliency sharing. The method comprises the following steps: acquiring a three-dimensional model to be retrieved; acquiring a two-dimensional view sequence from the three-dimensional model to be retrieved; acquiring visual feature vectors of the two-dimensional view sequence; inputting the visual features into an MVCNN branch and a visual saliency branch, and fusing the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features; and retrieving or classifying the three-dimensional model to be retrieved using the fused features. Because the visual saliency branch uses an LSTM network, a representation of the three-dimensional model that takes all views into account and contains both global and dependency information can be extracted directly from the last cell state. This solves the information-loss problem of existing multi-view methods.
Description
Technical Field
The invention relates to the technical field of three-dimensional model retrieval and classification, and in particular to a method and a device for three-dimensional model recognition based on visual saliency sharing.
Background
In recent years, three-dimensional technology has become widespread in the film and television industry. Three-dimensional models have spread into every corner of daily life, so there is a natural need to explore more efficient ways of learning representations of 3D models. Furthermore, with the development of computer vision and three-dimensional reconstruction techniques, three-dimensional shape recognition has become a fundamental task in shape analysis and the most critical technique for processing and analyzing three-dimensional data. Thanks to the availability of powerful deep neural networks and large-scale labeled three-dimensional shape sets, deep networks for three-dimensional shape recognition have produced many research achievements, such as MVCNN, 3DShapeNets, PointNet and VoxNet.
Among current methods, view-based methods work best. The best-known example is the multi-view convolutional neural network (MVCNN), which combines the features of multiple two-dimensional projections learned by a convolutional neural network (CNN) in an end-to-end trainable manner. This approach has become a milestone for three-dimensional shape recognition and achieved state-of-the-art performance. To build deep learning models that unify three-dimensional object classification and retrieval, much work in the field has followed MVCNN.
The current problem with multi-view-based three-dimensional model classification and retrieval is:
In current methods, all views are treated equally when generating the shape descriptor, ignoring the similarities and differences between views. For example, in MVCNN, visual features pass through a view-merging layer to generate the shape descriptor, but that layer retains only the information of the view with the maximum value and discards the information of the other views, so the similarities and differences among the multiple views cannot be fully exploited. Accordingly, a technical solution is desired that overcomes, or at least alleviates, at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
It is an object of the present invention to provide a method for three-dimensional model identification based on visual saliency sharing that overcomes or at least alleviates at least one of the above-mentioned drawbacks of the prior art.
In one aspect of the present invention, a method for three-dimensional model recognition based on visual saliency sharing is provided, and the method for three-dimensional model recognition based on visual saliency sharing comprises:
acquiring a three-dimensional model to be retrieved;
acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
acquiring a visual feature vector of the two-dimensional view sequence;
inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
Optionally, the inputting the visual features into the MVCNN branch and the visual saliency branch, and the fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features includes:
respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
inputting the weights into the MVCNN branch, and generating three-dimensional model complex features by the MVCNN branch according to the weights and the visual feature vector of the two-dimensional view sequence;
feature fusing the visually significant features and the three-dimensional model complex features to form fused features.
Optionally, the MVCNN branch comprises a convolutional neural network with a view attention pooling layer.
Optionally, the visual saliency branch comprises two LSTM layers; feature learning is performed through the two LSTM layers and a soft attention mechanism. The first LSTM module and the soft attention mechanism weight each visual feature vector, and the resulting view weights are input to the MVCNN branch to guide the fusion that forms the three-dimensional model complex features.
Optionally, the obtaining a two-dimensional view sequence according to the three-dimensional model to be retrieved includes:
carrying out normalization processing on the three-dimensional model to be retrieved using the NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form the sequence of two-dimensional views.
Optionally, the extracting a set of views from the three-dimensional model to be retrieved includes:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
The application also provides a device for three-dimensional model recognition based on visual saliency sharing, which comprises:
the three-dimensional model acquisition module is used for acquiring a three-dimensional model to be retrieved;
the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
a visual feature vector acquisition module, configured to acquire a visual feature vector of the two-dimensional view sequence;
a fusion module for inputting the visual features into the MVCNN branch and the visual saliency branch, fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
The present application further provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method for three-dimensional model recognition based on visual saliency sharing as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program enabling, when being executed by a processor, a method for three-dimensional model identification based on visual saliency sharing as described above.
Advantageous effects
1. The method uses ResNet18 to extract features from the view sequence of a three-dimensional model. This network better optimizes the training of a deep network, and compared with other classical convolutional neural networks (such as AlexNet and VGG-Net), ResNet18 achieves a better balance between accuracy and storage cost.
2. The present application introduces an attention mechanism that allows the neural network to focus on specific parts of the input image, reducing task complexity and discarding irrelevant information. The invention adopts a soft attention mechanism, so the network keeps attending to all information of the three-dimensional object and learns where to pay more attention, making full use of inter-view correlation information while avoiding excessive computing cost.
3. Because the visual saliency branch uses an LSTM network, a representation of the three-dimensional model that takes all views into account and contains both global and dependency information can be extracted directly from the last cell state. This solves the information-loss problem of existing multi-view methods.
Drawings
Fig. 1 is a flowchart illustrating a method for three-dimensional model recognition based on visual saliency sharing according to a first embodiment of the present application.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing a method for three-dimensional model recognition based on visual saliency sharing provided according to one embodiment of the present application.
FIG. 3 is a graphical comparison of performance for different numbers of views of the method for three-dimensional model recognition based on visual saliency sharing shown in FIG. 1.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are a subset of the embodiments in the present application and not all embodiments in the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the description of the present application, it is to be understood that the terms "central," "longitudinal," "lateral," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings for convenience in describing the present application and for simplicity in description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated in a particular manner and are not to be considered limiting of the scope of the present application.
Fig. 1 is a flowchart illustrating a method for three-dimensional model recognition based on visual saliency sharing according to a first embodiment of the present application.
The method for three-dimensional model recognition based on visual saliency sharing as shown in fig. 1 comprises:
step 1: acquiring a three-dimensional model to be retrieved;
step 2: acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
Step 3: acquiring visual feature vectors of the two-dimensional view sequence;
Step 4: inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
Step 5: retrieving or classifying the three-dimensional model to be retrieved through the fused features.
In this embodiment, step 4: inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features includes:
step 41: respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
step 42: acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
step 43: inputting the weight into an MVCNN branch, and generating a three-dimensional model complex feature by the MVCNN branch according to the weight and a visual feature vector of a two-dimensional view sequence;
step 44: and performing feature fusion on the visual saliency features and the complex features of the three-dimensional model to form fused features.
In this embodiment, the MVCNN branch includes a convolutional neural network with a view attention pooling layer.
In this embodiment, the visual saliency branch comprises two LSTM layers; feature learning is performed through the two LSTM layers and the soft attention mechanism. The first LSTM module and the soft attention mechanism weight each visual feature vector, and the view weights are input to the MVCNN branch to guide the fusion that forms the complex features of the three-dimensional model.
In this embodiment, acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved includes:
carrying out normalization processing on the three-dimensional model to be retrieved using the NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form a sequence of two-dimensional views.
In this embodiment, the extracting a set of views from the three-dimensional model to be retrieved includes:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
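A minimal sketch of this view-sampling step (the helper function name and pure-Python form are illustrative assumptions; the actual rendering of the views from the 3D model is not shown):

```python
# Sketch: compute the evenly spaced camera azimuths around the Z axis used
# to capture the two-dimensional view sequence (30-degree steps -> 12 views).
# A renderer would then photograph the model at each azimuth.

def view_azimuths(num_views=12):
    """Return evenly spaced azimuth angles in degrees around the Z axis."""
    step = 360.0 / num_views  # 30 degrees when num_views == 12
    return [i * step for i in range(num_views)]

angles = view_azimuths()
print(angles[:4])  # [0.0, 30.0, 60.0, 90.0]
```

Changing `num_views` yields the alternative view counts explored in the experiments (e.g. 2, 4, 6, 8, 10, 12, 20 views).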
The present application is described in further detail below by way of examples, it being understood that the examples do not constitute any limitation to the present application.
The dual-stream network based on visual saliency sharing mainly comprises two branches. The first is the visual saliency branch, which defines view weights based on the similarity and difference information of the multiple views and guides visual information fusion in the MVCNN model. The second is the multi-view convolutional neural network (MVCNN) branch, which extracts visual information from the captured views.
Step 1: acquire the three-dimensional model to be retrieved.
Step 2: extract two-dimensional views from each three-dimensional model. The views are obtained by photographing the three-dimensional model at intervals of 30° with its Z-axis as the rotation axis, yielding a sequence of 12 views, i.e. the two-dimensional view sequence.
Step 3: acquire the visual feature vectors of the two-dimensional view sequence. Specifically, the 4096-dimensional visual feature vector of each view is extracted through the convolutional neural network ResNet18.
Step 4: input the visual features into the MVCNN branch and the visual saliency branch, and fuse the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features. Specifically, the visual feature vectors are input into the visual saliency branch and the MVCNN branch of the network, respectively. In the visual saliency branch, each visual feature vector in V = {v1, …, vn} is weighted using the LSTM layer and the soft attention mechanism, thereby assigning a weight to each view of the model. The view weights produced in the visual saliency branch are then passed into the MVCNN branch to guide visual information fusion: the view weights and the view feature vectors extracted in step 3 pass through the MVCNN, which contains an attention pooling layer, to obtain the feature vector of the MVCNN branch.
Step 5: retrieve or classify the three-dimensional model to be retrieved through the fused features. Specifically, the visual saliency features output by the visual saliency branch are fused with the complex features output by the MVCNN branch to obtain the final features of the model for classification and retrieval.
The operation of weighting each visual feature vector with the LSTM layer and the soft attention mechanism in step 4 is as follows:
1) LSTM is a special recurrent neural network that maintains a hidden state h_t and an internal memory state c_t. The hidden state h_t is computed from the memory state c_t through the output gate:
h_t = o_t ⊙ c_t
where ⊙ denotes element-wise multiplication. The output gate o_t is calculated as:
o_t = σ(U_0[h_{t-1}, v_{i,t}] + b_0)
where σ is the logistic sigmoid function, v_{i,t} is the feature vector of the i-th virtual view at time t, and U_0 and b_0 are the weight matrix (applied to the previous hidden state and the input) and the bias of the output gate. The current memory state c_t is determined by the previous memory state c_{t-1} and the candidate memory c̃_t:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
where the forget gate f_t and the input gate i_t are computed in turn as:
f_t = σ(U_f[h_{t-1}, v_{i,t}] + b_f)
i_t = σ(U_i[h_{t-1}, v_{i,t}] + b_i)
The candidate memory c̃_t is:
c̃_t = tanh(U_C[h_{t-1}, v_{i,t}] + b_C)
where U_C and b_C are the corresponding weight matrix and bias.
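The LSTM gate computations described above can be sketched in plain Python. This is a didactic scalar (one-dimensional) single-step cell: the weight values are arbitrary illustrative assumptions, not the patent's trained parameters, and h_t = o_t ⊙ c_t follows the text above.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, c_prev, v, U, b):
    """One LSTM step over the concatenated input [h_prev, v] (scalar case).

    U and b hold one (u_h, u_v) weight pair and one bias per gate:
    'o' output, 'f' forget, 'i' input, 'c' candidate memory.
    """
    z = {g: U[g][0] * h_prev + U[g][1] * v + b[g] for g in U}
    o = sigmoid(z["o"])           # output gate o_t
    f = sigmoid(z["f"])           # forget gate f_t
    i = sigmoid(z["i"])           # input gate i_t
    c_tilde = math.tanh(z["c"])   # candidate memory
    c = f * c_prev + i * c_tilde  # c_t = f_t (.) c_{t-1} + i_t (.) candidate
    h = o * c                     # h_t = o_t (.) c_t, as in the text
    return h, c

# Illustrative parameters (assumed, not trained):
U = {"o": (0.5, 0.5), "f": (0.1, 0.9), "i": (0.3, 0.7), "c": (0.2, 0.8)}
b = {"o": 0.0, "f": 0.0, "i": 0.0, "c": 0.0}
h, c = lstm_step(0.0, 0.0, 1.0, U, b)
```

In the patent's setting v would be a 4096-dimensional view feature and the gates matrix-valued; the scalar form only mirrors the equations.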
The view weight a_i is calculated using the attention mechanism as:
e_i = w^T tanh(U_h[h_{t-1}, v_{i,t}] + b_v)
where a_i is the softmax-normalized score, a_i = exp(e_i) / Σ_j exp(e_j), so that the weights over all views sum to 1.
the operation of transmitting the view weight in the step 4 into the MVCNN branch to guide the visual fusion specifically includes:
1) using the average of the dynamically weighted sum of the multi-view feature vectors, thereby
Where N is the number of input views and V is V1,...,vnIs a set of visual features of a three-dimensional object. After focusing on the attention, ψ (V) is output.
2) The feature vector of the MVCNN branch is then obtained through the convolutional neural network.
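The attention weighting and pooling above can be sketched in plain Python. This is a toy example: the feature values and scores are arbitrary illustrative numbers, and real view features would be 4096-dimensional ResNet18 vectors.

```python
import math

def attention_pool(views, scores):
    """psi(V): average of the dynamically weighted sum of view features.

    views  - list of per-view feature vectors
    scores - unnormalized attention scores e_i, one per view
    Returns the softmax weights a_i and the pooled feature psi(V).
    """
    n = len(views)
    exp_e = [math.exp(e) for e in scores]
    total = sum(exp_e)
    a = [x / total for x in exp_e]  # a_i = softmax(e_i), sums to 1
    dim = len(views[0])
    pooled = [sum(a[i] * views[i][d] for i in range(n)) / n
              for d in range(dim)]
    return a, pooled

views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D features, 3 views
scores = [0.2, 0.1, 0.7]                      # assumed attention scores
a, pooled = attention_pool(views, scores)
```

With equal scores the weights become uniform and ψ(V) reduces to the plain average of the view features, which matches the intuition that attention departs from uniform pooling only when some views are more salient.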
In summary, the embodiment of the present invention extracts the feature information of the three-dimensional model through the above steps and, through the attention mechanism, fully attends to the difference and correlation information between views, so that the feature vectors describe the three-dimensional model more comprehensively, information loss is avoided, and the recognition, classification and retrieval of three-dimensional models are more accurate and scientific.
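The fusion of the visual saliency features with the MVCNN complex features described above can be sketched as simple feature concatenation (concatenation is an illustrative assumption; the patent states only that the two feature vectors are fused into the final descriptor):

```python
def fuse_features(saliency_feat, mvcnn_feat):
    """Fuse the visual saliency feature with the MVCNN complex feature.

    Concatenation is one common fusion choice; the fused vector is then
    fed to a classifier or used directly as the retrieval descriptor.
    """
    return list(saliency_feat) + list(mvcnn_feat)

fused = fuse_features([0.1, 0.2], [0.3, 0.4, 0.5])
# fused has len(saliency_feat) + len(mvcnn_feat) dimensions
```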
Example 2:
the feasibility of the protocol of example 1 is verified below with reference to specific examples, which are described in detail below:
We used the ModelNet40 and ShapeNetCore55 datasets to evaluate the performance of the method in three-dimensional shape recognition and retrieval. ModelNet40 is a subset of ModelNet containing a total of 12311 CAD models from 40 classes. The models were adjusted manually, but no pose normalization was performed. The training and testing subsets of ModelNet40 contain 9843 and 2468 models, respectively. ShapeNetCore55 is a subset of ShapeNet that contains approximately 51300 three-dimensional models in 55 common classes, each subdivided into several subclasses. ShapeNetCore is divided into three parts: training, validation and test sets in proportions of 70%, 10% and 20%, respectively. The models are in OBJ format, and two dataset versions are provided: a consistently aligned (regular) dataset, and a more challenging dataset in which the models are perturbed by random rotations.
In this approach, we introduce a visual saliency model (comprising two LSTM layers and one soft attention network) to account for structural and dependency information across the multiple captured views and to guide the pooling of visual features in the MVCNN branch. To verify the effectiveness of our approach, we designed experiments for each component of the network. As shown in the table below, we use different parts of the network to perform three-dimensional shape classification on the ModelNet40 dataset to verify the validity of the attention weights. The relevant experimental results are shown in the first and second rows of the table. The results show that our attention weights focus the model on more representative views, resulting in better performance in three-dimensional shape recognition. It can be seen that a design that treats the captured views as a view sequence and utilizes their structural information is reasonable and feasible for three-dimensional shape recognition. The experimental results demonstrate that the proposed network architecture obtains a better three-dimensional model representation.
Table 1 shows the effect of different parts of the network structure on classification tasks
To verify the efficiency of the proposed network, we performed 3D shape classification and retrieval experiments on the Princeton ModelNet dataset.
For the dataset, we followed the same ModelNet40 training and test split as Wu et al. In the experiments, we compared the dual-stream model with various models based on different representations, including a volume-based model (3D ShapeNets by Wu et al.), manual descriptors of multi-view data (SPH by Kazhdan et al. and LFD by Chen et al.), deep learning models for multi-view data (MVCNN by Su et al. and MVCNN-MultiRes by Qi et al.), point cloud-based models (PointNet and PointNet++ by Qi et al., KD-Network by Klokov et al., PointCNN by Li et al. and DGCNN by Wang et al.), and a panoramic-view-based model (PANORAMA-NN by Sfikas et al.). The following table provides the classification and retrieval results for all compared methods. The results show that the classification accuracy of the invention is higher.
TABLE 2 comparison of Classification accuracy of the methods on the ModelNet40 dataset
On ShapeNetCore55, macro-averaged evaluation metrics are used to provide an unweighted average over the entire dataset, while micro-averaged metrics adjust for the size of the model classes and thus provide a representative performance average across classes. Evaluation code for calculating all these metrics is provided on the official SHREC contest website.
We obtained results for the pose-normalized and perturbed retrieval experiments on the ShapeNetCore dataset, respectively. The comparison was performed against the higher-precision methods from the large-scale 3D shape retrieval tracks on ShapeNetCore55 in SHREC2016 and SHREC2017. We tested and verified the superiority of the method on the ModelNet dataset and the ShapeNetCore55 dataset, and the retrieval results in Table 3 illustrate its robustness. From Table 3 below, we can see that the method is superior to the other methods, including RotationNet, in terms of the macro-averaged F-score, mAP and NDCG metrics. On the micro-averaged metrics, the method outperforms almost all other methods and is always very close to the best results on this dataset.
TABLE 3 retrieval accuracy as measured by mAP, F-SCORE and NDCG on SHAPENETCORE55 dataset
The number of views rendered from a three-dimensional object may affect retrieval and classification performance, so we conducted comparative experiments to select the optimal number of views. We set up the virtual camera array at an angle θ around the Z-axis. With θ set to 180°, 90°, 60°, 45°, 36°, 30° and 18°, respectively, 2, 4, 6, 8, 10, 12 and 20 views are generated for each three-dimensional object.
As shown in Table 4, when the number of views was set to 12, NN, FT, ST, DCG, F-score and ANMRR improved by 15.8% to 46.7%, 11.8% to 118.8%, 17.0% to 71.5%, 18.0% to 52.4%, 12.0% to 95.6% and 43.6% to 77.9%, respectively, compared with the other numbers of views. Therefore, we set the optimal number of views to 12.
TABLE 4 Performance of different view quantities on ModelNet40
Intuitively, the view capture order directly affects three-dimensional object feature learning. To verify whether our method is limited to a particular view order, we randomly shuffled the view order of the multi-view sequence 50 times during testing to verify the robustness of the proposed network. The retrieval and classification results are shown in the following table.
TABLE 5 Performance comparison of different view orders on ModelNet40
The results with out-of-order input views are even better than those with in-order input views. This shows that the present invention can adaptively compute the importance of each view and, with the help of this information, exploit the structural and visual information across the multiple captured views.
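The shuffled-order robustness test described above can be sketched as follows, with a placeholder `model_fn` standing in for the trained network's forward pass:

```python
import random

def shuffled_order_trials(views, model_fn, trials=50, seed=0):
    """Run `model_fn` on `trials` random permutations of the view sequence,
    mimicking the shuffled-order robustness test described above.
    `model_fn` is a placeholder for the trained network's forward pass."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        shuffled = list(views)
        rng.shuffle(shuffled)
        outputs.append(model_fn(shuffled))
    return outputs

# An order-invariant model (e.g. one that pools over views) gives the same
# prediction for every permutation; here a simple sum stands in for it.
results = shuffled_order_trials(list(range(12)), model_fn=sum)
print(len(results), len(set(results)))  # 50 1
```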
The application also provides an apparatus for three-dimensional model recognition based on visual saliency sharing, which comprises a three-dimensional model acquisition module, a two-dimensional view sequence acquisition module, a visual feature vector acquisition module, a fusion module, and a classification or retrieval module. The three-dimensional model acquisition module is used for acquiring the three-dimensional model to be retrieved; the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence from the three-dimensional model to be retrieved; the visual feature vector acquisition module is used for acquiring the visual feature vectors of the two-dimensional view sequence; the fusion module is used for inputting the visual feature vectors into the MVCNN branch and the visual saliency branch and fusing the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features; and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved using the fused features.
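As a rough sketch of how a fusion module of this kind might combine the two branches, the toy code below weights per-view features with a softmax attention and concatenates the weighted sum with a saliency-style summary. The scoring and pooling rules here are invented stand-ins, not the patent's actual LSTM/MVCNN computation:

```python
import numpy as np

def soft_attention_weights(view_features):
    """Softmax over one scalar score per view; a stand-in for the view
    weighting produced by the LSTM + soft attention branch (the scoring
    rule here is an invented placeholder)."""
    scores = view_features.mean(axis=1)  # one scalar per view
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

def fuse(view_features):
    """Concatenate an attention-weighted sum (MVCNN-branch stand-in) with a
    max-pooled summary (visual-saliency stand-in) to form a fused feature."""
    w = soft_attention_weights(view_features)              # shape (n_views,)
    complex_feat = (w[:, None] * view_features).sum(axis=0)
    saliency_feat = view_features.max(axis=0)
    return np.concatenate([complex_feat, saliency_feat])

feats = np.random.default_rng(0).normal(size=(12, 8))  # 12 views, 8-d features
fused = fuse(feats)
print(fused.shape)  # (16,)
```

The design point the sketch illustrates is that the fused descriptor carries both a view-importance-weighted aggregate and a saliency summary, so the downstream classifier or retrieval index sees information from both branches.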
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and are not repeated herein.
The present application further provides an electronic device comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, the processor when executing the computer program implementing the above method for three-dimensional model recognition based on visual saliency sharing.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method for three-dimensional model recognition based on visual saliency sharing.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing a method for three-dimensional model recognition based on visual saliency sharing provided according to one embodiment of the present application.
As shown in fig. 2, the electronic device includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processor 503, the memory 504 and the output interface 505 are connected to each other through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and thereby to the other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits it to the central processor 503 through the input interface 502; the central processor 503 processes the input information based on computer-executable instructions stored in the memory 504 to generate output information, stores the output information temporarily or permanently in the memory 504, and then transmits it to the output device 506 through the output interface 505; the output device 506 outputs the output information outside the electronic device for use by the user.
That is, the electronic device shown in fig. 2 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors which, when executing the computer-executable instructions, may implement the method for three-dimensional model recognition based on visual saliency sharing described in connection with fig. 1.
In one embodiment, the electronic device shown in fig. 2 may be implemented to include: a memory 504 configured to store executable program code; one or more processors 503 configured to execute executable program code stored in the memory 504 to perform the method for three-dimensional model recognition based on visual saliency sharing in the above-described embodiments.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Furthermore, it will be obvious that the term "comprising" does not exclude other elements or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or overall device by software or hardware. The terms first, second, etc. are used to identify names, but not any particular order.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks identified in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The Processor in this embodiment may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store computer programs and/or modules, and the processor may implement the various functions of the apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
In this embodiment, if the modules/units integrated in the apparatus/terminal device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method in the embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (9)
1. A method for three-dimensional model identification based on visual saliency sharing, characterized in that the method for three-dimensional model identification based on visual saliency sharing comprises:
acquiring a three-dimensional model to be retrieved;
acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
acquiring a visual feature vector of the two-dimensional view sequence;
inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
2. The method for three-dimensional model recognition based on visual saliency sharing of claim 1, wherein said inputting the visual features into an MVCNN branch and a visual saliency branch, fusing complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features comprises:
respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
inputting the weights into the MVCNN branch, and generating three-dimensional model complex features by the MVCNN branch according to the weights and the visual feature vector of the two-dimensional view sequence;
feature fusing the visually significant features and the three-dimensional model complex features to form fused features.
3. The method for three-dimensional model identification based on visual saliency sharing of claim 2, wherein said MVCNN branch comprises a convolutional neural network with a view attention pooling layer.
4. The method for three-dimensional model identification based on visual saliency sharing of claim 3, wherein said visual saliency branch comprises two LSTM layers, and feature learning is performed through said two LSTM layers and a soft attention mechanism; a first LSTM module and the soft attention mechanism are employed to weight each visual feature vector, and the view weights are input to the MVCNN branch to guide fusion to form said three-dimensional model complex features.
5. The method for three-dimensional model recognition based on visual saliency sharing of claim 4, wherein said obtaining a sequence of two-dimensional views from said three-dimensional model to be retrieved comprises:
normalizing the three-dimensional model to be retrieved using an NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form the sequence of two-dimensional views.
6. The method for three-dimensional model recognition based on visual saliency sharing of claim 5, wherein said extracting a set of views from a three-dimensional model to be retrieved comprises:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
7. An apparatus for three-dimensional model recognition based on visual saliency sharing, characterized in that the apparatus for three-dimensional model recognition based on visual saliency sharing comprises:
the three-dimensional model acquisition module is used for acquiring a three-dimensional model to be retrieved;
the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
a visual feature vector acquisition module, configured to acquire a visual feature vector of the two-dimensional view sequence;
a fusion module for inputting the visual features into the MVCNN branch and the visual saliency branch, fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for three-dimensional model recognition based on visual saliency sharing of any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is capable of carrying out the method for three-dimensional model recognition based on visual saliency sharing according to one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110402748.4A CN113191401A (en) | 2021-04-14 | 2021-04-14 | Method and device for three-dimensional model recognition based on visual saliency sharing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113191401A (en) | 2021-07-30
Family
ID=76973976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110402748.4A (Pending) | Method and device for three-dimensional model recognition based on visual saliency sharing | 2021-04-14 | 2021-04-14
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191401A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869120A (en) * | 2021-08-26 | 2021-12-31 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298262A (en) * | 2019-06-06 | 2019-10-01 | Huawei Technologies Co., Ltd. | Object identification method and device |
CN110347873A (en) * | 2019-06-26 | 2019-10-18 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Video classification method, apparatus, electronic device and storage medium |
CN111242207A (en) * | 2020-01-08 | 2020-06-05 | Tianjin University | Three-dimensional model classification and retrieval method based on visual saliency information sharing |
CN111582342A (en) * | 2020-04-29 | 2020-08-25 | Tencent Technology (Shenzhen) Co., Ltd. | Image identification method, device, equipment and readable storage medium |
Non-Patent Citations (3)
Title |
---|
CHAO MA ET AL: "Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval", IEEE Transactions on Multimedia *
KAIMING HE ET AL: "Deep Residual Learning for Image Recognition", arXiv:1512.03385v1 *
LIU SHUCHUN ET AL: "Deep Practice of OCR: Text Recognition Based on Deep Learning", 31 May 2020 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869120A (en) * | 2021-08-26 | 2021-12-31 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering |
CN113869120B (en) * | 2021-08-26 | 2022-08-05 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10621971B2 (en) | Method and device for extracting speech feature based on artificial intelligence | |
CN111062871B (en) | Image processing method and device, computer equipment and readable storage medium | |
Wang et al. | Sketch-based 3d shape retrieval using convolutional neural networks | |
CN110503076B (en) | Video classification method, device, equipment and medium based on artificial intelligence | |
CN112016475B (en) | Human body detection and identification method and device | |
Guan et al. | On-device mobile landmark recognition using binarized descriptor with multifeature fusion | |
CN109740415A (en) | Vehicle attribute recognition methods and Related product | |
CN111680678B (en) | Target area identification method, device, equipment and readable storage medium | |
CN114238904A (en) | Identity recognition method, and training method and device of two-channel hyper-resolution model | |
CN111126358A (en) | Face detection method, face detection device, storage medium and equipment | |
EP4075328A1 (en) | Method and device for classifying and searching for a 3d model on basis of deep attention | |
CN110992404A (en) | Target tracking method, device and system and storage medium | |
CN116912924B (en) | Target image recognition method and device | |
Zong et al. | A cascaded refined rgb-d salient object detection network based on the attention mechanism | |
CN113191401A (en) | Method and device for three-dimensional model recognition based on visual saliency sharing | |
CN111310590B (en) | Action recognition method and electronic equipment | |
CN116993978A (en) | Small sample segmentation method, system, readable storage medium and computer device | |
CN111191065A (en) | Homologous image determining method and device | |
CN114155417B (en) | Image target identification method and device, electronic equipment and computer storage medium | |
CN115358777A (en) | Advertisement putting processing method and device of virtual world | |
CN112825145B (en) | Human body orientation detection method and device, electronic equipment and computer storage medium | |
CN115661821B (en) | Loop detection method, loop detection device, electronic apparatus, storage medium, and program product | |
CN117351246B (en) | Mismatching pair removing method, system and readable medium | |
CN116452741B (en) | Object reconstruction method, object reconstruction model training method, device and equipment | |
CN113191400B (en) | Method and device for retrieving corresponding three-dimensional model based on two-dimensional image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |