CN113191401A - Method and device for three-dimensional model recognition based on visual saliency sharing
- Publication number: CN113191401A (application number CN202110402748.4A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/2135: Pattern recognition; feature extraction, e.g. by transforming the feature space, based on approximation criteria such as principal component analysis
- G06F18/253: Pattern recognition; fusion techniques of extracted features
- G06N3/045: Neural networks; combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Neural networks; learning methods
Abstract
The application discloses a method and a device for three-dimensional model recognition based on visual saliency sharing. The method comprises the following steps: acquiring a three-dimensional model to be retrieved; acquiring a two-dimensional view sequence from the three-dimensional model to be retrieved; acquiring visual feature vectors of the two-dimensional view sequence; inputting the visual features into an MVCNN branch and a visual saliency branch, and fusing the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features; and retrieving or classifying the three-dimensional model to be retrieved using the fused features. Because the visual saliency branch uses an LSTM network, a representation of the three-dimensional model that takes all views into account and contains both global and dependency information can be extracted directly from the last cell state. This solves the information-loss problem of existing multi-view methods.
Description
Technical Field
The invention relates to the technical field of three-dimensional model retrieval and classification, and in particular to a method and a device for three-dimensional model recognition based on visual saliency sharing.
Background
In recent years, three-dimensional technology has become widespread in the film and television industry. Three-dimensional models have spread into every corner of daily life, so there is a natural need to explore more efficient ways of learning representations of 3D models. Furthermore, with the development of computer vision and three-dimensional reconstruction techniques, three-dimensional shape recognition has become a fundamental task in shape analysis and the most critical technique for processing and analyzing three-dimensional data. Thanks to the availability of powerful deep neural networks and large-scale labeled three-dimensional shape sets, deep networks for three-dimensional shape recognition have produced many research achievements, such as MVCNN, 3DShapeNets, PointNet and VoxNet.
Among current methods, view-based methods work best. The best-known example is the multi-view convolutional neural network (MVCNN), which combines the features of multiple two-dimensional projections learned by a convolutional neural network (CNN) in an end-to-end trainable manner. This approach has become a milestone for three-dimensional shape recognition and achieved state-of-the-art performance. To build deep learning models that unify three-dimensional object classification and retrieval, much work in the field has followed MVCNN.
The current problem with multi-view-based three-dimensional model classification and retrieval is:
In current methods, all views are treated equally when generating the shape descriptor, ignoring the similarities and differences between views. For example, in MVCNN, visual features pass through a view-merging layer to generate the shape descriptor, but that layer retains only the information of the view with the maximum value and discards the information of the other views, so the similarities and differences among the multiple views cannot be fully exploited. Accordingly, a technical solution is desired that overcomes, or at least alleviates, at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
It is an object of the present invention to provide a method for three-dimensional model identification based on visual saliency sharing that overcomes or at least alleviates at least one of the above-mentioned drawbacks of the prior art.
In one aspect of the present invention, a method for three-dimensional model recognition based on visual saliency sharing is provided, and the method for three-dimensional model recognition based on visual saliency sharing comprises:
acquiring a three-dimensional model to be retrieved;
acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
acquiring a visual feature vector of the two-dimensional view sequence;
inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
Optionally, the inputting the visual features into the MVCNN branch and the visual saliency branch, and the fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features includes:
respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
inputting the weights into the MVCNN branch, and generating three-dimensional model complex features by the MVCNN branch according to the weights and the visual feature vector of the two-dimensional view sequence;
feature fusing the visually significant features and the three-dimensional model complex features to form fused features.
Optionally, the MVCNN branch comprises a convolutional neural network with a view attention pooling layer.
Optionally, the visual saliency branch comprises two LSTM layers; feature learning is performed through the two LSTM layers and a soft attention mechanism. The first LSTM module and the soft attention mechanism weight each visual feature vector, and the resulting view weights are input to the MVCNN branch to guide the fusion that forms the three-dimensional model complex features.
Optionally, the obtaining a two-dimensional view sequence according to the three-dimensional model to be retrieved includes:
carrying out normalization processing on the three-dimensional model to be retrieved using the NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form the sequence of two-dimensional views.
Optionally, the extracting a set of views from the three-dimensional model to be retrieved includes:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
The application also provides a device for three-dimensional model recognition based on visual saliency sharing, which comprises:
the three-dimensional model acquisition module is used for acquiring a three-dimensional model to be retrieved;
the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
a visual feature vector acquisition module, configured to acquire a visual feature vector of the two-dimensional view sequence;
a fusion module for inputting the visual features into the MVCNN branch and the visual saliency branch, fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
The present application further provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method for three-dimensional model recognition based on visual saliency sharing as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program enabling, when being executed by a processor, a method for three-dimensional model identification based on visual saliency sharing as described above.
Advantageous effects
1. The method uses ResNet18 to extract features from the view sequence of a three-dimensional model. This network better optimizes the training of a deep network, and compared with other classical convolutional neural networks (such as AlexNet and VGG-Net), ResNet18 achieves a better balance between accuracy and storage cost.
2. The present application introduces an attention mechanism that allows the neural network to focus on specific parts of the input image, reducing task complexity and discarding irrelevant information. The invention adopts a soft attention mechanism, so the network keeps attending to all information of the three-dimensional object and learns where to pay more attention, making full use of inter-view correlation information while avoiding excessive computing cost.
3. Because the visual saliency branch uses an LSTM network, a representation of the three-dimensional model that takes all views into account and contains both global and dependency information can be extracted directly from the last cell state. This solves the information-loss problem of existing multi-view methods.
Drawings
Fig. 1 is a flowchart illustrating a method for three-dimensional model recognition based on visual saliency sharing according to a first embodiment of the present application.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing a method for three-dimensional model recognition based on visual saliency sharing provided according to one embodiment of the present application.
FIG. 3 is a graphical comparison of performance for different numbers of views of the method for three-dimensional model recognition based on visual saliency sharing shown in FIG. 1.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are a subset of the embodiments in the present application and not all embodiments in the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the description of the present application, it is to be understood that the terms "central," "longitudinal," "lateral," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings for convenience in describing the present application and for simplicity in description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated in a particular manner and are not to be considered limiting of the scope of the present application.
Fig. 1 is a flowchart illustrating a method for three-dimensional model recognition based on visual saliency sharing according to a first embodiment of the present application.
The method for three-dimensional model recognition based on visual saliency sharing as shown in fig. 1 comprises:
step 1: acquiring a three-dimensional model to be retrieved;
step 2: acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
Step 3: acquiring visual feature vectors of the two-dimensional view sequence;
Step 4: inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
Step 5: retrieving or classifying the three-dimensional model to be retrieved through the fused features.
In this embodiment, step 4: inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features includes:
step 41: respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
step 42: acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
step 43: inputting the weight into an MVCNN branch, and generating a three-dimensional model complex feature by the MVCNN branch according to the weight and a visual feature vector of a two-dimensional view sequence;
step 44: and performing feature fusion on the visual saliency features and the complex features of the three-dimensional model to form fused features.
In this embodiment, the MVCNN branch includes a convolutional neural network with a view attention pooling layer.
In this embodiment, the visual saliency branch comprises two LSTM layers; feature learning is performed through the two LSTM layers and the soft attention mechanism. The first LSTM module and the soft attention mechanism weight each visual feature vector, and the view weights are input to the MVCNN branch to guide the fusion that forms the complex features of the three-dimensional model.
In this embodiment, acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved includes:
carrying out normalization processing on the three-dimensional model to be retrieved using the NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form a sequence of two-dimensional views.
In this embodiment, the extracting a set of views from the three-dimensional model to be retrieved includes:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
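A minimal sketch of this view-sampling step (the helper function name and pure-Python form are illustrative assumptions; the actual rendering of the views from the 3D model is not shown):

```python
# Sketch: compute the evenly spaced camera azimuths around the Z axis used
# to capture the two-dimensional view sequence (30-degree steps -> 12 views).
# A renderer would then photograph the model at each azimuth.

def view_azimuths(num_views=12):
    """Return evenly spaced azimuth angles in degrees around the Z axis."""
    step = 360.0 / num_views  # 30 degrees when num_views == 12
    return [i * step for i in range(num_views)]

angles = view_azimuths()
print(angles[:4])  # [0.0, 30.0, 60.0, 90.0]
```

Changing `num_views` yields the alternative view counts explored in the experiments (e.g. 2, 4, 6, 8, 10, 12, 20 views).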
The present application is described in further detail below by way of examples, it being understood that the examples do not constitute any limitation to the present application.
The dual-stream network based on visual saliency sharing mainly comprises two branches. The first is the visual saliency branch, which defines view weights based on the similarity and difference information of the multiple views and guides visual information fusion in the MVCNN model. The second is the multi-view convolutional neural network (MVCNN) branch, which extracts visual information from the captured views.
Step 1: acquire the three-dimensional model to be retrieved.
Step 2: extract two-dimensional views from each three-dimensional model. The views are obtained by photographing the three-dimensional model at intervals of 30° with its Z-axis as the rotation axis, yielding a sequence of 12 views, i.e. the two-dimensional view sequence.
Step 3: acquire the visual feature vectors of the two-dimensional view sequence. Specifically, the 4096-dimensional visual feature vector of each view is extracted through the convolutional neural network ResNet18.
Step 4: input the visual features into the MVCNN branch and the visual saliency branch, and fuse the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features. Specifically, the visual feature vectors are input into the visual saliency branch and the MVCNN branch of the network, respectively. In the visual saliency branch, each visual feature vector in V = {v1, …, vn} is weighted using the LSTM layer and the soft attention mechanism, thereby assigning a weight to each view of the model. The view weights produced in the visual saliency branch are then passed into the MVCNN branch to guide visual information fusion: the view weights and the view feature vectors extracted in step 3 pass through the MVCNN, which contains an attention pooling layer, to obtain the feature vector of the MVCNN branch.
Step 5: retrieve or classify the three-dimensional model to be retrieved through the fused features. Specifically, the visual saliency features output by the visual saliency branch are fused with the complex features output by the MVCNN branch to obtain the final features of the model for classification and retrieval.
The operation of weighting each visual feature vector with the LSTM layer and the soft attention mechanism in step 4 is as follows:
1) LSTM is a special recurrent neural network that maintains a hidden state h_t and an internal memory state c_t. The hidden state h_t is computed from the memory state c_t through the output gate:
h_t = o_t ⊙ c_t
where ⊙ denotes element-wise multiplication. The output gate o_t is calculated as:
o_t = σ(U_0[h_{t-1}, v_{i,t}] + b_0)
where σ is the logistic sigmoid function, v_{i,t} is the feature vector of the i-th virtual view at time t, and U_0 and b_0 are the weight matrix (applied to the previous hidden state and the input) and the bias of the output gate. The current memory state c_t is determined by the previous memory state c_{t-1} and the candidate memory c̃_t:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
where the forget gate f_t and the input gate i_t are computed in turn as:
f_t = σ(U_f[h_{t-1}, v_{i,t}] + b_f)
i_t = σ(U_i[h_{t-1}, v_{i,t}] + b_i)
The candidate memory c̃_t is:
c̃_t = tanh(U_C[h_{t-1}, v_{i,t}] + b_C)
where U_C and b_C are the corresponding weight matrix and bias.
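The LSTM gate computations described above can be sketched in plain Python. This is a didactic scalar (one-dimensional) single-step cell: the weight values are arbitrary illustrative assumptions, not the patent's trained parameters, and h_t = o_t ⊙ c_t follows the text above.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, c_prev, v, U, b):
    """One LSTM step over the concatenated input [h_prev, v] (scalar case).

    U and b hold one (u_h, u_v) weight pair and one bias per gate:
    'o' output, 'f' forget, 'i' input, 'c' candidate memory.
    """
    z = {g: U[g][0] * h_prev + U[g][1] * v + b[g] for g in U}
    o = sigmoid(z["o"])           # output gate o_t
    f = sigmoid(z["f"])           # forget gate f_t
    i = sigmoid(z["i"])           # input gate i_t
    c_tilde = math.tanh(z["c"])   # candidate memory
    c = f * c_prev + i * c_tilde  # c_t = f_t (.) c_{t-1} + i_t (.) candidate
    h = o * c                     # h_t = o_t (.) c_t, as in the text
    return h, c

# Illustrative parameters (assumed, not trained):
U = {"o": (0.5, 0.5), "f": (0.1, 0.9), "i": (0.3, 0.7), "c": (0.2, 0.8)}
b = {"o": 0.0, "f": 0.0, "i": 0.0, "c": 0.0}
h, c = lstm_step(0.0, 0.0, 1.0, U, b)
```

In the patent's setting v would be a 4096-dimensional view feature and the gates matrix-valued; the scalar form only mirrors the equations.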
The view weight a_i is calculated using the attention mechanism as:
e_i = w^T tanh(U_h[h_{t-1}, v_{i,t}] + b_v)
where a_i is the softmax-normalized score, a_i = exp(e_i) / Σ_j exp(e_j), so that the weights over all views sum to 1.
the operation of transmitting the view weight in the step 4 into the MVCNN branch to guide the visual fusion specifically includes:
1) using the average of the dynamically weighted sum of the multi-view feature vectors, thereby
Where N is the number of input views and V is V1,...,vnIs a set of visual features of a three-dimensional object. After focusing on the attention, ψ (V) is output.
2) The feature vector of the MVCNN branch is then obtained through the convolutional neural network.
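The attention weighting and pooling above can be sketched in plain Python. This is a toy example: the feature values and scores are arbitrary illustrative numbers, and real view features would be 4096-dimensional ResNet18 vectors.

```python
import math

def attention_pool(views, scores):
    """psi(V): average of the dynamically weighted sum of view features.

    views  - list of per-view feature vectors
    scores - unnormalized attention scores e_i, one per view
    Returns the softmax weights a_i and the pooled feature psi(V).
    """
    n = len(views)
    exp_e = [math.exp(e) for e in scores]
    total = sum(exp_e)
    a = [x / total for x in exp_e]  # a_i = softmax(e_i), sums to 1
    dim = len(views[0])
    pooled = [sum(a[i] * views[i][d] for i in range(n)) / n
              for d in range(dim)]
    return a, pooled

views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D features, 3 views
scores = [0.2, 0.1, 0.7]                      # assumed attention scores
a, pooled = attention_pool(views, scores)
```

With equal scores the weights become uniform and ψ(V) reduces to the plain average of the view features, which matches the intuition that attention departs from uniform pooling only when some views are more salient.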
In summary, the embodiment of the present invention extracts the feature information of the three-dimensional model through the above steps and, through the attention mechanism, fully attends to the difference and correlation information between views, so that the feature vectors describe the three-dimensional model more comprehensively, information loss is avoided, and the recognition, classification and retrieval of three-dimensional models are more accurate and scientific.
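The fusion of the visual saliency features with the MVCNN complex features described above can be sketched as simple feature concatenation (concatenation is an illustrative assumption; the patent states only that the two feature vectors are fused into the final descriptor):

```python
def fuse_features(saliency_feat, mvcnn_feat):
    """Fuse the visual saliency feature with the MVCNN complex feature.

    Concatenation is one common fusion choice; the fused vector is then
    fed to a classifier or used directly as the retrieval descriptor.
    """
    return list(saliency_feat) + list(mvcnn_feat)

fused = fuse_features([0.1, 0.2], [0.3, 0.4, 0.5])
# fused has len(saliency_feat) + len(mvcnn_feat) dimensions
```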
Example 2:
the feasibility of the protocol of example 1 is verified below with reference to specific examples, which are described in detail below:
We used the ModelNet40 and ShapeNetCore55 datasets to evaluate the performance of the method in three-dimensional shape recognition and retrieval. ModelNet40 is a subset of ModelNet containing a total of 12311 CAD models from 40 classes. The models were adjusted manually, but no pose normalization was performed. The training and testing subsets of ModelNet40 contain 9843 and 2468 models, respectively. ShapeNetCore55 is a subset of ShapeNet that contains approximately 51300 three-dimensional models in 55 common classes, each subdivided into several subclasses. ShapeNetCore is divided into three parts: training, validation and test sets in proportions of 70%, 10% and 20%, respectively. The models are in OBJ format, and two dataset versions are provided: a consistently aligned (regular) dataset, and a more challenging dataset in which the models are perturbed by random rotations.
In this approach, we introduce a visual saliency model (comprising two LSTM layers and one soft attention network) to account for structural and dependency information across the multiple captured views and to guide the pooling of visual features in the MVCNN branch. To verify the effectiveness of our approach, we designed experiments for each component of the network. As shown in the table below, we use different parts of the network to perform three-dimensional shape classification on the ModelNet40 dataset to verify the validity of the attention weights. The relevant experimental results are shown in the first and second rows of the table. The results show that our attention weights focus the model on more representative views, resulting in better performance in three-dimensional shape recognition. It can be seen that a design that treats the captured views as a view sequence and utilizes their structural information is reasonable and feasible for three-dimensional shape recognition. The experimental results demonstrate that the proposed network architecture obtains a better three-dimensional model representation.
Table 1 shows the effect of different parts of the network structure on classification tasks
To verify the efficiency of the proposed network, we performed 3D shape classification and retrieval experiments on the Princeton ModelNet dataset.
For the dataset, we followed the same ModelNet40 training and test split as Wu et al. In the experiments, we compared the dual-stream model with various models based on different representations, including a volume-based model (3D ShapeNets by Wu et al.), manual descriptors of multi-view data (SPH by Kazhdan et al. and LFD by Chen et al.), deep learning models for multi-view data (MVCNN by Su et al. and MVCNN-MultiRes by Qi et al.), point cloud-based models (PointNet and PointNet++ by Qi et al., KD-Network by Klokov et al., PointCNN by Li et al. and DGCNN by Wang et al.), and a panoramic-view-based model (PANORAMA-NN by Sfikas et al.). The following table provides the classification and retrieval results for all compared methods. The results show that the classification accuracy of the invention is higher.
TABLE 2 comparison of Classification accuracy of the methods on the ModelNet40 dataset
On ShapeNetCore55, macro-averaged evaluation metrics are used to provide an unweighted average over the entire dataset, while micro-averaged metrics adjust for the size of the model classes and thus provide a representative performance average across classes. Evaluation code for calculating all these metrics is provided on the official SHREC contest website.
We obtained results for the pose-normalized and perturbed retrieval experiments on the ShapeNetCore dataset, respectively. The comparison was performed against the higher-precision methods from the large-scale 3D shape retrieval tracks on ShapeNetCore55 in SHREC2016 and SHREC2017. We tested and verified the superiority of the method on the ModelNet dataset and the ShapeNetCore55 dataset, and the retrieval results in Table 3 illustrate its robustness. From Table 3 below, we can see that the method is superior to the other methods, including RotationNet, in terms of the macro-averaged F-score, mAP and NDCG metrics. On the micro-averaged metrics, the method outperforms almost all other methods and is always very close to the best results on this dataset.
TABLE 3 retrieval accuracy as measured by mAP, F-SCORE and NDCG on SHAPENETCORE55 dataset
The number of views rendered from a three-dimensional object may affect retrieval and classification performance, so we conducted comparative experiments to select the optimal number of views. We set up the virtual camera array at an angle θ around the Z-axis. With θ set to 180°, 90°, 60°, 45°, 36°, 30° and 18°, respectively, 2, 4, 6, 8, 10, 12 and 20 views are generated for each three-dimensional object.
As shown in Table 4, when the number of views was set to 12, NN, FT, ST, DCG, F-score and ANMRR improved by 15.8% to 46.7%, 11.8% to 118.8%, 17.0% to 71.5%, 18.0% to 52.4%, 12.0% to 95.6% and 43.6% to 77.9%, respectively, compared with the other numbers of views. Therefore, we set the optimal number of views to 12.
TABLE 4 Performance of different view quantities on ModelNet40
Intuitively, the view capture order directly affects three-dimensional object feature learning. To verify whether our method is limited to a particular view order, we randomly shuffled the view order of the multi-view sequence 50 times during testing to verify the robustness of the proposed network. The retrieval and classification results are shown in the following table.
TABLE 5 Performance comparison of different view orders on ModelNet40
The results with out-of-order input views are even better than those with in-order input views. This shows that the present invention can adaptively compute the importance of each view and, with the help of this information, exploit the structural and visual information across the multiple captured views.
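The shuffled-order robustness test described above can be sketched as follows, with a placeholder `model_fn` standing in for the trained network's forward pass:

```python
import random

def shuffled_order_trials(views, model_fn, trials=50, seed=0):
    """Run `model_fn` on `trials` random permutations of the view sequence,
    mimicking the shuffled-order robustness test described above.
    `model_fn` is a placeholder for the trained network's forward pass."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        shuffled = list(views)
        rng.shuffle(shuffled)
        outputs.append(model_fn(shuffled))
    return outputs

# An order-invariant model (e.g. one that pools over views) gives the same
# prediction for every permutation; here a simple sum stands in for it.
results = shuffled_order_trials(list(range(12)), model_fn=sum)
print(len(results), len(set(results)))  # 50 1
```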
The application also provides an apparatus for three-dimensional model recognition based on visual saliency sharing, which comprises a three-dimensional model acquisition module, a two-dimensional view sequence acquisition module, a visual feature vector acquisition module, a fusion module, and a classification or retrieval module. The three-dimensional model acquisition module is used for acquiring the three-dimensional model to be retrieved; the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence from the three-dimensional model to be retrieved; the visual feature vector acquisition module is used for acquiring the visual feature vectors of the two-dimensional view sequence; the fusion module is used for inputting the visual feature vectors into the MVCNN branch and the visual saliency branch and fusing the complex features of the MVCNN branch with the visual saliency features of the visual saliency branch to form fused features; and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved using the fused features.
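As a rough sketch of how a fusion module of this kind might combine the two branches, the toy code below weights per-view features with a softmax attention and concatenates the weighted sum with a saliency-style summary. The scoring and pooling rules here are invented stand-ins, not the patent's actual LSTM/MVCNN computation:

```python
import numpy as np

def soft_attention_weights(view_features):
    """Softmax over one scalar score per view; a stand-in for the view
    weighting produced by the LSTM + soft attention branch (the scoring
    rule here is an invented placeholder)."""
    scores = view_features.mean(axis=1)  # one scalar per view
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

def fuse(view_features):
    """Concatenate an attention-weighted sum (MVCNN-branch stand-in) with a
    max-pooled summary (visual-saliency stand-in) to form a fused feature."""
    w = soft_attention_weights(view_features)              # shape (n_views,)
    complex_feat = (w[:, None] * view_features).sum(axis=0)
    saliency_feat = view_features.max(axis=0)
    return np.concatenate([complex_feat, saliency_feat])

feats = np.random.default_rng(0).normal(size=(12, 8))  # 12 views, 8-d features
fused = fuse(feats)
print(fused.shape)  # (16,)
```

The design point the sketch illustrates is that the fused descriptor carries both a view-importance-weighted aggregate and a saliency summary, so the downstream classifier or retrieval index sees information from both branches.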
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and are not repeated herein.
The present application further provides an electronic device comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, the processor when executing the computer program implementing the above method for three-dimensional model recognition based on visual saliency sharing.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method for three-dimensional model recognition based on visual saliency sharing.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing a method for three-dimensional model recognition based on visual saliency sharing provided according to one embodiment of the present application.
As shown in fig. 2, the electronic device includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processor 503, the memory 504 and the output interface 505 are connected to each other through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and thereby to the other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits it to the central processor 503 through the input interface 502; the central processor 503 processes the input information based on computer-executable instructions stored in the memory 504 to generate output information, stores the output information temporarily or permanently in the memory 504, and then transmits it to the output device 506 through the output interface 505; the output device 506 outputs the output information outside the electronic device for use by the user.
That is, the electronic device shown in fig. 2 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors which, when executing the computer-executable instructions, may implement the method for three-dimensional model recognition based on visual saliency sharing described in connection with fig. 1.
In one embodiment, the electronic device shown in fig. 2 may be implemented to include: a memory 504 configured to store executable program code; one or more processors 503 configured to execute executable program code stored in the memory 504 to perform the method for three-dimensional model recognition based on visual saliency sharing in the above-described embodiments.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Furthermore, it will be obvious that the term "comprising" does not exclude other elements or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or overall device by software or hardware. The terms first, second, etc. are used to identify names, but not any particular order.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks identified in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The Processor in this embodiment may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store computer programs and/or modules, and the processor may implement the various functions of the apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
In this embodiment, if the modules/units integrated in the apparatus/terminal device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method in the embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (9)
1. A method for three-dimensional model identification based on visual saliency sharing, characterized in that the method for three-dimensional model identification based on visual saliency sharing comprises:
acquiring a three-dimensional model to be retrieved;
acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
acquiring a visual feature vector of the two-dimensional view sequence;
inputting the visual features into the MVCNN branch and the visual saliency branch, and fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
2. The method for three-dimensional model recognition based on visual saliency sharing of claim 1, wherein said inputting the visual features into an MVCNN branch and a visual saliency branch, fusing complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features comprises:
respectively inputting visual feature vectors of the two-dimensional view sequence into the MVCNN branch and the visual saliency branch;
acquiring weight and visual saliency characteristics according to the visual feature vector and the visual saliency branches of the two-dimensional view sequence;
inputting the weights into the MVCNN branch, and generating three-dimensional model complex features by the MVCNN branch according to the weights and the visual feature vector of the two-dimensional view sequence;
feature fusing the visually significant features and the three-dimensional model complex features to form fused features.
3. The method for three-dimensional model identification based on visual saliency sharing of claim 2, wherein said MVCNN branch comprises a convolutional neural network with a view attention pooling layer.
4. The method for three-dimensional model identification based on visual saliency sharing of claim 3, wherein said visual saliency branch comprises two LSTM layers, and feature learning is performed through said two LSTM layers and a soft attention mechanism; a first LSTM module and the soft attention mechanism are employed to weight each visual feature vector, and the view weights are input to the MVCNN branch to guide fusion to form said three-dimensional model complex features.
5. The method for three-dimensional model recognition based on visual saliency sharing of claim 4, wherein said obtaining a sequence of two-dimensional views from said three-dimensional model to be retrieved comprises:
normalizing the three-dimensional model to be retrieved using an NPCA method;
a set of views is extracted from the three-dimensional model to be retrieved to form the sequence of two-dimensional views.
6. The method for three-dimensional model recognition based on visual saliency sharing of claim 5, wherein said extracting a set of views from a three-dimensional model to be retrieved comprises:
extraction is performed at intervals of 30° around the Z-axis of the three-dimensional model, thereby extracting 12 two-dimensional views, which together constitute the two-dimensional view sequence.
7. An apparatus for three-dimensional model recognition based on visual saliency sharing, characterized in that the apparatus for three-dimensional model recognition based on visual saliency sharing comprises:
the three-dimensional model acquisition module is used for acquiring a three-dimensional model to be retrieved;
the two-dimensional view sequence acquisition module is used for acquiring a two-dimensional view sequence according to the three-dimensional model to be retrieved;
a visual feature vector acquisition module, configured to acquire a visual feature vector of the two-dimensional view sequence;
a fusion module for inputting the visual features into the MVCNN branch and the visual saliency branch, fusing the complex features in the MVCNN branch with the visual saliency features in the visual saliency branch to form fused features;
and the classification or retrieval module is used for retrieving or classifying the three-dimensional model to be retrieved through the fusion characteristics.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for three-dimensional model recognition based on visual saliency sharing of any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is capable of carrying out the method for three-dimensional model recognition based on visual saliency sharing according to one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110402748.4A CN113191401A (en) | 2021-04-14 | 2021-04-14 | Method and device for three-dimensional model recognition based on visual saliency sharing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113191401A (en) | 2021-07-30
Family
ID=76973976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110402748.4A (Pending) | Method and device for three-dimensional model recognition based on visual saliency sharing | 2021-04-14 | 2021-04-14
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191401A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869120A (en) * | 2021-08-26 | 2021-12-31 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298262A (en) * | 2019-06-06 | 2019-10-01 | Huawei Technologies Co., Ltd. | Object identification method and device |
CN110347873A (en) * | 2019-06-26 | 2019-10-18 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Video classification method, apparatus, electronic device and storage medium |
CN111242207A (en) * | 2020-01-08 | 2020-06-05 | Tianjin University | Three-dimensional model classification and retrieval method based on visual saliency information sharing |
CN111582342A (en) * | 2020-04-29 | 2020-08-25 | Tencent Technology (Shenzhen) Co., Ltd. | Image identification method, device, equipment and readable storage medium |
Non-Patent Citations (3)
Title |
---|
CHAO MA ET AL: "Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval", IEEE Transactions on Multimedia *
KAIMING HE ET AL: "Deep Residual Learning for Image Recognition", arXiv:1512.03385v1 *
LIU SHUCHUN ET AL: "Deep Practice of OCR: Text Recognition Based on Deep Learning", 31 May 2020 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869120A (en) * | 2021-08-26 | 2021-12-31 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering |
CN113869120B (en) * | 2021-08-26 | 2022-08-05 | Northwest University | Aggregation convolution three-dimensional model classification method based on view filtering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10621971B2 (en) | Method and device for extracting speech feature based on artificial intelligence | |
CN111062871B (en) | Image processing method and device, computer equipment and readable storage medium | |
Wang et al. | Sketch-based 3d shape retrieval using convolutional neural networks | |
CN110503076B (en) | Video classification method, device, equipment and medium based on artificial intelligence | |
CN112016475B (en) | Human body detection and identification method and device | |
Guan et al. | On-device mobile landmark recognition using binarized descriptor with multifeature fusion | |
CN109740415A (en) | Vehicle attribute recognition methods and Related product | |
CN111680678B (en) | Target area identification method, device, equipment and readable storage medium | |
CN114238904A (en) | Identity recognition method, and training method and device of two-channel hyper-resolution model | |
CN111126358A (en) | Face detection method, face detection device, storage medium and equipment | |
EP4075328A1 (en) | Method and device for classifying and searching for a 3d model on basis of deep attention | |
CN110992404A (en) | Target tracking method, device and system and storage medium | |
CN116912924B (en) | Target image recognition method and device | |
Zong et al. | A cascaded refined rgb-d salient object detection network based on the attention mechanism | |
CN113191401A (en) | Method and device for three-dimensional model recognition based on visual saliency sharing | |
CN111310590B (en) | Action recognition method and electronic equipment | |
CN116993978A (en) | Small sample segmentation method, system, readable storage medium and computer device | |
CN111191065A (en) | Homologous image determining method and device | |
CN114155417B (en) | Image target identification method and device, electronic equipment and computer storage medium | |
CN115358777A (en) | Advertisement putting processing method and device of virtual world | |
CN112825145B (en) | Human body orientation detection method and device, electronic equipment and computer storage medium | |
CN115661821B (en) | Loop detection method, loop detection device, electronic apparatus, storage medium, and program product | |
CN117351246B (en) | Mismatching pair removing method, system and readable medium | |
CN116452741B (en) | Object reconstruction method, object reconstruction model training method, device and equipment | |
CN113191400B (en) | Method and device for retrieving corresponding three-dimensional model based on two-dimensional image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |