CN111625667A - Three-dimensional model cross-domain retrieval method and system based on complex background image


Info

Publication number
CN111625667A
CN111625667A (Application CN202010417173.9A)
Authority
CN
China
Prior art keywords
dimensional model
image
view
network
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010417173.9A
Other languages
Chinese (zh)
Inventor
Li Haisheng
Du Yujia
Li Yong
Yao Chunlian
Li Nan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202010417173.9A
Publication of CN111625667A
Legal status: Pending


Classifications

    • G06F16/532 Query formulation, e.g. graphical querying (information retrieval of still image data; querying)
    • G06F16/538 Presentation of query results (information retrieval of still image data; querying)
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content (information retrieval of still image data)
    • G06F18/22 Matching criteria, e.g. proximity measures (pattern recognition; analysing)
    • G06N3/045 Combinations of networks (neural networks; architecture)
    • G06N3/08 Learning methods (neural networks)

Abstract

The invention discloses a three-dimensional model cross-domain retrieval method and system based on complex background images. The method designs a cross-domain retrieval triplet deep network: an accurate image feature extraction network and a three-dimensional model grouped-view feature extraction network perform effective feature extraction on the input data, and a joint feature embedding space is constructed that maps features from different domains into the same high-dimensional space, so that features of same-class data lie closer together and features of different-class data lie farther apart. Finally, the similarity between an image and a three-dimensional model is measured with the Euclidean distance in the joint embedding space to complete cross-domain retrieval. Given a single input RGB image with complex background information, the invention retrieves the corresponding three-dimensional model.

Description

Three-dimensional model cross-domain retrieval method and system based on complex background image
Technical Field
The invention relates to the fields of computer graphics and computer vision, and in particular to a three-dimensional model cross-domain retrieval method and system based on complex background images.
Background
The arrival of the information age has strongly driven the development of computer hardware, and media data of all kinds, such as audio, video, images and three-dimensional data, are growing explosively. Three-dimensional models are now widely used in computer graphics and computer vision, for example in 3D printing, computer-aided design, film animation and medical diagnosis. To cope with the huge and growing volume of three-dimensional data involved in these applications, designing fast and effective three-dimensional model retrieval methods has become a hot research problem.
Most current retrieval work is example-based three-dimensional model retrieval: a query three-dimensional model must be provided and represented by voxels, point clouds, meshes or multi-view methods; a feature descriptor is extracted and compared for similarity against the model feature descriptors in a three-dimensional model library to return similar three-dimensional models. Example-based three-dimensional model retrieval is a same-domain retrieval problem, and because the three-dimensional model contains rich feature information, its retrieval accuracy is relatively high. In real life, however, a three-dimensional model to use as the query is not easy to obtain, whereas a two-dimensional image is easy to acquire in practical applications, so retrieving three-dimensional models from a single two-dimensional image has important research significance and practical value.
Retrieving three-dimensional models from two-dimensional images is a cross-domain retrieval problem: the input can be an RGB image, a hand-drawn sketch or an RGB-D image, and the output is the three-dimensional model corresponding to the image. Current research can be divided into traditional retrieval methods based on hand-crafted features and retrieval methods based on deep-learned features. Hand-crafted methods obtain low-level descriptors of the image and the three-dimensional model through manual design and then measure similarity by distance computation, e.g. methods based on the bag-of-features model (Bronstein A M, Bronstein M M, Guibas L J, et al. Shape google: Geometric words and expressions for invariant shape retrieval [J]. ACM Transactions on Graphics, 2011, 30(1): 1-20.) and methods based on Gabor local line features (Eitz M, Richter R, Boubekeur T, et al. Sketch-based shape retrieval [J]. ACM Transactions on Graphics, 2012.). However, such methods struggle in the feature extraction stage and do not scale to large datasets.
Deep learning is a subfield of machine learning. Since deep learning, represented by convolutional neural networks, won the ImageNet competition in 2012, deep networks have attracted great attention in computer vision. The advent of various 3D sensors, such as Microsoft Kinect and Google Project Tango, has made three-dimensional models easier to acquire, and many large-scale public three-dimensional model datasets exist, such as ShapeNet and ModelNet. Compared with three-dimensional model datasets, two-dimensional image datasets such as ImageNet are larger and more diverse. With such abundant data, extending deep learning to three-dimensional data processing has become a research hotspot, and some results have been obtained in applying deep learning to cross-domain model retrieval. The idea is to use a deep neural network to first obtain feature representations of the image and the three-dimensional model, then construct a shared cross-domain space and compare the distances between the feature descriptors of the two modalities to complete retrieval. A deep neural network can quickly learn effective feature representations from large amounts of data; compared with traditional hand-crafted methods, deep-learning-based cross-domain retrieval of three-dimensional models is widely applicable and greatly improves retrieval performance. For example, Wang et al. used siamese networks for sketch-based three-dimensional model retrieval (Wang F, Kang L, Li Y. Sketch-based 3D shape retrieval using Convolutional Neural Networks [C]. Computer Vision and Pattern Recognition, 2015: 1875-1883.). The DCML method of Dai et al. applies metric learning with discriminative and correlation loss functions to the features obtained from sketches and three-dimensional models (Dai G, Xie J, Zhu F, et al. Deep Correlated Metric Learning for Sketch-based 3D Shape Retrieval [C]// Thirty-First AAAI Conference on Artificial Intelligence, 2017: 4002-4008.). On this basis they proposed the DCHML method, which adds loss terms on the hidden layers of the neural network to improve retrieval performance (Dai G, Xie J, Fang Y, et al. Deep Correlated Holistic Metric Learning for Sketch-Based 3D Shape Retrieval [J]. IEEE Transactions on Image Processing, 2018, 27(7): 3374-3386.). Li et al. filter background noise in images with a pre-trained image convolutional neural network and propose a cross-domain embedding-space framework to reduce the feature gap between images and models (Li Y, Su H, Qi C R, et al. Joint embeddings of shapes and images via CNN image purification [J]. ACM Transactions on Graphics (TOG), 2015, 34(6): 1-12.). However, most existing work retrieves three-dimensional models from hand-drawn sketches as input, while in real life people are more often exposed to images of real environments, which usually carry complex background information including illumination and background pixels outside the retrieval object. Such noise information is irrelevant to the retrieval task; mixed with the effective information, it poses new challenges for cross-domain retrieval.
Therefore, directly applying sketch-to-three-dimensional-model cross-domain retrieval methods to the task of retrieving three-dimensional models from real images reduces retrieval accuracy because of the noise information in real images. Although a convolutional neural network can filter part of this noise, the effect is very limited, so designing an image convolutional neural network that filters the noise information of real images and applying it to the cross-domain retrieval of images and three-dimensional models is of great significance. A three-dimensional model itself contains rich information, and the mainstream approach in current research is to represent it with a group of multi-angle projection views and use mature image deep networks for feature extraction, such as the MVCNN method of Su et al. (Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition [C]// Proceedings of the IEEE International Conference on Computer Vision, 2015: 945-953.). Because views are similar to images, multi-view methods reduce the semantic gap between the image and the three-dimensional model in the feature extraction stage and therefore work well. Su et al. obtain feature descriptors of a three-dimensional model with MVCNN and perform cross-domain retrieval by aligning the feature distributions of the image and the three-dimensional model (Su Y, Li Y, Nie W, et al. Joint Heterogeneous Feature Learning and Distribution Alignment for 2D Image-Based 3D Object Retrieval [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019: 1-1.). Wu et al. project a three-dimensional model into multiple views and design a convolutional neural network to jointly analyze image and three-dimensional model features (Wu Z, Zhang Y, Zeng M, et al. Joint analysis of shapes and images via deep domain adaptation [J]. Computers & Graphics, 2018: 140-147.). However, such retrieval methods fuse the individual projection-view features directly, so a large amount of information is lost in representing the three-dimensional model. Moreover, when the cross-domain joint feature embedding space is constructed through feature learning, the semantic gap between real images and "clean" projection images also reduces retrieval accuracy. How to reduce feature loss, extract more complete and effective three-dimensional model features, and reduce the semantic gap between data of different modalities therefore remains a difficult problem for high-precision image-to-three-dimensional-model retrieval.
In summary, the prior art has the following disadvantages: (1) Most existing three-dimensional model retrieval work is example-based; cross-domain retrieval between three-dimensional models and data of other modalities is relatively underexplored, and a high-precision technique for retrieving three-dimensional models from real images with complex background information is lacking. (2) Existing image-to-three-dimensional-model cross-domain retrieval techniques usually ignore the filtering of real-image noise information, so the acquired image features contain invalid information and exhibit a large semantic gap with respect to the three-dimensional model projection-view features, producing large errors in cross-domain retrieval from complex-background real images to three-dimensional models. (3) Multi-view-based three-dimensional model representation currently performs well in cross-domain image-model retrieval tasks, but such methods tend to fuse the individual projection-view features directly, causing a large information loss and limiting retrieval accuracy.
In short, existing image-to-three-dimensional-model cross-domain retrieval technology lacks accurate feature extraction for real images that filters out noise information irrelevant to the retrieval task, and loses considerable three-dimensional model information when representing the model with multiple views, both of which degrade cross-domain retrieval accuracy.
Disclosure of Invention
The technical problem solved by the invention: overcoming the lack of accurate feature extraction for real images and the loss of three-dimensional model information in prior cross-domain three-dimensional model retrieval, a three-dimensional model cross-domain retrieval method and system based on complex background images are provided. A cross-domain retrieval triplet deep network is designed to construct a joint feature embedding space that reduces the distribution gap between the features of data of different modalities; retrieval accuracy is improved by focusing on effective feature extraction from the three-dimensional model and from RGB images with complex background information, enabling retrieval of similar three-dimensional models from a single RGB image.
The technical solution adopted by the invention is as follows: a three-dimensional model cross-domain retrieval method based on complex background images, comprising the following steps:
(1) construct an original dataset D comprising several different three-dimensional models M and images I with complex backgrounds, the original dataset D being represented by triplets T = (I_A, M_pos, M_neg), where I_A denotes an image serving as the anchor, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
(2) preprocess the image I_A in the triplet T to obtain the processed image I'_A; apply projection processing to the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain the positive projection view group V_pos and the negative projection view group V_neg, and preprocess them respectively to obtain the processed positive projection view group V'_pos and negative projection view group V'_neg, yielding a standard dataset D' represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
(3) for the processed triplets T', construct a cross-domain retrieval triplet deep network model N. The cross-domain retrieval triplet deep network comprises 3 branch networks: 1 accurate image feature extraction network N_I and 2 three-dimensional model grouped-view feature extraction networks N_M with identical structure and shared weights. The input of the accurate image feature extraction network N_I is the processed image I'_A of the processed triplet T', and its output is the image feature vector F_A. The inputs of the three-dimensional model grouped-view feature extraction networks N_M are the processed positive projection view group V'_pos and the negative projection view group V'_neg of the processed triplet T', and their outputs are the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg, respectively. The triplet network is a fusion of deep learning and metric learning and can directly learn the mapping from the sample space to a compact Euclidean space, thereby constructing a joint feature embedding space in which data of different modalities can be measured. Moreover, by contrasting the differences of two inputs, the triplet network models details better and thus improves retrieval accuracy;
(4) apply regularization to the image feature vector F_A, the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg to obtain the regularized image feature vector F'_A, regularized positive three-dimensional model feature vector F'_pos and regularized negative three-dimensional model feature vector F'_neg, and define the loss function L of the cross-domain retrieval triplet deep network model N;
(5) using the processed triplets T', iteratively train the parameters of the cross-domain retrieval triplet deep network model N until the loss function L falls below a set threshold, then stop training, obtaining the trained cross-domain retrieval triplet deep network model N', which comprises 3 trained branch networks: the trained accurate image feature extraction network N'_I and two trained three-dimensional model grouped-view feature extraction networks N'_M. The trained model N' completes the construction of the joint feature embedding space of images and three-dimensional models and provides a measurement basis over data of different domains for retrieval tasks;
(6) when executing a retrieval task, given a query image q and a target three-dimensional model set S, preprocess the query image q to obtain the processed query image q', apply projection processing to each target three-dimensional model S_i in the target set S to obtain its projection view group SV_i, and preprocess it to obtain the processed projection view group SV'_i; input the processed query image q' into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularize it to obtain the regularized image feature vector F'_q; input each processed projection view group SV'_i into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_Si corresponding to the target model S_i, and regularize it to obtain the regularized three-dimensional model feature vector F'_Si; compute the distance D(q, S_i) between the regularized image feature vector F'_q and each regularized three-dimensional model feature vector F'_Si, measure by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sort in descending order of similarity, and select several top-ranked target three-dimensional models S_top as retrieval results similar to the query image q and output them.
In step (2) and step (6), the preprocessing comprises size unification and grayscale conversion. Because the network structure contains fully connected layers, the input data must be resized to a uniform size; size unification resizes the image or the projection views of the three-dimensional model to the same dimensions, and grayscale conversion turns the RGB color image into a grayscale image to eliminate the interference of image color on the retrieval task.
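As an illustration, the preprocessing can be written as a short transform pipeline; the sketch below assumes a PyTorch/torchvision environment and the 256 x 256 target size used in the detailed description (the function name is illustrative):

```python
# Illustrative preprocessing sketch (assumed PyTorch/torchvision environment).
from PIL import Image
from torchvision import transforms

# Resize to a uniform 256x256 (required by the fully connected layers) and
# convert RGB to grayscale to remove the interference of color information.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])

def preprocess_image(path: str):
    """Load an RGB image or projection view and return a 1x256x256 tensor."""
    return preprocess(Image.open(path).convert("RGB"))
```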
In step (2) and step (6), a virtual camera array is set up, the three-dimensional model is rendered with Phong shading, and the depth information of the views is increased by adding illumination, which also reduces the semantic gap between the views and the target real images. Multi-angle projection views of the three-dimensional model are obtained; each three-dimensional model has 12 corresponding projection views, which represent the three-dimensional model as completely as possible while remaining efficient.
In step (3), the accurate image feature extraction network is an AlexNet network with attention blocks added. Its basic structure is consistent with AlexNet, comprising 5 convolutional layers and 3 fully connected layers; an attention block is located between every two consecutive convolutional layers and is formed by connecting 1 channel attention module and 1 spatial attention module in series. The attention block is suited for use between two consecutive convolutional structures; inserting attention blocks between all convolutional layers of the image feature extraction network improves its ability to extract accurate features from the input image and eliminates the influence of the complex background information of real images on the retrieval task.
In step (3), the three-dimensional model grouped-view feature extraction network uses the convolutional structure of the AlexNet network as its base network and includes a grouping sub-network. It comprises all 5 convolutional layers of the AlexNet network, with the grouping sub-network connected after the last convolutional layer. After the last convolutional layer outputs the view feature vectors, the grouping sub-network fuses the view feature vectors into group-level feature vectors, fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector, which it outputs.
In step (3), the grouping sub-network comprises a grouping weight module, a view pooling layer, a group pooling layer and a fully connected layer. The grouping sub-network fuses the view feature vectors into group-level feature vectors, fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it, with the following concrete implementation:
First, the grouping sub-network computes the view discrimination degrees from the view feature vectors and sets up the view groups. Second, the grouping weight module computes the view grouping weights from the view groups and the view discrimination degrees. Third, from the view feature vectors and the view groups, the view pooling layer fuses the view feature vectors into group-level feature vectors and outputs them. Then, from the group-level feature vectors and the view grouping weights, the group pooling layer fuses the group-level feature vectors into the shape-level feature vector and outputs it. Finally, the shape-level feature vector is input into the fully connected layer, fused into the three-dimensional model feature vector and output. The grouping sub-network attends to the similarities and differences between views and introduces grouping weights to distinguish the contributions of views from different perspectives to the model representation. Dividing the three-dimensional model feature extraction process into three stages, view features, group-level features and shape-level features, lets the network attend to the relations between different views while extracting per-view features, improving the representation ability and robustness of the generated three-dimensional model feature vector.
In steps (4) and (6), the regularization is L2 regularization. Compared with other regularization functions, the L2 regularization function is simpler to compute and can simply and effectively control model complexity and prevent overfitting.
In step (4), the loss function is:
L = max(d_pos - d_neg + margin, 0),
where d_pos denotes the distance between the positive pair, i.e. the positive three-dimensional model and the anchor image, d_neg denotes the distance between the negative pair, i.e. the negative three-dimensional model and the anchor image, and margin denotes the set relative distance:
d_pos = ||F'_A - F'_pos||_2,  d_neg = ||F'_A - F'_neg||_2.
in the step (5) and the step (6), the distance is an euclidean distance. Compared with other distance measurement methods, the Euclidean distance measurement method is simpler and more intuitive, and can effectively measure the similarity between the features in a high-dimensional mapping space.
The invention further provides a three-dimensional model cross-domain retrieval system based on complex background images, comprising a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module;
the target three-dimensional model library comprises a target three-dimensional model set S;
the input module is used for inputting a query image q and sending the query image q to the preprocessing module;
the projection processing module is used for processing each target three-dimensional model S in the target three-dimensional data set SiThe projection processing is carried out to obtain the three-dimensional model projection view group SViAnd sending the data to the preprocessing module;
the preprocessing module is used for respectively preprocessing the query data q sent by the input module and the three-dimensional model projection view group SVi sent by the projection processing module to obtain a processed query image q' and a processed three-dimensional model projection view group SVi', and send to the retrieval module;
the retrieval module comprises a trained image precise feature extraction network N'IAnd three-dimensional model grouping view feature extraction network N'MInputting the processed query image q 'sent by the preprocessing module into a trained image precise feature extraction network N'IIn the method, the image characteristic vector F corresponding to the query image q is obtained through outputqRegularization processing is carried out to obtain a regularized image feature vector F'qThe processed projection view set SV sent by the pre-processing modulei' input trained three-dimensional model visual groupingGraph feature extraction network N'MIn the method, a target three-dimensional model S is obtained through outputiCorresponding three-dimensional model feature vectors
Figure BDA0002495479180000091
Regularizing to obtain regularized three-dimensional model feature vector
Figure BDA0002495479180000092
Calculating the regularized image feature vector F'qAnd the regularized three-dimensional model feature vector
Figure BDA0002495479180000093
A distance D (q, S) therebetweeni) At said distance D (q, S)i) Weighing the query image q and each target three-dimensional model SiThe similarity between the two modules is sorted in a descending order to obtain a sorting result, and the sorting result is sent to the output module;
the output module selects a plurality of top-ranked target three-dimensional models S in the ranking result sent by the retrieval moduletopAs a result of a search similar to the query image q and output.
Compared with the prior art, the invention has the beneficial effects that:
compared with the existing mainstream method, the method provided by the invention allows a user to take a single RGB image with complex background information as query input to finish the accurate retrieval of the corresponding three-dimensional model. The method focuses on effective feature extraction of different modal data, aiming at the characteristics of an image with a complex background and a three-dimensional model view group, an image accurate feature extraction network and a three-dimensional model grouping view feature extraction network are respectively designed to extract effective features of different modal data, and then a basis is provided for cross-domain retrieval based on distance calculation by constructing a feature joint embedding space. The advantage of this is that the targeted design of the feature extraction network can improve the feature vector representation capability and robustness, thereby improving the retrieval accuracy.
Specifically, compared with the prior art, the invention has the following technical advantages:
(1) An end-to-end deep metric learning framework is provided: a triplet network is designed to map two-dimensional image and three-dimensional model data into the same high-dimensional space, reducing the gap between inter-domain data features while keeping same-class data within a domain close together and heterogeneous data apart, thereby achieving accurate cross-domain retrieval from a single RGB image with complex background information to a three-dimensional model.
(2) For the complex background information of the query image, an accurate image feature extraction network branch is designed; an attention mechanism is introduced to achieve adaptive refinement learning of image features and eliminate the influence of noise information on cross-domain retrieval accuracy.
(3) The three-dimensional model is represented by 12 projection views from different viewing angles, reducing the feature gap between data of different modalities, and the projection views are rendered to further reduce the semantic gap between the views and the target real images. A grouping mechanism is introduced in the multi-view feature fusion stage, improving the expressive power of the three-dimensional model feature vector and thereby the cross-domain retrieval accuracy.
In summary, to solve the problem of cross-domain retrieval of three-dimensional models from a single RGB image, the invention provides an end-to-end cross-domain retrieval technique and system based on a triplet network in the field of three-dimensional model retrieval. Compared with the prior art, the invention focuses on the noise information of image data and the loss of three-dimensional model features, designs the feature extraction networks accordingly, and improves the representation ability and robustness of the feature vectors, thereby achieving a good retrieval effect.
Drawings
FIG. 1 is a schematic flow chart of a complex background image-based three-dimensional model retrieval method;
FIG. 2 is a diagram of a cross-domain retrieval triple depth network framework;
FIG. 3 is a schematic diagram of an image exact feature extraction network;
FIG. 4 is a schematic view of a channel attention module;
FIG. 5 is a schematic view of a spatial attention module;
FIG. 6 is a schematic diagram of a three-dimensional model group view feature extraction network;
fig. 7 is a schematic structural diagram of a three-dimensional model cross-domain retrieval system based on a complex background image.
Detailed Description
The invention is described in detail below with reference to the figures and the detailed description. FIG. 1 describes the implementation process of the three-dimensional model retrieval method based on a single complex background image. FIG. 2 depicts the process of constructing the joint feature embedding space with the cross-domain retrieval triplet deep network. FIG. 3 depicts the extraction of features from images with complex backgrounds using the accurate image feature extraction network. FIG. 4 shows the channel attention module structure in the attention block of the accurate image feature extraction network. FIG. 5 shows the spatial attention module structure in the attention block of the accurate image feature extraction network. FIG. 6 depicts the feature extraction process for a three-dimensional model using the three-dimensional model grouped-view feature extraction network. FIG. 7 depicts the structure of the three-dimensional model cross-domain retrieval system based on complex background images.
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the method of the present invention includes the following steps:
(1) construct an original dataset D for cross-domain retrieval of three-dimensional models, comprising several different three-dimensional models M and images I with complex backgrounds, represented by triplets T = (I_A, M_pos, M_neg), where I_A denotes an image serving as the anchor, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
(2) preprocess the image data I_A in the triplet dataset T: resize it to a uniform 256 × 256 so that it suits the fully connected layers of the deep neural network, and convert it from an RGB image into a grayscale map to remove the interference of color information on retrieval, obtaining the processed image data I'_A. Apply projection processing to the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain the positive projection view group V_pos and the negative projection view group V_neg. Specifically, the invention sets up a virtual camera array of 12 virtual cameras placed around the three-dimensional model, pointing at its centroid at a 30° angle to the horizontal plane and spaced 30° apart; 12 projection viewing angles represent the three-dimensional model as completely as possible while remaining efficient. After the multi-angle projection views of the three-dimensional model are obtained, the model is rendered with Phong shading, and the depth information of the views is increased by adding illumination, which also reduces the semantic gap between the views and the target real images, yielding the processed positive projection view group V'_pos and negative projection view group V'_neg. Finally, the preprocessed standard dataset D' is represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
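The camera placement described above can be reproduced with a few lines of vector arithmetic; the following numpy sketch assumes an illustrative camera distance parameter (the patent does not specify one):

```python
import numpy as np

def virtual_camera_positions(centroid, distance=2.0, n_views=12, elevation_deg=30.0):
    """Return n_views camera positions on a ring around the model centroid,
    spaced 360/n_views degrees apart at the given elevation angle.
    Each camera looks at the centroid (look-at direction = centroid - position)."""
    elev = np.radians(elevation_deg)
    positions = []
    for k in range(n_views):
        azim = np.radians(k * 360.0 / n_views)  # 0, 30, 60, ... degrees
        offset = distance * np.array([
            np.cos(elev) * np.cos(azim),
            np.cos(elev) * np.sin(azim),
            np.sin(elev),
        ])
        positions.append(np.asarray(centroid) + offset)
    return np.stack(positions)  # shape (12, 3)
```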
(3) taking the processed triplets T' obtained in the preceding steps as input, construct the cross-domain retrieval triplet deep network N and train it. The triplet network is a fusion of deep learning and metric learning and can directly learn the mapping from the sample space to a compact Euclidean space. Moreover, by contrasting the differences of two inputs, the triplet network models details better and thus improves retrieval accuracy. The cross-domain retrieval triplet deep network has three branches corresponding to the input data. Branch one is the accurate image feature extraction network N_I: its input is the image I'_A, it performs adaptive refinement learning of the important image features through the neural network, and it outputs the image feature vector F_A. Branches two and three are the three-dimensional model grouped-view feature extraction networks N_M: their inputs are the positive projection view group V'_pos and the negative projection view group V'_neg respectively, their network structures are identical and their weights are shared, and they attend to the relations between different views while extracting per-view features, yielding the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg. For the characteristics of the input real images and three-dimensional model projection views, the cross-domain retrieval triplet deep network proposed by the invention focuses on completely extracting their effective features, reducing the semantic gap between real images and projection images and improving retrieval accuracy.
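Because the two view branches share weights, a single grouped-view network instance can serve both in code. A schematic forward pass, with the two sub-networks treated as black boxes defined further below (names are illustrative):

```python
import torch.nn as nn

class TripletCrossDomainNet(nn.Module):
    """Three branches: one image network N_I, one grouped-view network N_M
    applied to both the positive and the negative view groups (shared weights)."""
    def __init__(self, image_net: nn.Module, view_net: nn.Module):
        super().__init__()
        self.image_net = image_net  # accurate image feature extraction network
        self.view_net = view_net    # grouped-view feature extraction network

    def forward(self, anchor_img, pos_views, neg_views):
        f_anchor = self.image_net(anchor_img)  # image feature vector F_A
        f_pos = self.view_net(pos_views)       # positive model feature vector F_pos
        f_neg = self.view_net(neg_views)       # negative model feature vector F_neg (same weights)
        return f_anchor, f_pos, f_neg
```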
A real image contains complex background information irrelevant to retrieval, and these noise features reduce retrieval precision. Moreover, when the cross-domain joint feature embedding space is constructed through feature learning, the semantic gap between real images and "clean" projection images also reduces retrieval accuracy. Although directly using an AlexNet network for feature extraction can weaken part of the noise features, the effect is not ideal. Therefore, the accurate image feature extraction network designed by the invention is based on the AlexNet network; by adding an attention mechanism, the network pays more attention to object information when learning image features, eliminating the influence of complex background information on the retrieval task and obtaining the accurate features of the image.
The structure of the accurate image feature extraction network N_I is shown in FIG. 3. The basic network structure is consistent with the AlexNet network and comprises 5 convolutional layers and 3 fully connected layers, with an attention block between every two consecutive convolutional layers; there are 4 attention blocks, of which only one is shown in FIG. 3 as an example. Convolutional layer 1 has an 11 × 11 kernel with stride 4, followed by an LRN layer (local response normalization) and then a 3 × 3 max-pooling layer with stride 2. Convolutional layer 2 has a 5 × 5 kernel with stride 1, followed by an LRN layer and then a 3 × 3 max-pooling layer with stride 2. Convolutional layers 3, 4 and 5 all have 3 × 3 kernels with stride 1. Convolutional layers 1 and 2, 2 and 3, 3 and 4, and 4 and 5 are connected by attention blocks 1, 2, 3 and 4 respectively, each formed by one channel attention module and one spatial attention module in series. Convolutional layer 5 is followed by a 3 × 3 max-pooling layer with stride 2, after which come three fully connected layers: fully connected layers 1 and 2 have dimension 4096, and fully connected layer 3 is the output layer with dimension 128. An activation function layer follows each of the 5 convolutional layers and fully connected layers 1 and 2, using the Relu function.
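A PyTorch sketch of this branch following the layer sizes above is given below; the padding values and the AlexNet channel counts 96/256/384/384/256 are assumptions, since the patent text does not list them, and attention_block stands for the channel + spatial module sketched after FIG. 5:

```python
import torch.nn as nn

def conv_block(cin, cout, k, stride, pad, lrn=False, pool=False):
    # Conv -> Relu (-> LRN -> 3x3/2 max-pool), as described for each layer above.
    layers = [nn.Conv2d(cin, cout, k, stride=stride, padding=pad), nn.ReLU(inplace=True)]
    if lrn:
        layers.append(nn.LocalResponseNorm(size=5))
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))
    return nn.Sequential(*layers)

class AccurateImageFeatureNet(nn.Module):
    """AlexNet-style trunk with an attention block between consecutive conv layers."""
    def __init__(self, attention_block, in_ch=1):
        super().__init__()
        self.conv1 = conv_block(in_ch, 96, 11, 4, 2, lrn=True, pool=True)
        self.conv2 = conv_block(96, 256, 5, 1, 2, lrn=True, pool=True)
        self.conv3 = conv_block(256, 384, 3, 1, 1)
        self.conv4 = conv_block(384, 384, 3, 1, 1)
        self.conv5 = conv_block(384, 256, 3, 1, 1, pool=True)
        self.att1, self.att2 = attention_block(96), attention_block(256)
        self.att3, self.att4 = attention_block(384), attention_block(384)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, 4096), nn.ReLU(inplace=True),  # for 256x256 input
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 128),  # output layer: 128-d image feature vector
        )

    def forward(self, x):
        x = self.att1(self.conv1(x))  # attention block 1 between conv layers 1 and 2
        x = self.att2(self.conv2(x))
        x = self.att3(self.conv3(x))
        x = self.att4(self.conv4(x))
        return self.fc(self.conv5(x))
```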
The accurate image feature extraction network N_I extracts the features of the important objects in the image through an attention mechanism: within an attention block, the channel attention module focuses on which features are meaningful, and the spatial attention module focuses on where the features are meaningful. After the output feature F of the preceding convolutional layer is obtained, it is multiplied by the channel attention weight map A_c to obtain the channel-adaptive feature F_1, which is then multiplied by the spatial attention weight map A_s to obtain the spatially adaptive feature F_2. The overall process is defined as follows:
F_1 = A_c(F) ⊗ F,
F_2 = A_s(F_1) ⊗ F_1,
where ⊗ denotes element-wise multiplication.
Channel attention is shown in FIG. 4. The input feature F is an H × W × C tensor, where H denotes height, W width and C the number of channels. The channel attention module applies max-pooling and average-pooling to F along the spatial dimensions to obtain two 1 × 1 × C channel feature vectors. The advantage of using both pooling methods is that max-pooling retains more texture information, while average-pooling receives feedback from every pixel of the feature map and can transfer information completely while reducing dimensionality. The two vectors are fed into a multilayer perceptron with one hidden layer, with Relu activation and a parameter compression ratio of 16; the two resulting feature vectors are then added and passed through a Sigmoid activation function to obtain the 1 × 1 × C channel attention weight map A_c.
The feature F_1 obtained by filtering the image feature channels with the channel attention module is also an H × W × C tensor. Similar in principle to the channel attention module, the spatial attention module applies max-pooling and average-pooling to F_1 along the channel dimension to obtain two H × W × 1 spatial feature maps, as shown in FIG. 5. These are concatenated along the channel dimension into a 2-channel feature map, and a 7 × 7 convolutional layer followed by a Sigmoid activation function generates the H × W × 1 spatial attention weight map A_s. Finally, the regionally filtered new feature F_2 is obtained and can be fed into the next convolutional layer for further feature extraction, ultimately producing the accurate features of the image.
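A self-contained PyTorch sketch of the two modules and the resulting attention block, following FIGs. 4 and 5 (compression ratio 16 for the shared MLP, 7 × 7 convolution for the spatial map):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """1x1xC weight map A_c from spatial max- and average-pooling + shared MLP (FIG. 4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_feat = self.mlp(x.amax(dim=(2, 3)))              # max-pool over H, W
        avg_feat = self.mlp(x.mean(dim=(2, 3)))              # average-pool over H, W
        a_c = torch.sigmoid(max_feat + avg_feat).view(b, c, 1, 1)
        return x * a_c                                       # F_1 = A_c(F) ⊗ F

class SpatialAttention(nn.Module):
    """HxWx1 weight map A_s from channel-wise max/avg pooling + 7x7 conv (FIG. 5)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                    # x: (B, C, H, W)
        pooled = torch.cat([x.amax(dim=1, keepdim=True),
                            x.mean(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        a_s = torch.sigmoid(self.conv(pooled))
        return x * a_s                                       # F_2 = A_s(F_1) ⊗ F_1

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention, used between conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```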
The invention adopts a multi-view method to extract three-dimensional model features. Existing cross-domain retrieval work usually ignores the interconnections between different views, causing feature loss. The invention designs a three-dimensional model grouped-view feature extraction network N_M that divides the model feature extraction process into three stages, view features, group-level features and shape-level features, and mines inter-view information on top of per-view feature extraction. Considering the characteristics of the views, such as high similarity between some views, large differences between others, and the differing contributions of views from different angles to the model representation, a grouping mechanism and grouping weights are introduced so that the extracted feature vectors have better representation ability and robustness, improving cross-domain retrieval accuracy. The structure of the three-dimensional model grouped-view feature extraction network is shown in FIG. 6.
After the three-dimensional model grouped-view feature extraction network completes feature extraction for each view, a grouping sub-network is added to divide the view features into groups and compute the corresponding weights, fusing the view feature vectors into group-level feature vectors, fusing the group-level feature vectors into a shape-level feature vector, and finally fusing the shape-level feature vector into the three-dimensional model feature vector that is output. The view feature vectors are extracted by a 5-layer convolutional network whose structure is consistent with the first 5 convolutional layers of the AlexNet network in the accurate image feature extraction network; since the view images are "clean" images without background, no attention layers need to be added.
After the view features are obtained, they are put into the grouping sub-network for division, yielding a grouping scheme and grouping weights. Specifically, the view discrimination degree is computed from the view features and grouping is performed according to this degree. The discrimination degree is defined as follows:
D(V_i) = Sigmoid(log(abs(f(V_i))))
where V_i denotes an input view, f(V_i) denotes the view feature vector extracted by the 5-layer convolutional network, and D(V_i) denotes the discrimination score of the view. After the Sigmoid mapping, the discrimination scores of the views lie in (0, 1); the log(·) and abs(·) functions make the scores uniformly distributed.
After the discrimination degree of each view is obtained, the interval (0, 1) is divided into 4 equal-length sub-intervals; the discrimination degree of each view is checked in turn, and views falling in the same sub-interval are put into the same group, yielding the grouping scheme G_j, j = 1, 2, 3, 4. The output of the grouping scheme contains the view numbers and discrimination degrees of each group.
The grouping weight module computes the weight W(G_j) of each group from the grouping scheme and the discrimination degrees of the views in the group, for use in the group-level feature fusion step: a group with a higher sum of discrimination scores receives a larger weight, and vice versa. [The weight formula survives only as an image placeholder in the source; it is built from the ceiling function Ceil(·) applied over the in-group discrimination scores D(V_i) and from |G_j|, the number of views projected within each group.]
Views in the same group have similar discrimination degrees, similar in-group image features and similar model representation ability, so the in-group view features can be fused according to the information provided by the grouping scheme. The in-group view feature fusion is completed by a view pooling layer, a pooling layer of the multi-view convolutional neural network dedicated to fusing several view feature vectors. The view feature fusion process is defined as follows:
λ(V_i, G_j) = 1 if V_i ∈ G_j, and 0 otherwise,
F(G_j) = (1/N) Σ_i λ(V_i, G_j) · f(V_i),
where F(G_j) denotes the group-level feature vector, λ is used to decide whether a view belongs to the group, and N denotes the number of views in the group.
The group-level features are fused according to the results of the grouping weight module to obtain the shape-level feature vector, and the final three-dimensional model feature vector is output through a fully connected layer of dimension 128. The group-level feature fusion process is defined as follows:
F(S) = Σ_{j=1}^{M} W(G_j) · F(G_j) / Σ_{j=1}^{M} W(G_j),
where F(S) denotes the shape-level feature vector obtained by fusing the group-level features and M is the number of groups (M = 4).
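The grouping sub-network can be sketched as follows; since the weight formula above survives only partially, this sketch uses a per-view mean reduction for the discrimination score and the mean in-group score as the group weight, both of which are assumptions standing in for the patent's Ceil-based formula:

```python
import torch
import torch.nn as nn

class GroupingSubNetwork(nn.Module):
    """Fuses 12 view feature vectors into one model feature vector in three stages:
    view features -> group-level features -> shape-level feature (then a 128-d fc)."""
    def __init__(self, feat_dim, n_groups=4, out_dim=128):
        super().__init__()
        self.n_groups = n_groups
        self.fc = nn.Linear(feat_dim, out_dim)

    def forward(self, view_feats):              # view_feats: (n_views, feat_dim)
        # Discrimination score D(V_i) = Sigmoid(log(abs(f(V_i)))), reduced to one
        # scalar per view (reduction by mean is an assumption).
        scores = torch.sigmoid(torch.log(view_feats.abs().mean(dim=1) + 1e-12))
        # Assign each view to one of n_groups equal-length sub-intervals of (0, 1).
        group_idx = (scores * self.n_groups).clamp(max=self.n_groups - 1e-6).long()

        group_feats, group_weights = [], []
        for j in range(self.n_groups):
            mask = group_idx == j
            if mask.any():
                # View pooling: average the view features inside the group.
                group_feats.append(view_feats[mask].mean(dim=0))
                # Group weight from the in-group discrimination scores (mean used
                # here as a stand-in for the patent's Ceil-based formula).
                group_weights.append(scores[mask].mean())
        feats = torch.stack(group_feats)                    # (n_nonempty_groups, feat_dim)
        w = torch.stack(group_weights)
        shape_feat = (w.unsqueeze(1) * feats).sum(0) / w.sum()  # weighted group fusion
        return self.fc(shape_feat)                          # 128-d model feature vector
```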
(4) Through step (3), the cross-domain retrieval triplet deep network yields the image feature vector F_A, the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg. Each is then normalized with the L2 regularization function to obtain the regularized image feature vector F'_A, regularized positive three-dimensional model feature vector F'_pos and regularized negative three-dimensional model feature vector F'_neg. Compared with other regularization functions, the L2 regularization function is simpler to compute and can simply and effectively control model complexity and prevent overfitting. The process is defined as follows:
v' = v / max(||v||_2, ε),
where v denotes the feature vector, each element of which is divided by max(||v||_2, ε), and ε = 1e-12 guards against division by zero.
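This matches the standard L2 normalization provided, for example, by PyTorch's F.normalize; a minimal sketch:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(128)                              # a feature vector from either branch
feat_l2 = F.normalize(feat, p=2, dim=0, eps=1e-12)   # v / max(||v||_2, eps)
```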
(5) The loss function L of the cross-domain retrieval triplet deep network is applied to the image feature vector F'_A and the three-dimensional model feature vectors F'_pos and F'_neg to construct the joint embedding feature space: the Euclidean distance measures the similarity between feature vectors, the feature vectors of data from different domains are mapped into the same high-dimensional space, and the gap between domains is reduced while same-class data within the space lie close together and heterogeneous data lie far apart. Let the distance between the positive pair be d_pos = ||F'_A - F'_pos||_2 and the distance between the negative pair be d_neg = ||F'_A - F'_neg||_2. The loss function is defined as follows:
L = max(d_pos - d_neg + margin, 0)
where margin is a set relative distance that prevents the cross-domain retrieval triplet deep network model from taking shortcuts during training and producing erroneous results.
Using the processed triplets T', the parameters of the cross-domain retrieval triplet deep network model N are trained iteratively until the loss function L falls below the set threshold; training then stops, yielding the trained cross-domain retrieval triplet deep network model N'. The triplet network constructs a joint embedding feature space for cross-domain data, so that image data features and three-dimensional model data features are distributed in the same space in clusters by category, and the similarity of data of different modalities can be measured by directly computing the distance between features.
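A schematic training step under these definitions is sketched below; PyTorch's built-in TripletMarginLoss implements max(d_pos - d_neg + margin, 0) with Euclidean distances, and the margin value, learning rate and epoch cap are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train(model, triplet_loader, threshold, margin=0.5, lr=1e-4, max_epochs=100):
    """model: the triplet network sketched earlier; triplet_loader yields
    (anchor_img, pos_views, neg_views) batches built from the processed triplets T'."""
    criterion = torch.nn.TripletMarginLoss(margin=margin, p=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for anchor_img, pos_views, neg_views in triplet_loader:
            f_a, f_p, f_n = model(anchor_img, pos_views, neg_views)
            # L2-regularize the three feature vectors before measuring distances.
            f_a, f_p, f_n = (F.normalize(v, dim=1) for v in (f_a, f_p, f_n))
            loss = criterion(f_a, f_p, f_n)   # L = max(d_pos - d_neg + margin, 0)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < threshold:       # stop once the loss is below the threshold
                return model
    return model
```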
(6) When executing a retrieval task, given a query image q and a target three-dimensional model set S, the query image q is first preprocessed to obtain the image q', and each target three-dimensional model S_i in the target set S undergoes projection processing to obtain the view group SV_i, which is preprocessed to obtain the view group SV'_i. The query image q' is then input into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q; regularization yields the image feature vector F'_q. Each view group SV'_i is input into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the feature vector F_Si corresponding to the target model S_i; regularization yields the feature vector F'_Si. Finally, the Euclidean distance D(q, S_i) between the image feature vector F'_q and each three-dimensional model feature vector F'_Si is computed; the distance D(q, S_i) measures the similarity between the query image q and each target three-dimensional model S_i. The models are sorted in descending order of similarity, and the 5 top-ranked target three-dimensional models S_top are selected as retrieval results similar to the query image q and output. Compared with other distance metrics, the Euclidean distance is simpler and more intuitive, and effectively measures the similarity between features in the high-dimensional mapping space. It is defined as follows:
D(q, S_i) = sqrt( Σ_{k=1}^{n} (f_q,k - f_Si,k)² ),
where q is the query image, S_i the three-dimensional model compared with it, f_q,k and f_Si,k the elements of the image and three-dimensional model feature vectors in the joint embedding space, and n the dimension of the feature vectors.
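Retrieval then reduces to a nearest-neighbour search by Euclidean distance in the joint embedding space; a sketch returning the 5 top-ranked models:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, model_feats, top_k=5):
    """query_feat: (128,) image feature F_q; model_feats: (n_models, 128) model
    features F_Si. Returns the indices of the top_k most similar target models."""
    q = F.normalize(query_feat, dim=0)                  # regularized F'_q
    m = F.normalize(model_feats, dim=1)                 # regularized F'_Si
    dists = torch.cdist(q.unsqueeze(0), m).squeeze(0)   # Euclidean distances D(q, S_i)
    return torch.argsort(dists)[:top_k]                 # ascending distance = descending similarity
```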
The three-dimensional model cross-domain retrieval system based on complex background images is shown in FIG. 7 and comprises a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module:
a target three-dimensional model library comprising a target three-dimensional model set S;
the input module is used for inputting the query image q and sending the query image q to the preprocessing module;
a projection processing module for applying projection processing to each target three-dimensional model S_i in the target three-dimensional model set S to obtain its projection view group SV_i and sending it to the preprocessing module;
a preprocessing module for preprocessing the query image q sent by the input module and the projection view groups SV_i sent by the projection processing module, obtaining the processed query image q' and processed projection view groups SV'_i, and sending them to the retrieval module;
a retrieval module comprising the trained accurate image feature extraction network N'_I and the trained three-dimensional model grouped-view feature extraction network N'_M; the processed query image q' sent by the preprocessing module is input into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularization yields the regularized image feature vector F'_q; each processed projection view group SV'_i sent by the preprocessing module is input into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_Si corresponding to the target model S_i, and regularization yields the regularized three-dimensional model feature vector F'_Si; the distance D(q, S_i) between the regularized image feature vector F'_q and each regularized three-dimensional model feature vector F'_Si is computed, the distance D(q, S_i) measures the similarity between the query image q and each target three-dimensional model S_i, the models are sorted in descending order to obtain a ranking result, and the ranking result is sent to the output module;
the output module selects the top 5 target three-dimensional models S in the sequencing result sent by the retrieval moduletopAs a result of the search similar to the query image q and output.

Claims (8)

1. A three-dimensional model cross-domain retrieval method based on a complex background image is characterized by comprising the following steps:
step 1) constructing an original data set D comprising a number of different three-dimensional models M and images I with complex backgrounds, the original data set D being represented as triplets T = (I_A, M_pos, M_neg), where I_A denotes an anchor image, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
step 2) preprocessing the image I_A in the triplet T to obtain a processed image I'_A; respectively carrying out projection processing on the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain a positive three-dimensional model projection view group V_pos and a negative three-dimensional model projection view group V_neg, and respectively preprocessing them to obtain a processed positive projection view group V'_pos and a processed negative projection view group V'_neg, thereby obtaining a standard data set D' represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
step 3) constructing a cross-domain retrieval triplet deep network model N for the processed triplets T', the cross-domain retrieval triplet deep network comprising 3 branch networks: 1 image accurate feature extraction network N_I and 2 three-dimensional model grouping view feature extraction networks N_M with the same structure and shared weights, wherein the input of the image accurate feature extraction network N_I is the processed image I'_A in the processed triplet T' and its output is the image feature vector F_{I_A}; the inputs of the three-dimensional model grouping view feature extraction networks N_M are the processed positive projection view group V'_pos and the processed negative projection view group V'_neg in the processed triplet T', and their outputs are respectively the positive three-dimensional model feature vector F_{M_pos} and the negative three-dimensional model feature vector F_{M_neg};
The image accurate feature extraction network is an AlexNet network comprising attention blocks, the network comprises 5 convolutional layers and 3 full-connection layers, and the attention blocks are positioned between every two convolutional layers connected in front and back and are formed by connecting 1 channel attention module and 1 space attention module in series;
the three-dimensional model grouping view feature extraction network takes the convolutional structure of the AlexNet network as its base network and comprises a grouping sub-network; the three-dimensional model grouping view feature extraction network comprises all 5 convolutional layers of the AlexNet network, with the grouping sub-network connected after the last convolutional layer; after the last convolutional layer outputs the view feature vectors, the grouping sub-network fuses the view feature vectors into group-level feature vectors, then fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it;
step 4) regularizing the image feature vector F_{I_A}, the positive three-dimensional model feature vector F_{M_pos} and the negative three-dimensional model feature vector F_{M_neg} to obtain the regularized image feature vector F'_{I_A}, the regularized positive three-dimensional model feature vector F'_{M_pos} and the regularized negative three-dimensional model feature vector F'_{M_neg}, and defining a loss function L of the cross-domain retrieval triplet deep network model N;
step 5) iteratively training the parameters of the cross-domain retrieval triplet deep network model N with the processed triplets T' until the loss function L is smaller than a set threshold, then stopping training to obtain the trained cross-domain retrieval triplet deep network model N', thereby completing the construction of the joint feature embedding space for the images I_A and the three-dimensional models M, wherein the trained cross-domain retrieval triplet deep network model N' comprises 3 trained branch networks: a trained image accurate feature extraction network N'_I and two trained three-dimensional model grouping view feature extraction networks N'_M;
step 6) when a retrieval task is executed, given a query image q and a target three-dimensional model set S, preprocessing the query image q to obtain a processed query image q', carrying out the projection processing on each target three-dimensional model S_i in the target three-dimensional model set S to obtain a three-dimensional model projection view group SV_i, and then carrying out the preprocessing to obtain a processed three-dimensional model projection view group SV'_i; inputting the processed query image q' into the trained image accurate feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularizing it to obtain the regularized image feature vector F'_q; inputting the processed projection view group SV'_i into the trained three-dimensional model grouping view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_{S_i} corresponding to the target three-dimensional model S_i, and regularizing it to obtain the regularized three-dimensional model feature vector F'_{S_i}; calculating the distance D(q, S_i) between the regularized image feature vector F'_q and the regularized three-dimensional model feature vector F'_{S_i}, measuring by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sorting the models in descending order of similarity, and selecting a plurality of top-ranked target three-dimensional models S_top, which are output as the retrieval results similar to the query image q.
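As an illustration of the attention block recited in claim 1 (1 channel attention module and 1 spatial attention module connected in series, between adjacent convolutional layers), a minimal PyTorch sketch in the spirit of CBAM follows; the pooling choices and reduction ratio are assumptions, since the claim does not fix them:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        # Pool over spatial dims, compute a weight per channel, rescale x.
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool over channels, compute one weight per spatial location.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class AttentionBlock(nn.Module):
    """Channel attention followed in series by spatial attention, as
    inserted between adjacent convolutional layers of AlexNet."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```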
2. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the steps 2) and 6), the preprocessing comprises size unification and grayscale processing: the size unification unifies the sizes of the images or the three-dimensional model projection views, and the grayscale processing converts RGB color images into grayscale images.
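A small sketch of the claimed preprocessing with torchvision transforms; the 227×227 target size is an assumption (a common AlexNet input size), not stated in the claim:

```python
from torchvision import transforms

# Unify the size of images / projection views, then convert RGB to grayscale.
preprocess = transforms.Compose([
    transforms.Resize((227, 227)),                # size unification (assumed size)
    transforms.Grayscale(num_output_channels=1),  # RGB -> grayscale
    transforms.ToTensor(),
])
# processed = preprocess(pil_image)
```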
3. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the steps 2) and 6), the projection processing is implemented by setting up a virtual camera array and rendering the three-dimensional model with Phong shading to obtain multi-angle projection views of the three-dimensional model, each three-dimensional model corresponding to 12 projection views.
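For illustration, the sketch below computes positions for such a virtual camera array in NumPy, assuming the common multi-view arrangement of 12 cameras spaced 30° apart at a fixed elevation around a model centered at the origin; the radius and elevation are assumptions, and the Phong-shaded renderer itself is left abstract:

```python
import numpy as np

def camera_array(n_views=12, radius=2.0, elevation_deg=30.0):
    """Positions of n_views virtual cameras on a circle around the model,
    spaced 360/n_views degrees apart at a fixed elevation, all looking at
    the origin (where the 3D model is assumed to be centered)."""
    elev = np.radians(elevation_deg)
    azimuths = np.radians(np.arange(n_views) * 360.0 / n_views)
    return np.stack([radius * np.cos(elev) * np.cos(azimuths),
                     radius * np.cos(elev) * np.sin(azimuths),
                     np.full(n_views, radius * np.sin(elev))], axis=1)

# views = [render_phong(model, cam) for cam in camera_array()]  # renderer assumed
```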
4. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein the grouping sub-network comprises a grouping weight module, a view pooling layer, a group pooling layer and a fully connected layer, and the grouping sub-network fuses the view feature vectors into group-level feature vectors, then fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it by:
firstly, the grouping sub-network calculates the discrimination of each view from its view feature vector and assigns the views to view groups; secondly, the grouping weight module calculates the view group weights from the view grouping and the view discrimination; thirdly, according to the view feature vectors and the view grouping, the view pooling layer fuses the view feature vectors into group-level feature vectors and outputs them; then, according to the group-level feature vectors and the view group weights, the group pooling layer fuses the group-level feature vectors into the shape-level feature vector and outputs it; finally, the shape-level feature vector is input into the fully connected layer, fused into the three-dimensional model feature vector and output.
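A simplified PyTorch sketch of this view-to-group-to-shape fusion, under stated assumptions: view discrimination is scored by a small learned gate, views are binned into a fixed number of groups by that score, view pooling is element-wise max within a group, and group pooling is a weighted average. The claim fixes the stages and modules, not these particulars:

```python
import torch
import torch.nn as nn

class GroupingSubNetwork(nn.Module):
    def __init__(self, feat_dim, out_dim, n_groups=4):
        super().__init__()
        self.n_groups = n_groups
        self.score = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.fc = nn.Linear(feat_dim, out_dim)  # shape level -> model feature

    def forward(self, view_feats):               # (n_views, feat_dim)
        d = self.score(view_feats).squeeze(-1)   # view discrimination in (0, 1)
        # Assign each view to a group by discretizing its discrimination score.
        groups = torch.clamp((d * self.n_groups).long(), max=self.n_groups - 1)
        group_feats, group_weights = [], []
        for g in range(self.n_groups):
            mask = groups == g
            if mask.any():
                # View pooling: fuse one group's view features (max pooling).
                group_feats.append(view_feats[mask].amax(dim=0))
                # Group weight: mean discrimination of the group's views.
                group_weights.append(d[mask].mean())
        feats = torch.stack(group_feats)          # (n_used_groups, feat_dim)
        w = torch.stack(group_weights)
        w = w / w.sum()
        # Group pooling: weighted fusion of group-level features -> shape level.
        shape_feat = (w.unsqueeze(-1) * feats).sum(dim=0)
        # Fully connected layer: shape level -> 3D model feature vector.
        return self.fc(shape_feat)
```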
5. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the step 4) and the step 6), the regularization process is L2 regularization.
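In practice, the L2 regularization of a feature vector amounts to scaling it to unit Euclidean norm, e.g. in PyTorch (shown for clarity; the patent does not prescribe an implementation, and the 4096-dimensional example vector is only an assumption):

```python
import torch
import torch.nn.functional as F

f = torch.randn(4096)                        # e.g. an image or model feature vector
f_regularized = F.normalize(f, p=2, dim=-1)  # scale to unit L2 norm
```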
6. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the step 4), the loss function is defined as:
L = max(d_pos − d_neg + margin, 0),
where d_pos denotes the distance between the positive sample pair, the positive sample pair being a positive three-dimensional model and an anchor image; d_neg denotes the distance between the negative sample pair, the negative sample pair being a negative three-dimensional model and an anchor image; and margin denotes a set relative distance,
d_pos = D(F'_{I_A}, F'_{M_pos}),
d_neg = D(F'_{I_A}, F'_{M_neg}),
where D(·,·) is the distance between two feature vectors in the joint feature embedding space.
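A sketch of this loss, together with the iterative training of step 5) that stops once the loss falls below a set threshold; the optimizer, learning rate, margin value and threshold are assumptions, and the network names in the usage comment are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.5):
    """L = max(d_pos - d_neg + margin, 0) on L2-regularized features."""
    f_a = F.normalize(f_anchor, dim=-1)
    f_p = F.normalize(f_pos, dim=-1)
    f_n = F.normalize(f_neg, dim=-1)
    d_pos = torch.norm(f_a - f_p, dim=-1)  # anchor image vs. positive model
    d_neg = torch.norm(f_a - f_n, dim=-1)  # anchor image vs. negative model
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Training sketch: image_net is N_I, model_net is N_M (the two model branches
# share weights); hyper-parameters are assumptions.
# opt = torch.optim.Adam(list(image_net.parameters())
#                        + list(model_net.parameters()), lr=1e-4)
# for img, v_pos, v_neg in loader:
#     loss = triplet_loss(image_net(img), model_net(v_pos), model_net(v_neg))
#     opt.zero_grad(); loss.backward(); opt.step()
#     if loss.item() < threshold:  # stop when below the set threshold
#         break
```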
7. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 6, wherein the distance is a Euclidean distance.
8. A three-dimensional model cross-domain retrieval system based on a complex background image is characterized by comprising: the system comprises a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module;
the target three-dimensional model library comprises a target three-dimensional model set S;
the input module is used for inputting a query image q and sending the query image q to the preprocessing module;
the projection processing module is used for carrying out the projection processing on each target three-dimensional model S_i in the target three-dimensional model set S to obtain the three-dimensional model projection view group SV_i and sending it to the preprocessing module;
the preprocessing module is used for respectively carrying out the preprocessing on the query image q sent by the input module and the three-dimensional model projection view group SV_i sent by the projection processing module to obtain a processed query image q' and a processed three-dimensional model projection view group SV'_i, and sending them to the retrieval module;
the retrieval module comprises the trained image accurate feature extraction network N'_I and the trained three-dimensional model grouping view feature extraction network N'_M; the processed query image q' sent by the preprocessing module is input into the trained image accurate feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, regularized to obtain the regularized image feature vector F'_q; the processed projection view group SV'_i sent by the preprocessing module is input into the trained three-dimensional model grouping view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_{S_i} corresponding to the target three-dimensional model S_i, regularized to obtain the regularized three-dimensional model feature vector F'_{S_i}; the retrieval module calculates the distance D(q, S_i) between the regularized image feature vector F'_q and the regularized three-dimensional model feature vector F'_{S_i}, measures by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sorts the models in descending order of similarity to obtain a ranking result, and sends the ranking result to the output module;
the output module selects a plurality of top-ranked target three-dimensional models S in the ranking result sent by the retrieval moduletopAs a result of a search similar to the query image q and output.
CN202010417173.9A 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image Pending CN111625667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010417173.9A CN111625667A (en) 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image


Publications (1)

Publication Number Publication Date
CN111625667A true CN111625667A (en) 2020-09-04

Family

ID=72259810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010417173.9A Pending CN111625667A (en) 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image

Country Status (1)

Country Link
CN (1) CN111625667A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411794A (en) * 2011-07-29 2012-04-11 南京大学 Output method of two-dimensional (2D) projection of three-dimensional (3D) model based on spherical harmonic transform
US20180011620A1 (en) * 2016-07-11 2018-01-11 The Boeing Company Viewpoint Navigation Control for Three-Dimensional Visualization Using Two-Dimensional Layouts
CN110188228A (en) * 2019-05-28 2019-08-30 北方民族大学 Cross-module state search method based on Sketch Searching threedimensional model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Yujia et al., "Single-image three-dimensional model retrieval based on a triplet network", Journal of Beijing University of Aeronautics and Astronautics (北京航空航天大学学报) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308153A (en) * 2020-11-02 2021-02-02 创新奇智(广州)科技有限公司 Smoke and fire detection method and device
CN112308153B (en) * 2020-11-02 2023-11-24 创新奇智(广州)科技有限公司 Firework detection method and device
WO2022097302A1 (en) * 2020-11-09 2022-05-12 富士通株式会社 Generation program, generation method, and information processing device
JP7452695B2 (en) 2020-11-09 2024-03-19 富士通株式会社 Generation program, generation method, and information processing device
CN112270762A (en) * 2020-11-18 2021-01-26 天津大学 Three-dimensional model retrieval method based on multi-mode fusion
CN112686884A (en) * 2021-01-12 2021-04-20 李成龙 Automatic modeling system and method for imaging marking characteristics
CN113032613A (en) * 2021-03-12 2021-06-25 哈尔滨理工大学 Three-dimensional model retrieval method based on interactive attention convolution neural network
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN112905832B (en) * 2021-05-07 2021-08-03 广东众聚人工智能科技有限公司 Complex background fine-grained image retrieval system and method
CN112905832A (en) * 2021-05-07 2021-06-04 广东众聚人工智能科技有限公司 Complex background fine-grained image retrieval system and method
CN113177525A (en) * 2021-05-27 2021-07-27 杭州有赞科技有限公司 AI electronic scale system and weighing method
CN113779287A (en) * 2021-09-02 2021-12-10 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN113779287B (en) * 2021-09-02 2023-09-15 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN115985509A (en) * 2022-12-14 2023-04-18 广东省人民医院 Medical imaging data retrieval system, method, device and storage medium
CN117540043A (en) * 2023-12-11 2024-02-09 济南大学 Three-dimensional model retrieval method and system based on cross-instance and category comparison
CN117540043B (en) * 2023-12-11 2024-04-12 济南大学 Three-dimensional model retrieval method and system based on cross-instance and category comparison

Similar Documents

Publication Publication Date Title
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
Cong et al. Going from RGB to RGBD saliency: A depth-guided transformation model
Qi et al. Review of multi-view 3D object recognition methods based on deep learning
Lin et al. CODE: Coherence based decision boundaries for feature correspondence
Liu et al. Multi-modal clique-graph matching for view-based 3d model retrieval
Feng et al. Relation graph network for 3D object detection in point clouds
Cong et al. Global-and-local collaborative learning for co-salient object detection
Li et al. Multi-scale neighborhood feature extraction and aggregation for point cloud segmentation
CN110674741B (en) Gesture recognition method in machine vision based on double-channel feature fusion
Tian et al. Densely connected attentional pyramid residual network for human pose estimation
Liu et al. TreePartNet: neural decomposition of point clouds for 3D tree reconstruction
Han et al. Weakly-supervised learning of category-specific 3D object shapes
Gao et al. Multi-level view associative convolution network for view-based 3D model retrieval
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN114067075A (en) Point cloud completion method and device based on generation of countermeasure network
CN112784782A (en) Three-dimensional object identification method based on multi-view double-attention network
Cao et al. Accurate 3-D reconstruction under IoT environments and its applications to augmented reality
Lee et al. Connectivity-based convolutional neural network for classifying point clouds
Lu et al. Large-scale tracking for images with few textures
Lee et al. Learning semantic correspondence exploiting an object-level prior
Zhou et al. Learning transferable and discriminative representations for 2D image-based 3D model retrieval
Lei et al. Mesh convolution with continuous filters for 3-d surface parsing
Mallis et al. From keypoints to object landmarks via self-training correspondence: A novel approach to unsupervised landmark discovery
CN111797269A (en) Multi-view three-dimensional model retrieval method based on multi-level view associated convolutional network
Yuan et al. SHREC 2020 track: 6D object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200904)