CN111625667A - Three-dimensional model cross-domain retrieval method and system based on complex background image


Info

Publication number
CN111625667A
CN111625667A (Application CN202010417173.9A)
Authority
CN
China
Prior art keywords
dimensional model
image
view
network
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010417173.9A
Other languages
Chinese (zh)
Inventor
Li Haisheng
Du Yujia
Li Yong
Yao Chunlian
Li Nan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202010417173.9A
Publication of CN111625667A
Legal status: Pending


Classifications

    • G06F16/532 Query formulation, e.g. graphical querying (information retrieval of still image data; querying)
    • G06F16/538 Presentation of query results (information retrieval of still image data; querying)
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content (information retrieval of still image data)
    • G06F18/22 Matching criteria, e.g. proximity measures (pattern recognition; analysing)
    • G06N3/045 Combinations of networks (neural networks; architecture)
    • G06N3/08 Learning methods (neural networks)

Abstract

The invention discloses a three-dimensional model cross-domain retrieval method and system based on complex background images. The method designs a cross-domain retrieval triplet deep network: an accurate image feature extraction network and a three-dimensional model grouped-view feature extraction network perform effective feature extraction on the input data, and a joint feature embedding space is constructed that maps features from different domains into the same high-dimensional space, so that features of same-class data lie closer together and features of different-class data lie farther apart. Finally, the similarity between an image and a three-dimensional model is measured with the Euclidean distance in the joint embedding space to complete cross-domain retrieval. Given a single input RGB image with complex background information, the invention retrieves the corresponding three-dimensional model.

Description

Three-dimensional model cross-domain retrieval method and system based on complex background image
Technical Field
The invention relates to the fields of computer graphics and computer vision, and in particular to a three-dimensional model cross-domain retrieval method and system based on complex background images.
Background
The arrival of the information age has strongly driven the development of computer hardware, and media data of all kinds, such as audio, video, images and three-dimensional data, are growing explosively. Three-dimensional models are now widely used in computer graphics and computer vision, for example in 3D printing, computer-aided design, film animation and medical diagnosis. To cope with the huge and growing volume of three-dimensional data involved in these applications, designing fast and effective three-dimensional model retrieval methods has become a hot research problem.
Most current retrieval work is example-based three-dimensional model retrieval: a query three-dimensional model must be provided and represented by voxels, point clouds, meshes or multi-view methods; a feature descriptor is extracted and compared for similarity against the model feature descriptors in a three-dimensional model library to return similar three-dimensional models. Example-based three-dimensional model retrieval is a same-domain retrieval problem, and because the three-dimensional model contains rich feature information, its retrieval accuracy is relatively high. In real life, however, a three-dimensional model to use as the query is not easy to obtain, whereas a two-dimensional image is easy to acquire in practical applications, so retrieving three-dimensional models from a single two-dimensional image has important research significance and practical value.
Retrieving three-dimensional models from two-dimensional images is a cross-domain retrieval problem: the input can be an RGB image, a hand-drawn sketch or an RGB-D image, and the output is the three-dimensional model corresponding to the image. Current research can be divided into traditional retrieval methods based on hand-crafted features and retrieval methods based on deep-learned features. Hand-crafted methods obtain low-level descriptors of the image and the three-dimensional model through manual design and then measure similarity by distance computation, e.g. methods based on the bag-of-features model (Bronstein A M, Bronstein M M, Guibas L J, et al. Shape google: Geometric words and expressions for invariant shape retrieval [J]. ACM Transactions on Graphics, 2011, 30(1): 1-20.) and methods based on Gabor local line features (Eitz M, Richter R, Boubekeur T, et al. Sketch-based shape retrieval [J]. ACM Transactions on Graphics, 2012.). However, such methods struggle in the feature extraction stage and do not scale to large datasets.
Deep learning is a subfield of machine learning. Since deep learning, represented by convolutional neural networks, won the ImageNet competition in 2012, deep networks have attracted great attention in computer vision. The advent of various 3D sensors, such as Microsoft Kinect and Google Project Tango, has made three-dimensional models easier to acquire, and many large-scale public three-dimensional model datasets exist, such as ShapeNet and ModelNet. Compared with three-dimensional model datasets, two-dimensional image datasets such as ImageNet are larger and more diverse. With such abundant data, extending deep learning to three-dimensional data processing has become a research hotspot, and some results have been obtained in applying deep learning to cross-domain model retrieval. The idea is to use a deep neural network to first obtain feature representations of the image and the three-dimensional model, then construct a shared cross-domain space and compare the distances between the feature descriptors of the two modalities to complete retrieval. A deep neural network can quickly learn effective feature representations from large amounts of data; compared with traditional hand-crafted methods, deep-learning-based cross-domain retrieval of three-dimensional models is widely applicable and greatly improves retrieval performance. For example, Wang et al. used siamese networks for sketch-based three-dimensional model retrieval (Wang F, Kang L, Li Y. Sketch-based 3D shape retrieval using Convolutional Neural Networks [C]. Computer Vision and Pattern Recognition, 2015: 1875-1883.). The DCML method of Dai et al. applies metric learning with discriminative and correlation loss functions to the features obtained from sketches and three-dimensional models (Dai G, Xie J, Zhu F, et al. Deep Correlated Metric Learning for Sketch-based 3D Shape Retrieval [C]// Thirty-First AAAI Conference on Artificial Intelligence, 2017: 4002-4008.). On this basis they proposed the DCHML method, which adds loss terms on the hidden layers of the neural network to improve retrieval performance (Dai G, Xie J, Fang Y, et al. Deep Correlated Holistic Metric Learning for Sketch-Based 3D Shape Retrieval [J]. IEEE Transactions on Image Processing, 2018, 27(7): 3374-3386.). Li et al. filter background noise in images with a pre-trained image convolutional neural network and propose a cross-domain embedding-space framework to reduce the feature gap between images and models (Li Y, Su H, Qi C R, et al. Joint embeddings of shapes and images via CNN image purification [J]. ACM Transactions on Graphics (TOG), 2015, 34(6): 1-12.). However, most existing work retrieves three-dimensional models from hand-drawn sketches as input, while in real life people are more often exposed to images of real environments, which usually carry complex background information including illumination and background pixels outside the retrieval object. Such noise information is irrelevant to the retrieval task; mixed with the effective information, it poses new challenges for cross-domain retrieval.
Therefore, directly applying sketch-to-three-dimensional-model cross-domain retrieval methods to the task of retrieving three-dimensional models from real images reduces retrieval accuracy because of the noise information in real images. Although a convolutional neural network can filter part of this noise, the effect is very limited, so designing an image convolutional neural network that filters the noise information of real images and applying it to the cross-domain retrieval of images and three-dimensional models is of great significance. A three-dimensional model itself contains rich information, and the mainstream approach in current research is to represent it with a group of multi-angle projection views and use mature image deep networks for feature extraction, such as the MVCNN method of Su et al. (Su H, Maji S, Kalogerakis E, et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition [C]// Proceedings of the IEEE International Conference on Computer Vision, 2015: 945-953.). Because views are similar to images, multi-view methods reduce the semantic gap between the image and the three-dimensional model in the feature extraction stage and therefore work well. Su et al. obtain feature descriptors of a three-dimensional model with MVCNN and perform cross-domain retrieval by aligning the feature distributions of the image and the three-dimensional model (Su Y, Li Y, Nie W, et al. Joint Heterogeneous Feature Learning and Distribution Alignment for 2D Image-Based 3D Object Retrieval [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019: 1-1.). Wu et al. project a three-dimensional model into multiple views and design a convolutional neural network to jointly analyze image and three-dimensional model features (Wu Z, Zhang Y, Zeng M, et al. Joint analysis of shapes and images via deep domain adaptation [J]. Computers & Graphics, 2018: 140-147.). However, such retrieval methods fuse the individual projection-view features directly, so a large amount of information is lost in representing the three-dimensional model. Moreover, when the cross-domain joint feature embedding space is constructed through feature learning, the semantic gap between real images and "clean" projection images also reduces retrieval accuracy. How to reduce feature loss, extract more complete and effective three-dimensional model features, and reduce the semantic gap between data of different modalities therefore remains a difficult problem for high-precision image-to-three-dimensional-model retrieval.
In summary, the prior art has the following disadvantages: (1) Most existing three-dimensional model retrieval work is example-based; cross-domain retrieval between three-dimensional models and data of other modalities is relatively underexplored, and a high-precision technique for retrieving three-dimensional models from real images with complex background information is lacking. (2) Existing image-to-three-dimensional-model cross-domain retrieval techniques usually ignore the filtering of real-image noise information, so the acquired image features contain invalid information and exhibit a large semantic gap with respect to the three-dimensional model projection-view features, producing large errors in cross-domain retrieval from complex-background real images to three-dimensional models. (3) Multi-view-based three-dimensional model representation currently performs well in cross-domain image-model retrieval tasks, but such methods tend to fuse the individual projection-view features directly, causing a large information loss and limiting retrieval accuracy.
In short, existing image-to-three-dimensional-model cross-domain retrieval technology lacks accurate feature extraction for real images that filters out noise information irrelevant to the retrieval task, and loses considerable three-dimensional model information when representing the model with multiple views, both of which degrade cross-domain retrieval accuracy.
Disclosure of Invention
The technical problem solved by the invention: overcoming the lack of accurate feature extraction for real images and the loss of three-dimensional model information in prior cross-domain three-dimensional model retrieval, a three-dimensional model cross-domain retrieval method and system based on complex background images are provided. A cross-domain retrieval triplet deep network is designed to construct a joint feature embedding space that reduces the distribution gap between the features of data of different modalities; retrieval accuracy is improved by focusing on effective feature extraction from the three-dimensional model and from RGB images with complex background information, enabling retrieval of similar three-dimensional models from a single RGB image.
The technical solution adopted by the invention is as follows: a three-dimensional model cross-domain retrieval method based on complex background images, comprising the following steps:
(1) construct an original dataset D comprising several different three-dimensional models M and images I with complex backgrounds, the original dataset D being represented by triplets T = (I_A, M_pos, M_neg), where I_A denotes an image serving as the anchor, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
(2) preprocess the image I_A in the triplet T to obtain the processed image I'_A; apply projection processing to the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain the positive projection view group V_pos and the negative projection view group V_neg, and preprocess them respectively to obtain the processed positive projection view group V'_pos and negative projection view group V'_neg, yielding a standard dataset D' represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
(3) for the processed triplets T', construct a cross-domain retrieval triplet deep network model N. The cross-domain retrieval triplet deep network comprises 3 branch networks: 1 accurate image feature extraction network N_I and 2 three-dimensional model grouped-view feature extraction networks N_M with identical structure and shared weights. The input of the accurate image feature extraction network N_I is the processed image I'_A of the processed triplet T', and its output is the image feature vector F_A. The inputs of the three-dimensional model grouped-view feature extraction networks N_M are the processed positive projection view group V'_pos and the negative projection view group V'_neg of the processed triplet T', and their outputs are the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg, respectively. The triplet network is a fusion of deep learning and metric learning and can directly learn the mapping from the sample space to a compact Euclidean space, thereby constructing a joint feature embedding space in which data of different modalities can be measured. Moreover, by contrasting the differences of two inputs, the triplet network models details better and thus improves retrieval accuracy;
(4) apply regularization to the image feature vector F_A, the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg to obtain the regularized image feature vector F'_A, regularized positive three-dimensional model feature vector F'_pos and regularized negative three-dimensional model feature vector F'_neg, and define the loss function L of the cross-domain retrieval triplet deep network model N;
(5) using the processed triplets T', iteratively train the parameters of the cross-domain retrieval triplet deep network model N until the loss function L falls below a set threshold, then stop training, obtaining the trained cross-domain retrieval triplet deep network model N', which comprises 3 trained branch networks: the trained accurate image feature extraction network N'_I and two trained three-dimensional model grouped-view feature extraction networks N'_M. The trained model N' completes the construction of the joint feature embedding space of images and three-dimensional models and provides a measurement basis over data of different domains for retrieval tasks;
(6) when executing a retrieval task, given a query image q and a target three-dimensional model set S, preprocess the query image q to obtain the processed query image q', apply projection processing to each target three-dimensional model S_i in the target set S to obtain its projection view group SV_i, and preprocess it to obtain the processed projection view group SV'_i; input the processed query image q' into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularize it to obtain the regularized image feature vector F'_q; input each processed projection view group SV'_i into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_Si corresponding to the target model S_i, and regularize it to obtain the regularized three-dimensional model feature vector F'_Si; compute the distance D(q, S_i) between the regularized image feature vector F'_q and each regularized three-dimensional model feature vector F'_Si, measure by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sort in descending order of similarity, and select several top-ranked target three-dimensional models S_top as retrieval results similar to the query image q and output them.
In step (2) and step (6), the preprocessing comprises size unification and grayscale conversion. Because the network structure contains fully connected layers, the input data must be resized to a uniform size; size unification resizes the image or the projection views of the three-dimensional model to the same dimensions, and grayscale conversion turns the RGB color image into a grayscale image to eliminate the interference of image color on the retrieval task.
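As an illustration, the preprocessing can be written as a short transform pipeline; the sketch below assumes a PyTorch/torchvision environment and the 256 x 256 target size used in the detailed description (the function name is illustrative):

```python
# Illustrative preprocessing sketch (assumed PyTorch/torchvision environment).
from PIL import Image
from torchvision import transforms

# Resize to a uniform 256x256 (required by the fully connected layers) and
# convert RGB to grayscale to remove the interference of color information.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])

def preprocess_image(path: str):
    """Load an RGB image or projection view and return a 1x256x256 tensor."""
    return preprocess(Image.open(path).convert("RGB"))
```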
In step (2) and step (6), a virtual camera array is set up, the three-dimensional model is rendered with Phong shading, and the depth information of the views is increased by adding illumination, which also reduces the semantic gap between the views and the target real images. Multi-angle projection views of the three-dimensional model are obtained; each three-dimensional model has 12 corresponding projection views, which represent the three-dimensional model as completely as possible while remaining efficient.
In step (3), the accurate image feature extraction network is an AlexNet network with attention blocks added. Its basic structure is consistent with AlexNet, comprising 5 convolutional layers and 3 fully connected layers; an attention block is located between every two consecutive convolutional layers and is formed by connecting 1 channel attention module and 1 spatial attention module in series. The attention block is suited for use between two consecutive convolutional structures; inserting attention blocks between all convolutional layers of the image feature extraction network improves its ability to extract accurate features from the input image and eliminates the influence of the complex background information of real images on the retrieval task.
In step (3), the three-dimensional model grouped-view feature extraction network uses the convolutional structure of the AlexNet network as its base network and includes a grouping sub-network. It comprises all 5 convolutional layers of the AlexNet network, with the grouping sub-network connected after the last convolutional layer. After the last convolutional layer outputs the view feature vectors, the grouping sub-network fuses the view feature vectors into group-level feature vectors, fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector, which it outputs.
In step (3), the grouping sub-network comprises a grouping weight module, a view pooling layer, a group pooling layer and a fully connected layer. The grouping sub-network fuses the view feature vectors into group-level feature vectors, fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it, with the following concrete implementation:
First, the grouping sub-network computes the view discrimination degrees from the view feature vectors and sets up the view groups. Second, the grouping weight module computes the view grouping weights from the view groups and the view discrimination degrees. Third, from the view feature vectors and the view groups, the view pooling layer fuses the view feature vectors into group-level feature vectors and outputs them. Then, from the group-level feature vectors and the view grouping weights, the group pooling layer fuses the group-level feature vectors into the shape-level feature vector and outputs it. Finally, the shape-level feature vector is input into the fully connected layer, fused into the three-dimensional model feature vector and output. The grouping sub-network attends to the similarities and differences between views and introduces grouping weights to distinguish the contributions of views from different perspectives to the model representation. Dividing the three-dimensional model feature extraction process into three stages, view features, group-level features and shape-level features, lets the network attend to the relations between different views while extracting per-view features, improving the representation ability and robustness of the generated three-dimensional model feature vector.
In steps (4) and (6), the regularization is L2 regularization. Compared with other regularization functions, the L2 regularization function is simpler to compute and can simply and effectively control model complexity and prevent overfitting.
In step (4), the loss function is:
L = max(d_pos - d_neg + margin, 0),
where d_pos denotes the distance between the positive pair, i.e. the positive three-dimensional model and the anchor image, d_neg denotes the distance between the negative pair, i.e. the negative three-dimensional model and the anchor image, and margin denotes the set relative distance:
d_pos = ||F'_A - F'_pos||_2,  d_neg = ||F'_A - F'_neg||_2.
in the step (5) and the step (6), the distance is an euclidean distance. Compared with other distance measurement methods, the Euclidean distance measurement method is simpler and more intuitive, and can effectively measure the similarity between the features in a high-dimensional mapping space.
The invention further provides a three-dimensional model cross-domain retrieval system based on complex background images, comprising a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module;
the target three-dimensional model library comprises a target three-dimensional model set S;
the input module is used for inputting a query image q and sending the query image q to the preprocessing module;
the projection processing module is used for processing each target three-dimensional model S in the target three-dimensional data set SiThe projection processing is carried out to obtain the three-dimensional model projection view group SViAnd sending the data to the preprocessing module;
the preprocessing module is used for respectively preprocessing the query data q sent by the input module and the three-dimensional model projection view group SVi sent by the projection processing module to obtain a processed query image q' and a processed three-dimensional model projection view group SVi', and send to the retrieval module;
the retrieval module comprises a trained image precise feature extraction network N'IAnd three-dimensional model grouping view feature extraction network N'MInputting the processed query image q 'sent by the preprocessing module into a trained image precise feature extraction network N'IIn the method, the image characteristic vector F corresponding to the query image q is obtained through outputqRegularization processing is carried out to obtain a regularized image feature vector F'qThe processed projection view set SV sent by the pre-processing modulei' input trained three-dimensional model visual groupingGraph feature extraction network N'MIn the method, a target three-dimensional model S is obtained through outputiCorresponding three-dimensional model feature vectors
Figure BDA0002495479180000091
Regularizing to obtain regularized three-dimensional model feature vector
Figure BDA0002495479180000092
Calculating the regularized image feature vector F'qAnd the regularized three-dimensional model feature vector
Figure BDA0002495479180000093
A distance D (q, S) therebetweeni) At said distance D (q, S)i) Weighing the query image q and each target three-dimensional model SiThe similarity between the two modules is sorted in a descending order to obtain a sorting result, and the sorting result is sent to the output module;
the output module selects a plurality of top-ranked target three-dimensional models S in the ranking result sent by the retrieval moduletopAs a result of a search similar to the query image q and output.
Compared with the prior art, the invention has the beneficial effects that:
compared with the existing mainstream method, the method provided by the invention allows a user to take a single RGB image with complex background information as query input to finish the accurate retrieval of the corresponding three-dimensional model. The method focuses on effective feature extraction of different modal data, aiming at the characteristics of an image with a complex background and a three-dimensional model view group, an image accurate feature extraction network and a three-dimensional model grouping view feature extraction network are respectively designed to extract effective features of different modal data, and then a basis is provided for cross-domain retrieval based on distance calculation by constructing a feature joint embedding space. The advantage of this is that the targeted design of the feature extraction network can improve the feature vector representation capability and robustness, thereby improving the retrieval accuracy.
Specifically, compared with the prior art, the invention has the following technical advantages:
(1) An end-to-end deep metric learning framework is provided: a triplet network is designed to map two-dimensional image and three-dimensional model data into the same high-dimensional space, reducing the gap between inter-domain data features while keeping same-class data within a domain close together and heterogeneous data apart, thereby achieving accurate cross-domain retrieval from a single RGB image with complex background information to a three-dimensional model.
(2) For the complex background information of the query image, an accurate image feature extraction network branch is designed; an attention mechanism is introduced to achieve adaptive refinement learning of image features and eliminate the influence of noise information on cross-domain retrieval accuracy.
(3) The three-dimensional model is represented by 12 projection views from different viewing angles, reducing the feature gap between data of different modalities, and the projection views are rendered to further reduce the semantic gap between the views and the target real images. A grouping mechanism is introduced in the multi-view feature fusion stage, improving the expressive power of the three-dimensional model feature vector and thereby the cross-domain retrieval accuracy.
In summary, to solve the problem of cross-domain retrieval of three-dimensional models from a single RGB image, the invention provides an end-to-end cross-domain retrieval technique and system based on a triplet network in the field of three-dimensional model retrieval. Compared with the prior art, the invention focuses on the noise information of image data and the loss of three-dimensional model features, designs the feature extraction networks accordingly, and improves the representation ability and robustness of the feature vectors, thereby achieving a good retrieval effect.
Drawings
FIG. 1 is a schematic flow chart of a complex background image-based three-dimensional model retrieval method;
FIG. 2 is a diagram of a cross-domain retrieval triple depth network framework;
FIG. 3 is a schematic diagram of an image exact feature extraction network;
FIG. 4 is a schematic view of a channel attention module;
FIG. 5 is a schematic view of a spatial attention module;
FIG. 6 is a schematic diagram of a three-dimensional model group view feature extraction network;
fig. 7 is a schematic structural diagram of a three-dimensional model cross-domain retrieval system based on a complex background image.
Detailed Description
The invention is described in detail below with reference to the figures and the detailed description. FIG. 1 describes the implementation process of the three-dimensional model retrieval method based on a single complex background image. FIG. 2 depicts the process of constructing the joint feature embedding space with the cross-domain retrieval triplet deep network. FIG. 3 depicts the extraction of features from images with complex backgrounds using the accurate image feature extraction network. FIG. 4 shows the channel attention module structure in the attention block of the accurate image feature extraction network. FIG. 5 shows the spatial attention module structure in the attention block of the accurate image feature extraction network. FIG. 6 depicts the feature extraction process for a three-dimensional model using the three-dimensional model grouped-view feature extraction network. FIG. 7 depicts the structure of the three-dimensional model cross-domain retrieval system based on complex background images.
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the method of the present invention includes the following steps:
(1) construct an original dataset D for cross-domain retrieval of three-dimensional models, comprising several different three-dimensional models M and images I with complex backgrounds, represented by triplets T = (I_A, M_pos, M_neg), where I_A denotes an image serving as the anchor, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
(2) preprocess the image data I_A in the triplet dataset T: resize it to a uniform 256 × 256 so that it suits the fully connected layers of the deep neural network, and convert it from an RGB image into a grayscale map to remove the interference of color information on retrieval, obtaining the processed image data I'_A. Apply projection processing to the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain the positive projection view group V_pos and the negative projection view group V_neg. Specifically, the invention sets up a virtual camera array of 12 virtual cameras placed around the three-dimensional model, pointing at its centroid at a 30° angle to the horizontal plane and spaced 30° apart; 12 projection viewing angles represent the three-dimensional model as completely as possible while remaining efficient. After the multi-angle projection views of the three-dimensional model are obtained, the model is rendered with Phong shading, and the depth information of the views is increased by adding illumination, which also reduces the semantic gap between the views and the target real images, yielding the processed positive projection view group V'_pos and negative projection view group V'_neg. Finally, the preprocessed standard dataset D' is represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
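The camera placement described above can be reproduced with a few lines of vector arithmetic; the following numpy sketch assumes an illustrative camera distance parameter (the patent does not specify one):

```python
import numpy as np

def virtual_camera_positions(centroid, distance=2.0, n_views=12, elevation_deg=30.0):
    """Return n_views camera positions on a ring around the model centroid,
    spaced 360/n_views degrees apart at the given elevation angle.
    Each camera looks at the centroid (look-at direction = centroid - position)."""
    elev = np.radians(elevation_deg)
    positions = []
    for k in range(n_views):
        azim = np.radians(k * 360.0 / n_views)  # 0, 30, 60, ... degrees
        offset = distance * np.array([
            np.cos(elev) * np.cos(azim),
            np.cos(elev) * np.sin(azim),
            np.sin(elev),
        ])
        positions.append(np.asarray(centroid) + offset)
    return np.stack(positions)  # shape (12, 3)
```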
(3) taking the processed triplets T' obtained in the preceding steps as input, construct the cross-domain retrieval triplet deep network N and train it. The triplet network is a fusion of deep learning and metric learning and can directly learn the mapping from the sample space to a compact Euclidean space. Moreover, by contrasting the differences of two inputs, the triplet network models details better and thus improves retrieval accuracy. The cross-domain retrieval triplet deep network has three branches corresponding to the input data. Branch one is the accurate image feature extraction network N_I: its input is the image I'_A, it performs adaptive refinement learning of the important image features through the neural network, and it outputs the image feature vector F_A. Branches two and three are the three-dimensional model grouped-view feature extraction networks N_M: their inputs are the positive projection view group V'_pos and the negative projection view group V'_neg respectively, their network structures are identical and their weights are shared, and they attend to the relations between different views while extracting per-view features, yielding the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg. For the characteristics of the input real images and three-dimensional model projection views, the cross-domain retrieval triplet deep network proposed by the invention focuses on completely extracting their effective features, reducing the semantic gap between real images and projection images and improving retrieval accuracy.
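Because the two view branches share weights, a single grouped-view network instance can serve both in code. A schematic forward pass, with the two sub-networks treated as black boxes defined further below (names are illustrative):

```python
import torch.nn as nn

class TripletCrossDomainNet(nn.Module):
    """Three branches: one image network N_I, one grouped-view network N_M
    applied to both the positive and the negative view groups (shared weights)."""
    def __init__(self, image_net: nn.Module, view_net: nn.Module):
        super().__init__()
        self.image_net = image_net  # accurate image feature extraction network
        self.view_net = view_net    # grouped-view feature extraction network

    def forward(self, anchor_img, pos_views, neg_views):
        f_anchor = self.image_net(anchor_img)  # image feature vector F_A
        f_pos = self.view_net(pos_views)       # positive model feature vector F_pos
        f_neg = self.view_net(neg_views)       # negative model feature vector F_neg (same weights)
        return f_anchor, f_pos, f_neg
```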
A real image contains complex background information irrelevant to retrieval, and these noise features reduce retrieval precision. Moreover, when the cross-domain joint feature embedding space is constructed through feature learning, the semantic gap between real images and "clean" projection images also reduces retrieval accuracy. Although directly using an AlexNet network for feature extraction can weaken part of the noise features, the effect is not ideal. Therefore, the accurate image feature extraction network designed by the invention is based on the AlexNet network; by adding an attention mechanism, the network pays more attention to object information when learning image features, eliminating the influence of complex background information on the retrieval task and obtaining the accurate features of the image.
The structure of the accurate image feature extraction network N_I is shown in FIG. 3. The basic network structure is consistent with the AlexNet network and comprises 5 convolutional layers and 3 fully connected layers, with an attention block between every two consecutive convolutional layers; there are 4 attention blocks, of which only one is shown in FIG. 3 as an example. Convolutional layer 1 has an 11 × 11 kernel with stride 4, followed by an LRN layer (local response normalization) and then a 3 × 3 max-pooling layer with stride 2. Convolutional layer 2 has a 5 × 5 kernel with stride 1, followed by an LRN layer and then a 3 × 3 max-pooling layer with stride 2. Convolutional layers 3, 4 and 5 all have 3 × 3 kernels with stride 1. Convolutional layers 1 and 2, 2 and 3, 3 and 4, and 4 and 5 are connected by attention blocks 1, 2, 3 and 4 respectively, each formed by one channel attention module and one spatial attention module in series. Convolutional layer 5 is followed by a 3 × 3 max-pooling layer with stride 2, after which come three fully connected layers: fully connected layers 1 and 2 have dimension 4096, and fully connected layer 3 is the output layer with dimension 128. An activation function layer follows each of the 5 convolutional layers and fully connected layers 1 and 2, using the Relu function.
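A PyTorch sketch of this branch following the layer sizes above is given below; the padding values and the AlexNet channel counts 96/256/384/384/256 are assumptions, since the patent text does not list them, and attention_block stands for the channel + spatial module sketched after FIG. 5:

```python
import torch.nn as nn

def conv_block(cin, cout, k, stride, pad, lrn=False, pool=False):
    # Conv -> Relu (-> LRN -> 3x3/2 max-pool), as described for each layer above.
    layers = [nn.Conv2d(cin, cout, k, stride=stride, padding=pad), nn.ReLU(inplace=True)]
    if lrn:
        layers.append(nn.LocalResponseNorm(size=5))
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))
    return nn.Sequential(*layers)

class AccurateImageFeatureNet(nn.Module):
    """AlexNet-style trunk with an attention block between consecutive conv layers."""
    def __init__(self, attention_block, in_ch=1):
        super().__init__()
        self.conv1 = conv_block(in_ch, 96, 11, 4, 2, lrn=True, pool=True)
        self.conv2 = conv_block(96, 256, 5, 1, 2, lrn=True, pool=True)
        self.conv3 = conv_block(256, 384, 3, 1, 1)
        self.conv4 = conv_block(384, 384, 3, 1, 1)
        self.conv5 = conv_block(384, 256, 3, 1, 1, pool=True)
        self.att1, self.att2 = attention_block(96), attention_block(256)
        self.att3, self.att4 = attention_block(384), attention_block(384)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, 4096), nn.ReLU(inplace=True),  # for 256x256 input
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 128),  # output layer: 128-d image feature vector
        )

    def forward(self, x):
        x = self.att1(self.conv1(x))  # attention block 1 between conv layers 1 and 2
        x = self.att2(self.conv2(x))
        x = self.att3(self.conv3(x))
        x = self.att4(self.conv4(x))
        return self.fc(self.conv5(x))
```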
The accurate image feature extraction network N_I extracts the features of the important objects in the image through an attention mechanism: within an attention block, the channel attention module focuses on which features are meaningful, and the spatial attention module focuses on where the features are meaningful. After the output feature F of the preceding convolutional layer is obtained, it is multiplied by the channel attention weight map A_c to obtain the channel-adaptive feature F_1, which is then multiplied by the spatial attention weight map A_s to obtain the spatially adaptive feature F_2. The overall process is defined as follows:
F_1 = A_c(F) ⊗ F,
F_2 = A_s(F_1) ⊗ F_1,
where ⊗ denotes element-wise multiplication.
Channel attention is shown in FIG. 4. The input feature F is an H × W × C tensor, where H denotes height, W width and C the number of channels. The channel attention module applies max-pooling and average-pooling to F along the spatial dimensions to obtain two 1 × 1 × C channel feature vectors. The advantage of using both pooling methods is that max-pooling retains more texture information, while average-pooling receives feedback from every pixel of the feature map and can transfer information completely while reducing dimensionality. The two vectors are fed into a multilayer perceptron with one hidden layer, with Relu activation and a parameter compression ratio of 16; the two resulting feature vectors are then added and passed through a Sigmoid activation function to obtain the 1 × 1 × C channel attention weight map A_c.
The feature F_1 obtained by filtering the image feature channels with the channel attention module is also an H × W × C tensor. Similar in principle to the channel attention module, the spatial attention module applies max-pooling and average-pooling to F_1 along the channel dimension to obtain two H × W × 1 spatial feature maps, as shown in FIG. 5. These are concatenated along the channel dimension into a 2-channel feature map, and a 7 × 7 convolutional layer followed by a Sigmoid activation function generates the H × W × 1 spatial attention weight map A_s. Finally, the regionally filtered new feature F_2 is obtained and can be fed into the next convolutional layer for further feature extraction, ultimately producing the accurate features of the image.
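A self-contained PyTorch sketch of the two modules and the resulting attention block, following FIGs. 4 and 5 (compression ratio 16 for the shared MLP, 7 × 7 convolution for the spatial map):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """1x1xC weight map A_c from spatial max- and average-pooling + shared MLP (FIG. 4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_feat = self.mlp(x.amax(dim=(2, 3)))              # max-pool over H, W
        avg_feat = self.mlp(x.mean(dim=(2, 3)))              # average-pool over H, W
        a_c = torch.sigmoid(max_feat + avg_feat).view(b, c, 1, 1)
        return x * a_c                                       # F_1 = A_c(F) ⊗ F

class SpatialAttention(nn.Module):
    """HxWx1 weight map A_s from channel-wise max/avg pooling + 7x7 conv (FIG. 5)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                    # x: (B, C, H, W)
        pooled = torch.cat([x.amax(dim=1, keepdim=True),
                            x.mean(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        a_s = torch.sigmoid(self.conv(pooled))
        return x * a_s                                       # F_2 = A_s(F_1) ⊗ F_1

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention, used between conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```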
The invention adopts a multi-view method to extract three-dimensional model features. Existing cross-domain retrieval work usually ignores the interconnections between different views, causing feature loss. The invention designs a three-dimensional model grouped-view feature extraction network N_M that divides the model feature extraction process into three stages, view features, group-level features and shape-level features, and mines inter-view information on top of per-view feature extraction. Considering the characteristics of the views, such as high similarity between some views, large differences between others, and the differing contributions of views from different angles to the model representation, a grouping mechanism and grouping weights are introduced so that the extracted feature vectors have better representation ability and robustness, improving cross-domain retrieval accuracy. The structure of the three-dimensional model grouped-view feature extraction network is shown in FIG. 6.
After the three-dimensional model grouped-view feature extraction network completes feature extraction for each view, a grouping sub-network is added to divide the view features into groups and compute the corresponding weights, fusing the view feature vectors into group-level feature vectors, fusing the group-level feature vectors into a shape-level feature vector, and finally fusing the shape-level feature vector into the three-dimensional model feature vector that is output. The view feature vectors are extracted by a 5-layer convolutional network whose structure is consistent with the first 5 convolutional layers of the AlexNet network in the accurate image feature extraction network; since the view images are "clean" images without background, no attention layers need to be added.
After the view features are obtained, they are put into the grouping sub-network for division, yielding a grouping scheme and grouping weights. Specifically, the view discrimination degree is computed from the view features and grouping is performed according to this degree. The discrimination degree is defined as follows:
D(V_i) = Sigmoid(log(abs(f(V_i))))
where V_i denotes an input view, f(V_i) denotes the view feature vector extracted by the 5-layer convolutional network, and D(V_i) denotes the discrimination score of the view. After the Sigmoid mapping, the discrimination scores of the views lie in (0, 1); the log(·) and abs(·) functions make the scores uniformly distributed.
After the discrimination degree of each view is obtained, the interval (0, 1) is divided into 4 equal-length sub-intervals; the discrimination degree of each view is checked in turn, and views falling in the same sub-interval are put into the same group, yielding the grouping scheme G_j, j = 1, 2, 3, 4. The output of the grouping scheme contains the view numbers and discrimination degrees of each group.
The grouping weight module computes the weight W(G_j) of each group from the grouping scheme and the discrimination degrees of the views in the group, for use in the group-level feature fusion step: a group with a higher sum of discrimination scores receives a larger weight, and vice versa. [The weight formula survives only as an image placeholder in the source; it is built from the ceiling function Ceil(·) applied over the in-group discrimination scores D(V_i) and from |G_j|, the number of views projected within each group.]
Views in the same group have similar discrimination degrees, similar in-group image features and similar model representation ability, so the in-group view features can be fused according to the information provided by the grouping scheme. The in-group view feature fusion is completed by a view pooling layer, a pooling layer of the multi-view convolutional neural network dedicated to fusing several view feature vectors. The view feature fusion process is defined as follows:
λ(V_i, G_j) = 1 if V_i ∈ G_j, and 0 otherwise,
F(G_j) = (1/N) Σ_i λ(V_i, G_j) · f(V_i),
where F(G_j) denotes the group-level feature vector, λ is used to decide whether a view belongs to the group, and N denotes the number of views in the group.
The group-level features are fused according to the results of the grouping weight module to obtain the shape-level feature vector, and the final three-dimensional model feature vector is output through a fully connected layer of dimension 128. The group-level feature fusion process is defined as follows:
F(S) = Σ_{j=1}^{M} W(G_j) · F(G_j) / Σ_{j=1}^{M} W(G_j),
where F(S) denotes the shape-level feature vector obtained by fusing the group-level features and M is the number of groups (M = 4).
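The grouping sub-network can be sketched as follows; since the weight formula above survives only partially, this sketch uses a per-view mean reduction for the discrimination score and the mean in-group score as the group weight, both of which are assumptions standing in for the patent's Ceil-based formula:

```python
import torch
import torch.nn as nn

class GroupingSubNetwork(nn.Module):
    """Fuses 12 view feature vectors into one model feature vector in three stages:
    view features -> group-level features -> shape-level feature (then a 128-d fc)."""
    def __init__(self, feat_dim, n_groups=4, out_dim=128):
        super().__init__()
        self.n_groups = n_groups
        self.fc = nn.Linear(feat_dim, out_dim)

    def forward(self, view_feats):              # view_feats: (n_views, feat_dim)
        # Discrimination score D(V_i) = Sigmoid(log(abs(f(V_i)))), reduced to one
        # scalar per view (reduction by mean is an assumption).
        scores = torch.sigmoid(torch.log(view_feats.abs().mean(dim=1) + 1e-12))
        # Assign each view to one of n_groups equal-length sub-intervals of (0, 1).
        group_idx = (scores * self.n_groups).clamp(max=self.n_groups - 1e-6).long()

        group_feats, group_weights = [], []
        for j in range(self.n_groups):
            mask = group_idx == j
            if mask.any():
                # View pooling: average the view features inside the group.
                group_feats.append(view_feats[mask].mean(dim=0))
                # Group weight from the in-group discrimination scores (mean used
                # here as a stand-in for the patent's Ceil-based formula).
                group_weights.append(scores[mask].mean())
        feats = torch.stack(group_feats)                    # (n_nonempty_groups, feat_dim)
        w = torch.stack(group_weights)
        shape_feat = (w.unsqueeze(1) * feats).sum(0) / w.sum()  # weighted group fusion
        return self.fc(shape_feat)                          # 128-d model feature vector
```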
(4) Through step (3), the cross-domain retrieval triplet deep network yields the image feature vector F_A, the positive three-dimensional model feature vector F_pos and the negative three-dimensional model feature vector F_neg. Each is then normalized with the L2 regularization function to obtain the regularized image feature vector F'_A, regularized positive three-dimensional model feature vector F'_pos and regularized negative three-dimensional model feature vector F'_neg. Compared with other regularization functions, the L2 regularization function is simpler to compute and can simply and effectively control model complexity and prevent overfitting. The process is defined as follows:
v' = v / max(||v||_2, ε),
where v denotes the feature vector, each element of which is divided by max(||v||_2, ε), and ε = 1e-12 guards against division by zero.
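This matches the standard L2 normalization provided, for example, by PyTorch's F.normalize; a minimal sketch:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(128)                              # a feature vector from either branch
feat_l2 = F.normalize(feat, p=2, dim=0, eps=1e-12)   # v / max(||v||_2, eps)
```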
(5) The loss function L of the cross-domain retrieval triplet deep network is applied to the image feature vector F'_A and the three-dimensional model feature vectors F'_pos and F'_neg to construct the joint embedding feature space: the Euclidean distance measures the similarity between feature vectors, the feature vectors of data from different domains are mapped into the same high-dimensional space, and the gap between domains is reduced while same-class data within the space lie close together and heterogeneous data lie far apart. Let the distance between the positive pair be d_pos = ||F'_A - F'_pos||_2 and the distance between the negative pair be d_neg = ||F'_A - F'_neg||_2. The loss function is defined as follows:
L = max(d_pos - d_neg + margin, 0)
where margin is a set relative distance that prevents the cross-domain retrieval triplet deep network model from taking shortcuts during training and producing erroneous results.
Using the processed triplets T', the parameters of the cross-domain retrieval triplet deep network model N are trained iteratively until the loss function L falls below the set threshold; training then stops, yielding the trained cross-domain retrieval triplet deep network model N'. The triplet network constructs a joint embedding feature space for cross-domain data, so that image data features and three-dimensional model data features are distributed in the same space in clusters by category, and the similarity of data of different modalities can be measured by directly computing the distance between features.
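A schematic training step under these definitions is sketched below; PyTorch's built-in TripletMarginLoss implements max(d_pos - d_neg + margin, 0) with Euclidean distances, and the margin value, learning rate and epoch cap are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train(model, triplet_loader, threshold, margin=0.5, lr=1e-4, max_epochs=100):
    """model: the triplet network sketched earlier; triplet_loader yields
    (anchor_img, pos_views, neg_views) batches built from the processed triplets T'."""
    criterion = torch.nn.TripletMarginLoss(margin=margin, p=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for anchor_img, pos_views, neg_views in triplet_loader:
            f_a, f_p, f_n = model(anchor_img, pos_views, neg_views)
            # L2-regularize the three feature vectors before measuring distances.
            f_a, f_p, f_n = (F.normalize(v, dim=1) for v in (f_a, f_p, f_n))
            loss = criterion(f_a, f_p, f_n)   # L = max(d_pos - d_neg + margin, 0)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < threshold:       # stop once the loss is below the threshold
                return model
    return model
```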
(6) When executing a retrieval task, given a query image q and a target three-dimensional model set S, the query image q is first preprocessed to obtain the image q', and each target three-dimensional model S_i in the target set S undergoes projection processing to obtain the view group SV_i, which is preprocessed to obtain the view group SV'_i. The query image q' is then input into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q; regularization yields the image feature vector F'_q. Each view group SV'_i is input into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the feature vector F_Si corresponding to the target model S_i; regularization yields the feature vector F'_Si. Finally, the Euclidean distance D(q, S_i) between the image feature vector F'_q and each three-dimensional model feature vector F'_Si is computed; the distance D(q, S_i) measures the similarity between the query image q and each target three-dimensional model S_i. The models are sorted in descending order of similarity, and the 5 top-ranked target three-dimensional models S_top are selected as retrieval results similar to the query image q and output. Compared with other distance metrics, the Euclidean distance is simpler and more intuitive, and effectively measures the similarity between features in the high-dimensional mapping space. It is defined as follows:
D(q, S_i) = sqrt( Σ_{k=1}^{n} (f_q,k - f_Si,k)² ),
where q is the query image, S_i the three-dimensional model compared with it, f_q,k and f_Si,k the elements of the image and three-dimensional model feature vectors in the joint embedding space, and n the dimension of the feature vectors.
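Retrieval then reduces to a nearest-neighbour search by Euclidean distance in the joint embedding space; a sketch returning the 5 top-ranked models:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, model_feats, top_k=5):
    """query_feat: (128,) image feature F_q; model_feats: (n_models, 128) model
    features F_Si. Returns the indices of the top_k most similar target models."""
    q = F.normalize(query_feat, dim=0)                  # regularized F'_q
    m = F.normalize(model_feats, dim=1)                 # regularized F'_Si
    dists = torch.cdist(q.unsqueeze(0), m).squeeze(0)   # Euclidean distances D(q, S_i)
    return torch.argsort(dists)[:top_k]                 # ascending distance = descending similarity
```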
The three-dimensional model cross-domain retrieval system based on complex background images is shown in FIG. 7 and comprises a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module:
a target three-dimensional model library comprising a target three-dimensional model set S;
the input module is used for inputting the query image q and sending the query image q to the preprocessing module;
a projection processing module for applying projection processing to each target three-dimensional model S_i in the target three-dimensional model set S to obtain its projection view group SV_i and sending it to the preprocessing module;
a preprocessing module for preprocessing the query image q sent by the input module and the projection view groups SV_i sent by the projection processing module, obtaining the processed query image q' and processed projection view groups SV'_i, and sending them to the retrieval module;
a retrieval module comprising the trained accurate image feature extraction network N'_I and the trained three-dimensional model grouped-view feature extraction network N'_M; the processed query image q' sent by the preprocessing module is input into the trained accurate image feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularization yields the regularized image feature vector F'_q; each processed projection view group SV'_i sent by the preprocessing module is input into the trained three-dimensional model grouped-view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_Si corresponding to the target model S_i, and regularization yields the regularized three-dimensional model feature vector F'_Si; the distance D(q, S_i) between the regularized image feature vector F'_q and each regularized three-dimensional model feature vector F'_Si is computed, the distance D(q, S_i) measures the similarity between the query image q and each target three-dimensional model S_i, the models are sorted in descending order to obtain a ranking result, and the ranking result is sent to the output module;
the output module selects the top 5 target three-dimensional models S in the sequencing result sent by the retrieval moduletopAs a result of the search similar to the query image q and output.

Claims (8)

1. A three-dimensional model cross-domain retrieval method based on a complex background image is characterized by comprising the following steps:
step 1) constructing an original data set D comprising a number of different three-dimensional models M and images I with complex backgrounds, the original data set D being represented as triplets T = (I_A, M_pos, M_neg), where I_A denotes an anchor image, M_pos denotes a positive three-dimensional model of the same class as the image I_A, and M_neg denotes a negative three-dimensional model of a different class from the image I_A;
step 2) preprocessing the image I_A in the triplet T to obtain a processed image I'_A; respectively carrying out projection processing on the positive three-dimensional model M_pos and the negative three-dimensional model M_neg in the triplet T to obtain a positive three-dimensional model projection view group V_pos and a negative three-dimensional model projection view group V_neg, and respectively preprocessing them to obtain a processed positive projection view group V'_pos and a processed negative projection view group V'_neg, thereby obtaining a standard data set D' represented by processed triplets T' = (I'_A, V'_pos, V'_neg);
step 3) constructing a cross-domain retrieval triplet deep network model N for the processed triplets T', the cross-domain retrieval triplet deep network comprising 3 branch networks: 1 image accurate feature extraction network N_I and 2 three-dimensional model grouping view feature extraction networks N_M with the same structure and shared weights, wherein the input of the image accurate feature extraction network N_I is the processed image I'_A in the processed triplet T' and its output is the image feature vector F_{I_A}; the inputs of the three-dimensional model grouping view feature extraction networks N_M are the processed positive projection view group V'_pos and the processed negative projection view group V'_neg in the processed triplet T', and their outputs are respectively the positive three-dimensional model feature vector F_{M_pos} and the negative three-dimensional model feature vector F_{M_neg};
The image accurate feature extraction network is an AlexNet network comprising attention blocks, the network comprises 5 convolutional layers and 3 full-connection layers, and the attention blocks are positioned between every two convolutional layers connected in front and back and are formed by connecting 1 channel attention module and 1 space attention module in series;
the three-dimensional model grouping view feature extraction network takes the convolutional structure of the AlexNet network as its base network and comprises a grouping sub-network; the three-dimensional model grouping view feature extraction network comprises all 5 convolutional layers of the AlexNet network, with the grouping sub-network connected after the last convolutional layer; after the last convolutional layer outputs the view feature vectors, the grouping sub-network fuses the view feature vectors into group-level feature vectors, then fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it;
step 4) regularizing the image feature vector F_{I_A}, the positive three-dimensional model feature vector F_{M_pos} and the negative three-dimensional model feature vector F_{M_neg} to obtain the regularized image feature vector F'_{I_A}, the regularized positive three-dimensional model feature vector F'_{M_pos} and the regularized negative three-dimensional model feature vector F'_{M_neg}, and defining a loss function L of the cross-domain retrieval triplet deep network model N;
step 5) iteratively training the parameters of the cross-domain retrieval triplet deep network model N with the processed triplets T' until the loss function L is smaller than a set threshold, then stopping training to obtain the trained cross-domain retrieval triplet deep network model N', thereby completing the construction of the joint feature embedding space for the images I_A and the three-dimensional models M, wherein the trained cross-domain retrieval triplet deep network model N' comprises 3 trained branch networks: a trained image accurate feature extraction network N'_I and two trained three-dimensional model grouping view feature extraction networks N'_M;
step 6) when a retrieval task is executed, given a query image q and a target three-dimensional model set S, preprocessing the query image q to obtain a processed query image q', carrying out the projection processing on each target three-dimensional model S_i in the target three-dimensional model set S to obtain a three-dimensional model projection view group SV_i, and then carrying out the preprocessing to obtain a processed three-dimensional model projection view group SV'_i; inputting the processed query image q' into the trained image accurate feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, and regularizing it to obtain the regularized image feature vector F'_q; inputting the processed projection view group SV'_i into the trained three-dimensional model grouping view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_{S_i} corresponding to the target three-dimensional model S_i, and regularizing it to obtain the regularized three-dimensional model feature vector F'_{S_i}; calculating the distance D(q, S_i) between the regularized image feature vector F'_q and the regularized three-dimensional model feature vector F'_{S_i}, measuring by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sorting the models in descending order of similarity, and selecting a plurality of top-ranked target three-dimensional models S_top, which are output as the retrieval results similar to the query image q.
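As an illustration of the attention block recited in claim 1 (1 channel attention module and 1 spatial attention module connected in series, between adjacent convolutional layers), a minimal PyTorch sketch in the spirit of CBAM follows; the pooling choices and reduction ratio are assumptions, since the claim does not fix them:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        # Pool over spatial dims, compute a weight per channel, rescale x.
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool over channels, compute one weight per spatial location.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class AttentionBlock(nn.Module):
    """Channel attention followed in series by spatial attention, as
    inserted between adjacent convolutional layers of AlexNet."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```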
2. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the steps 2) and 6), the preprocessing comprises size unification and grayscale processing: the size unification unifies the sizes of the images or the three-dimensional model projection views, and the grayscale processing converts RGB color images into grayscale images.
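A small sketch of the claimed preprocessing with torchvision transforms; the 227×227 target size is an assumption (a common AlexNet input size), not stated in the claim:

```python
from torchvision import transforms

# Unify the size of images / projection views, then convert RGB to grayscale.
preprocess = transforms.Compose([
    transforms.Resize((227, 227)),                # size unification (assumed size)
    transforms.Grayscale(num_output_channels=1),  # RGB -> grayscale
    transforms.ToTensor(),
])
# processed = preprocess(pil_image)
```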
3. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the steps 2) and 6), the projection processing is implemented by setting up a virtual camera array and rendering the three-dimensional model with Phong shading to obtain multi-angle projection views of the three-dimensional model, each three-dimensional model corresponding to 12 projection views.
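For illustration, the sketch below computes positions for such a virtual camera array in NumPy, assuming the common multi-view arrangement of 12 cameras spaced 30° apart at a fixed elevation around a model centered at the origin; the radius and elevation are assumptions, and the Phong-shaded renderer itself is left abstract:

```python
import numpy as np

def camera_array(n_views=12, radius=2.0, elevation_deg=30.0):
    """Positions of n_views virtual cameras on a circle around the model,
    spaced 360/n_views degrees apart at a fixed elevation, all looking at
    the origin (where the 3D model is assumed to be centered)."""
    elev = np.radians(elevation_deg)
    azimuths = np.radians(np.arange(n_views) * 360.0 / n_views)
    return np.stack([radius * np.cos(elev) * np.cos(azimuths),
                     radius * np.cos(elev) * np.sin(azimuths),
                     np.full(n_views, radius * np.sin(elev))], axis=1)

# views = [render_phong(model, cam) for cam in camera_array()]  # renderer assumed
```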
4. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein the grouping sub-network comprises a grouping weight module, a view pooling layer, a group pooling layer and a fully connected layer, and the grouping sub-network fuses the view feature vectors into group-level feature vectors, then fuses the group-level feature vectors into a shape-level feature vector, and finally fuses the shape-level feature vector into the three-dimensional model feature vector and outputs it by:
firstly, the grouping sub-network calculates the discrimination of each view from its view feature vector and assigns the views to view groups; secondly, the grouping weight module calculates the view group weights from the view grouping and the view discrimination; thirdly, according to the view feature vectors and the view grouping, the view pooling layer fuses the view feature vectors into group-level feature vectors and outputs them; then, according to the group-level feature vectors and the view group weights, the group pooling layer fuses the group-level feature vectors into the shape-level feature vector and outputs it; finally, the shape-level feature vector is input into the fully connected layer, fused into the three-dimensional model feature vector and output.
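A simplified PyTorch sketch of this view-to-group-to-shape fusion, under stated assumptions: view discrimination is scored by a small learned gate, views are binned into a fixed number of groups by that score, view pooling is element-wise max within a group, and group pooling is a weighted average. The claim fixes the stages and modules, not these particulars:

```python
import torch
import torch.nn as nn

class GroupingSubNetwork(nn.Module):
    def __init__(self, feat_dim, out_dim, n_groups=4):
        super().__init__()
        self.n_groups = n_groups
        self.score = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.fc = nn.Linear(feat_dim, out_dim)  # shape level -> model feature

    def forward(self, view_feats):               # (n_views, feat_dim)
        d = self.score(view_feats).squeeze(-1)   # view discrimination in (0, 1)
        # Assign each view to a group by discretizing its discrimination score.
        groups = torch.clamp((d * self.n_groups).long(), max=self.n_groups - 1)
        group_feats, group_weights = [], []
        for g in range(self.n_groups):
            mask = groups == g
            if mask.any():
                # View pooling: fuse one group's view features (max pooling).
                group_feats.append(view_feats[mask].amax(dim=0))
                # Group weight: mean discrimination of the group's views.
                group_weights.append(d[mask].mean())
        feats = torch.stack(group_feats)          # (n_used_groups, feat_dim)
        w = torch.stack(group_weights)
        w = w / w.sum()
        # Group pooling: weighted fusion of group-level features -> shape level.
        shape_feat = (w.unsqueeze(-1) * feats).sum(dim=0)
        # Fully connected layer: shape level -> 3D model feature vector.
        return self.fc(shape_feat)
```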
5. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the step 4) and the step 6), the regularization process is L2 regularization.
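In practice, the L2 regularization of a feature vector amounts to scaling it to unit Euclidean norm, e.g. in PyTorch (shown for clarity; the patent does not prescribe an implementation, and the 4096-dimensional example vector is only an assumption):

```python
import torch
import torch.nn.functional as F

f = torch.randn(4096)                        # e.g. an image or model feature vector
f_regularized = F.normalize(f, p=2, dim=-1)  # scale to unit L2 norm
```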
6. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 1, wherein in the step 4), the loss function is defined as:
L = max(d_pos − d_neg + margin, 0),
where d_pos denotes the distance between the positive sample pair, the positive sample pair being a positive three-dimensional model and an anchor image; d_neg denotes the distance between the negative sample pair, the negative sample pair being a negative three-dimensional model and an anchor image; and margin denotes a set relative distance,
d_pos = D(F'_{I_A}, F'_{M_pos}),
d_neg = D(F'_{I_A}, F'_{M_neg}),
where D(·,·) is the distance between two feature vectors in the joint feature embedding space.
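A sketch of this loss, together with the iterative training of step 5) that stops once the loss falls below a set threshold; the optimizer, learning rate, margin value and threshold are assumptions, and the network names in the usage comment are illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.5):
    """L = max(d_pos - d_neg + margin, 0) on L2-regularized features."""
    f_a = F.normalize(f_anchor, dim=-1)
    f_p = F.normalize(f_pos, dim=-1)
    f_n = F.normalize(f_neg, dim=-1)
    d_pos = torch.norm(f_a - f_p, dim=-1)  # anchor image vs. positive model
    d_neg = torch.norm(f_a - f_n, dim=-1)  # anchor image vs. negative model
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Training sketch: image_net is N_I, model_net is N_M (the two model branches
# share weights); hyper-parameters are assumptions.
# opt = torch.optim.Adam(list(image_net.parameters())
#                        + list(model_net.parameters()), lr=1e-4)
# for img, v_pos, v_neg in loader:
#     loss = triplet_loss(image_net(img), model_net(v_pos), model_net(v_neg))
#     opt.zero_grad(); loss.backward(); opt.step()
#     if loss.item() < threshold:  # stop when below the set threshold
#         break
```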
7. The method for cross-domain retrieval of three-dimensional models based on complex background images as claimed in claim 6, wherein the distance is a Euclidean distance.
8. A three-dimensional model cross-domain retrieval system based on a complex background image is characterized by comprising: the system comprises a target three-dimensional model library, an input module, a projection processing module, a preprocessing module, a retrieval module and an output module;
the target three-dimensional model library comprises a target three-dimensional model set S;
the input module is used for inputting a query image q and sending the query image q to the preprocessing module;
the projection processing module is used for carrying out the projection processing on each target three-dimensional model S_i in the target three-dimensional model set S to obtain the three-dimensional model projection view group SV_i and sending it to the preprocessing module;
the preprocessing module is used for respectively carrying out the preprocessing on the query image q sent by the input module and the three-dimensional model projection view group SV_i sent by the projection processing module to obtain a processed query image q' and a processed three-dimensional model projection view group SV'_i, and sending them to the retrieval module;
the retrieval module comprises the trained image accurate feature extraction network N'_I and the trained three-dimensional model grouping view feature extraction network N'_M; the processed query image q' sent by the preprocessing module is input into the trained image accurate feature extraction network N'_I, which outputs the image feature vector F_q corresponding to the query image q, regularized to obtain the regularized image feature vector F'_q; the processed projection view group SV'_i sent by the preprocessing module is input into the trained three-dimensional model grouping view feature extraction network N'_M, which outputs the three-dimensional model feature vector F_{S_i} corresponding to the target three-dimensional model S_i, regularized to obtain the regularized three-dimensional model feature vector F'_{S_i}; the retrieval module calculates the distance D(q, S_i) between the regularized image feature vector F'_q and the regularized three-dimensional model feature vector F'_{S_i}, measures by the distance D(q, S_i) the similarity between the query image q and each target three-dimensional model S_i, sorts the models in descending order of similarity to obtain a ranking result, and sends the ranking result to the output module;
the output module selects a plurality of top-ranked target three-dimensional models S in the ranking result sent by the retrieval moduletopAs a result of a search similar to the query image q and output.
CN202010417173.9A 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image Pending CN111625667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010417173.9A CN111625667A (en) 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image


Publications (1)

Publication Number Publication Date
CN111625667A true CN111625667A (en) 2020-09-04

Family

ID=72259810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010417173.9A Pending CN111625667A (en) 2020-05-18 2020-05-18 Three-dimensional model cross-domain retrieval method and system based on complex background image

Country Status (1)

Country Link
CN (1) CN111625667A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411794A (en) * 2011-07-29 2012-04-11 南京大学 Output method of two-dimensional (2D) projection of three-dimensional (3D) model based on spherical harmonic transform
US20180011620A1 (en) * 2016-07-11 2018-01-11 The Boeing Company Viewpoint Navigation Control for Three-Dimensional Visualization Using Two-Dimensional Layouts
CN110188228A (en) * 2019-05-28 2019-08-30 北方民族大学 Cross-module state search method based on Sketch Searching threedimensional model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Yujia et al., "Single-image three-dimensional model retrieval based on a triplet network", Journal of Beijing University of Aeronautics and Astronautics (北京航空航天大学学报) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308153A (en) * 2020-11-02 2021-02-02 创新奇智(广州)科技有限公司 Smoke and fire detection method and device
CN112308153B (en) * 2020-11-02 2023-11-24 创新奇智(广州)科技有限公司 Firework detection method and device
WO2022097302A1 (en) * 2020-11-09 2022-05-12 富士通株式会社 Generation program, generation method, and information processing device
JP7452695B2 (en) 2020-11-09 2024-03-19 富士通株式会社 Generation program, generation method, and information processing device
CN112270762A (en) * 2020-11-18 2021-01-26 天津大学 Three-dimensional model retrieval method based on multi-mode fusion
CN112686884A (en) * 2021-01-12 2021-04-20 李成龙 Automatic modeling system and method for imaging marking characteristics
CN113032613A (en) * 2021-03-12 2021-06-25 哈尔滨理工大学 Three-dimensional model retrieval method based on interactive attention convolution neural network
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN112905832B (en) * 2021-05-07 2021-08-03 广东众聚人工智能科技有限公司 Complex background fine-grained image retrieval system and method
CN112905832A (en) * 2021-05-07 2021-06-04 广东众聚人工智能科技有限公司 Complex background fine-grained image retrieval system and method
CN113177525A (en) * 2021-05-27 2021-07-27 杭州有赞科技有限公司 AI electronic scale system and weighing method
CN113779287A (en) * 2021-09-02 2021-12-10 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN113779287B (en) * 2021-09-02 2023-09-15 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN115985509A (en) * 2022-12-14 2023-04-18 广东省人民医院 Medical imaging data retrieval system, method, device and storage medium
CN117540043A (en) * 2023-12-11 2024-02-09 济南大学 Three-dimensional model retrieval method and system based on cross-instance and category comparison
CN117540043B (en) * 2023-12-11 2024-04-12 济南大学 Three-dimensional model retrieval method and system based on cross-instance and category comparison

Similar Documents

Publication Publication Date Title
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
Cong et al. Going from RGB to RGBD saliency: A depth-guided transformation model
Qi et al. Review of multi-view 3D object recognition methods based on deep learning
Lin et al. CODE: Coherence based decision boundaries for feature correspondence
Liu et al. Multi-modal clique-graph matching for view-based 3d model retrieval
Feng et al. Relation graph network for 3D object detection in point clouds
Cong et al. Global-and-local collaborative learning for co-salient object detection
Li et al. Multi-scale neighborhood feature extraction and aggregation for point cloud segmentation
CN110674741B (en) Gesture recognition method in machine vision based on double-channel feature fusion
Tian et al. Densely connected attentional pyramid residual network for human pose estimation
Liu et al. TreePartNet: neural decomposition of point clouds for 3D tree reconstruction
Han et al. Weakly-supervised learning of category-specific 3D object shapes
Gao et al. Multi-level view associative convolution network for view-based 3D model retrieval
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN114067075A (en) Point cloud completion method and device based on generation of countermeasure network
CN112784782A (en) Three-dimensional object identification method based on multi-view double-attention network
Cao et al. Accurate 3-D reconstruction under IoT environments and its applications to augmented reality
Lee et al. Connectivity-based convolutional neural network for classifying point clouds
Lu et al. Large-scale tracking for images with few textures
Lee et al. Learning semantic correspondence exploiting an object-level prior
Zhou et al. Learning transferable and discriminative representations for 2D image-based 3D model retrieval
Lei et al. Mesh convolution with continuous filters for 3-d surface parsing
Mallis et al. From keypoints to object landmarks via self-training correspondence: A novel approach to unsupervised landmark discovery
CN111797269A (en) Multi-view three-dimensional model retrieval method based on multi-level view associated convolutional network
Yuan et al. SHREC 2020 track: 6D object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200904)