CN113240012A

CN113240012A - Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device

Info

Publication number: CN113240012A
Application number: CN202110529135.7A
Authority: CN
Inventors: 宋丹; 杨悦; 赵小倩; 刘安安; 聂为之
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2021-08-10
Anticipated expiration: 2041-05-14
Also published as: CN113240012B

Abstract

The invention discloses an unsupervised multi-view three-dimensional target retrieval method and a device based on two-dimensional images, wherein the method comprises the following steps: respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by using a characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target; according to the visual characteristics of the two-dimensional image, the label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target, the visual characteristics after cross-domain distribution alignment are obtained through domain confrontation learning; obtaining the visual characteristics of the multi-view three-dimensional target with more category difference according to the comparison learning and the obtained visual characteristics; through the iterative weighted updating of the memory base used for storing the representative view characteristics, high-quality negative samples are obtained and used for the comparison learning. The device comprises: the system comprises a feature extraction module, a domain confrontation learning module, an acquisition module and an updating module. The invention optimizes the retrieval performance of the retrieval frame and provides a negative sample with high enough quality for comparison learning.

Description

Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device

Technical Field

The invention relates to a multi-view target, and belongs to the field of multi-view three-dimensional target retrieval, self-supervision learning, contrast learning and domain adaptation based on images, in particular to an unsupervised multi-view three-dimensional target retrieval method and an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images.

Background

In recent years, the task of multi-view object retrieval has gradually become a promising research topic in line with the development of the high-speed information age, and has attracted more and more interest because it links the two-dimensional image data generated and propagated by the current mass and the future mass three-dimensional object data, and involves two different modalities. While many approaches have made great progress in multi-perspective three-dimensional object retrieval tasks, it remains challenging due to the gap between the two-dimensional and three-dimensional modalities.

The multi-view three-dimensional object retrieval task aims at searching a gallery for similar models for a given query model. Generally, existing multi-view three-dimensional target retrieval methods can be divided into three categories: model-based method^[1][2]View-based methods^[3][4]And feature fusion of the two types^[5]. The model-based method directly takes a multi-view three-dimensional target as input to generate three-dimensional characteristics containing multi-view three-dimensional target space and structural information. Three-dimensional representations of these methods have three main forms: mesh, point cloud, and voxel.

To mitigate the negative effects of domain gaps, a number of domain adaptation methods have been proposed. Typical methods for reducing the difference between the two domains can be divided into two categories: distance based metric^[6]And domain-based confrontation learning^[7]. The first type reduces domain differences by minimizing the statistical distance of the feature distribution. The second type of adoption originates in GAN^[8]Is trained against to limit the differences between the two domains.

Self-supervised learning first learns a general visual representation by building a relatively simple auxiliary task called an agent task, and then applies the learned representation to real downstream tasks, such as: object detection, classification and semantic segmentation. How to design an effective proxy task is crucial to the downstream task to work. Can be used forTo simply classify existing proxy tasks into two categories according to task type, for example: restoring an input image at a preset loss^[9][10]And a pseudo tag for forming an input image^[11][12]。

Although much work has been done in the field of image-based multi-view three-dimensional object retrieval, there are some shortcomings in the research for better zooming in the inter-modal distance and the inter-class distance. Based on the current situation, the challenges currently face mainly include the following two aspects:

1. how to better utilize the structural information of the unsupervised multi-view three-dimensional target;

2. how to more accurately perform inter-domain alignment and inter-class alignment between a three-dimensional domain and a two-dimensional domain.

Disclosure of Invention

The invention provides an unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images, which constructs a multi-view target retrieval network framework through a visual feature learning module, a resistance domain adaptation module, a contrast learning module and a retrieval module, extracts multi-view views of multi-view three-dimensional targets, and extracts visual features of the two-dimensional images and the multi-view three-dimensional targets; the cross-domain distribution alignment is realized by using label information of the two-dimensional image and domain confrontation learning; and contrast learning is utilized to enhance the representation capability of the multi-view three-dimensional target view and the heterogeneity of different multi-view three-dimensional targets, which is described in detail in the following description:

in a first aspect, a method for retrieving an unsupervised multi-view three-dimensional target based on a two-dimensional image includes:

respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by using a characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target;

according to the visual characteristics of the two-dimensional image, the label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target, the visual characteristics after cross-domain distribution alignment are obtained through domain confrontation learning;

obtaining the visual characteristics of the multi-view three-dimensional target with more category difference according to the comparison learning and the obtained visual characteristics;

through the iterative weighted updating of the memory base used for storing the representative view characteristics, high-quality negative samples are obtained and used for the comparison learning.

In an embodiment, the obtaining of the visual characteristics of the multi-view three-dimensional object with more category differences according to the comparison learning and the obtained visual characteristics specifically includes:

selecting view characteristics of an ith multi-view three-dimensional target as an anchor point, selecting other view characteristics as a positive sample, and selecting one view characteristic of other three-dimensional targets as a negative sample;

respectively calculating the similarity between the anchor point and the positive sample and the negative sample, and calculating the contrast loss based on 2 similarities;

and (3) combining the source classification loss, the domain confrontation loss and the contrast loss to jointly train a feature extractor, a domain discriminator, a classifier and nonlinear mapping to obtain a total loss function adaptive to the self-supervision domain.

In one embodiment, the similarity between the anchor point and the positive and negative samples is:

the similarity between the anchor point and the positive sample is calculated as follows:

wherein j is more than or equal to 1, j 'is more than or equal to N and j is not equal to j', s is the similarity, g (-) represents the mapping function for mapping the feature vector to the low-dimensional space;

for the low-dimensional visual features of the jth view of the ith sample of the target domain,

the low-dimensional visual characteristic of the jth view of the ith sample of the target domain, d is a function for calculating cosine distance, and tau is a linear scale factor used for adjusting the dynamic range of similarity;

the similarity calculation formula between the anchor point and the negative sample is as follows:

wherein j is more than or equal to 1 and less than or equal to N and i is not equal to i ', m is a representative view characteristic of the multi-view three-dimensional target, and i' is a serial number of the multi-view three-dimensional target.

In one embodiment, the memory library for storing the representative view features, which is updated by the iterative weighting, is specifically:

selecting the view with the highest classification prediction accuracy as a multi-view three-dimensional target

Will be based on the principle of entropy minimization

The N views are sent into a classifier G to generate K paths of classification results;

multi-view three-dimensional object

The prediction entropy calculation formula of the jth view of (1) is as follows:

wherein the content of the first and second substances,

representing multi-perspective three-dimensional objects

Is classified into a kth class; k is the total number of categories;

a representative view update formula is as follows:

wherein, the value range of mu is [0,1 ]]Is the update coefficient; m isⁱIs the view feature stored in the memory bank;

view features with minimal entropy;

is the low-dimensional visual feature of the view with the minimum entropy.

In a second aspect, an unsupervised multi-view three-dimensional object retrieving apparatus based on two-dimensional images, the apparatus comprising:

the characteristic extraction module is used for respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by utilizing the characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target;

the domain confrontation learning module is used for obtaining visual characteristics which are distributed and aligned in a cross-domain mode through domain confrontation learning according to the visual characteristics of the two-dimensional image, label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target;

the acquisition module is used for acquiring visual characteristics of the multi-view three-dimensional target with more category difference according to the comparison learning and the acquired visual characteristics;

and the updating module is used for obtaining high-quality negative samples through the memory base which is updated by iterative weighting and used for storing the representative view characteristics, and the negative samples are used for the comparative learning.

Wherein the acquisition module comprises:

the selecting submodule is used for selecting the view characteristics of the ith multi-view three-dimensional target as an anchor point, selecting other view characteristics as a positive sample, and selecting one view characteristic of other three-dimensional targets as a negative sample;

the calculation submodule is used for calculating the similarity between the anchor point and the positive sample and the similarity between the anchor point and the negative sample respectively and calculating the contrast loss based on 2 similarities;

and the obtaining submodule is used for combining the source classification loss, the domain confrontation loss and the contrast loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping to obtain the self-supervision domain adaptive total loss function.

In a third aspect, an unsupervised multi-view three-dimensional object retrieving apparatus based on two-dimensional images, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.

In a fourth aspect, a computer-readable storage medium, storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of the first aspects.

The technical scheme provided by the invention has the beneficial effects that:

1. the multi-view target retrieval network framework is constructed through a visual feature learning module, an antitarnish domain adaptation module, a contrast learning module and a retrieval module; rendering views of the multi-view three-dimensional target model from a plurality of views, and sending the two-dimensional images for retrieval and the rendered three-dimensional views into a feature extractor for feature extraction;

2. the method realizes inter-domain alignment of cross-domain distribution by using supervised label information of the two-dimensional image and domain confrontation learning; mapping the view to be retrieved by contrast learning to acquire the structural information of the multi-view three-dimensional target model, integrating the structural information into a memory library and optimizing the retrieval performance of a retrieval frame;

3. in the comparison learning, the memory base which can be updated by iterative weighting and is used for storing the representative view characteristics is utilized to provide a negative sample with high enough quality for the comparison learning;

4. the method can extract key features of the two-dimensional image and the multi-view of the multi-view three-dimensional target, and better realize the association and retrieval of the two-dimensional image and the multi-view three-dimensional target through technologies such as contrast learning, domain adaptation, self-supervision learning and the like.

Drawings

FIG. 1 is a flow chart of an unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images;

FIG. 2 is a schematic structural diagram of an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images;

FIG. 3 is a schematic diagram of an acquisition module;

fig. 4 is another structural diagram of an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

Example 1

An unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images, referring to fig. 1, comprises the following steps:

101: respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by using a characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target;

102: according to the visual characteristics of the two-dimensional image, the label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target, the visual characteristics after cross-domain distribution alignment are obtained through domain confrontation learning;

103: obtaining the visual characteristics of the multi-view three-dimensional target with more category difference according to the comparison learning and the obtained visual characteristics;

104: by means of a memory bank for storing representative view features, which can be iteratively weighted updated, negative examples of sufficient quality are obtained for the comparative learning in step 103 described above.

In conclusion, the unsupervised multi-view three-dimensional target retrieval method based on the two-dimensional image is realized through the visual feature learning module, the anti-domain adaptation module, the contrast learning module and the retrieval module, and the multi-view three-dimensional target retrieval precision is improved.

Example 2

The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:

201: respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by using a characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target;

wherein, the step 201 mainly comprises:

in the embodiment of the invention, N viewpoints are set, namely, a virtual camera is placed around the centroid of the multi-view three-dimensional target every 360/N degrees, and the viewpoints are completely and uniformly distributed around the target object. Different angle views of the multi-view three-dimensional target are obtained clockwise by selecting different interval angles, and a view sequence is generated to represent each multi-view three-dimensional target.

Then, the two-dimensional image and the multi-view set are input into a feature extractor CNN to obtain corresponding visual features. For multiple view features of each multi-perspective three-dimensional object, a pooling operation is applied to aggregate compact three-dimensional descriptors.

Wherein the source domain D is represented by_SIn n_sIndividual labeled two-dimensional image samples:

the target domain D is represented by_TIn n_tAn unsupervised multi-view three-dimensional target sample:

wherein the content of the first and second substances,

for the ith sample of the source domain,

for the ith sample of the target domain,

is the label of the ith sample of the source domain, X_SFor the source domain sample set, X_TIs a set of target domain samples, Y_SIs a labelset of source domain samples.

Representing a two-dimensional image by

The visual characteristics of (1):

wherein, F represents a feature extractor,

visual features of the extracted source domain two-dimensional image sample.

A multi-view set of a multi-view three-dimensional object is represented by:

wherein N is the number of views of each multi-view three-dimensional target,

is the jth view of the ith multi-view three-dimensional object of the object domain.

202: according to the visual characteristics of the two-dimensional image, the label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target, the visual characteristics after cross-domain distribution alignment are obtained through domain confrontation learning;

wherein, the step 202 mainly includes:

1. minimizing classification errors on the source domain:

characterizing the source domain f_SEmbedding the K-dimensional class probability vector into a classifier G to obtain a K-dimensional class probability vector p_S。

In order to make the feature extractor F and the classifier G more discriminative for the source domain samples, the prediction result is compared with the actual real classification label Y of the source domain sample_SFor comparison, the representation form of the class label is a one-hot (one-hot) form:

source domain classification loss function:

wherein L (-) represents a cross entropy loss function; d_SRepresenting a source domain; (x, y) two-dimensional images representing the source domain-pairs of label exemplars;

representing a sample

A predicted likelihood of being classified into a kth class; k represents the total number of categories;

a label representing a corresponding kth class of an ith sample of the source domain; e represents entropy; CE denotes cross entropy.

2. Minimize the difference between the two domains:

the distribution difference between the source domain and the target domain is narrowed by domain confrontation.

Characterizing the source domain f_SAnd target domain characteristics f_TInput into a domain discriminator D, using the domain discriminator D to discriminate whether the features are from the source domain or the target domain, the feature extractor F is trained to learn a domain-invariant feature representation to obfuscate the domain discriminator D.

When this countermeasure reaches equilibrium, the distributions of the source and target domains will be aligned, eliminating the domain gap.

Formally, the domain confrontation loss can be calculated by the following equation:

L_ADV(X_S,X_T)＝-E[log(D(f_S))]-E[log(1-D(f_T))] (6)

wherein f is_SRepresenting source domain features; f. of_TRepresenting a target domain feature; d (f)_S) Output results representing the discriminator on the source domain feature input; d (f)_T) Output results representing the discriminator on the target domain feature input; l is_ADVADV is challenge as a function of challenge loss.

203: according to the comparison learning and the obtained visual features, the visual features of the multi-view three-dimensional target with more category difference are obtained:

one three-dimensional target has a plurality of views, one view feature is selected as an anchor point (reference), another view feature of the three-dimensional target is selected as a positive sample, and one view feature of other three-dimensional targets is selected as a negative sample, specifically:

in order to enable the similarity of the view features of the same multi-view three-dimensional target to be far larger than the similarity of the view features of different multi-view three-dimensional targets, the view of the ith multi-view three-dimensional target is selected as an anchor point, and other view features of the multi-view three-dimensional target are selected as a positive sample.

the low-dimensional visual characteristics of the jth view of the ith sample of the target domain, d is a function for calculating the cosine distance, and tau is a linear scale factor for adjusting the dynamic range of similarity.

Based on the similarity defined by the above two equations, the contrast loss is calculated according to the following equation:

wherein M is a memory bank.

Bound source classification loss L_CEDomain antagonism loss L_ADVAnd the contrast loss L_CLJointly training the feature extractor F, the domain discriminator D, the classifier G, and the non-linear mapping G (-) with the following total loss function for the supervised domain adaptation:

L_total ＝L_CE(X_S,Y_S)+λ₁·L_ADV(X_S,X_T)+λ₂·L_CL(X_T) (10)

wherein λ is₁To balance the over-parameter of source classification loss, λ₂To balance out the over-parameters of source classification loss.

204: by using a memory library for storing representative view features that can be iteratively weighted and updated, a negative sample of sufficient quality is obtained for the contrast learning in step 203 above:

for the contrast learning method described in step 203, in which the selection of negative samples plays a key role, a memory base based on the principle of entropy minimization is designed for storing and updating representative view features.

Will be based on the principle of entropy minimization

The N views are sent to a classifier G to generate K-way classification results. Multi-view three-dimensional object

wherein the content of the first and second substances,

representing multi-perspective three-dimensional objects

Is classified into a kth class; k is the total number of categories.

Taking the view with the minimum entropy as the most representative view; the memory bank M contains the features of all multi-view three-dimensional objects of the object domain and updates them iteratively using the corresponding representative views. The negative examples for the contrast learning are randomly selected from the memory base, and the representative view update formula is as follows:

view features with minimal entropy;

is the low-dimensional visual feature of the view with the minimum entropy.

Example 3

An unsupervised multi-view three-dimensional target retrieval apparatus based on two-dimensional images, referring to fig. 2, the apparatus comprising:

the characteristic extraction module 1 is used for respectively extracting the characteristics of the two-dimensional image domain and the multi-view three-dimensional target domain by using a characteristic extractor to obtain the visual characteristics of the two-dimensional image and the multi-view three-dimensional target;

the domain confrontation learning module 2 is used for obtaining the visual characteristics which are distributed and aligned in a cross-domain mode through domain confrontation learning according to the visual characteristics of the two-dimensional image, the label information of the two-dimensional image and the visual characteristics of the multi-view three-dimensional target;

the acquisition module 3 is used for acquiring the visual characteristics of the multi-view three-dimensional target with more category difference according to the comparison learning and the acquired visual characteristics;

and the updating module 4 is used for obtaining high-quality negative samples through the memory base which is updated by iterative weighting and used for storing the representative view characteristics, and the negative samples are used for the comparative learning.

In one embodiment, referring to fig. 3, the obtaining module 3 comprises:

the selecting submodule 31 is used for selecting view characteristics of the ith multi-view three-dimensional target as an anchor point, selecting other view characteristics as a positive sample, and selecting one view characteristic of other three-dimensional targets as a negative sample;

the calculation submodule 32 is used for calculating the similarity between the anchor point and the positive sample and the similarity between the anchor point and the negative sample respectively, and calculating the contrast loss based on 2 similarities;

and the obtaining submodule 33 is used for combining the source classification loss, the domain confrontation loss and the contrast loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping to obtain the total loss function adaptive to the self-supervision domain.

It should be noted that the device description in the above embodiments corresponds to the description of the method embodiments, and the embodiments of the present invention are not described herein again.

The execution main bodies of the modules and units can be devices with calculation functions, such as a computer, a single chip microcomputer and a microcontroller, and in the specific implementation, the execution main bodies are not limited in the embodiment of the invention and are selected according to the requirements in practical application.

Based on the same inventive concept, an embodiment of the present invention further provides an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images, and referring to fig. 4, the device includes:

a processor 5 and a memory 6, the memory 6 having stored therein program instructions, the processor 5 calling upon the program instructions stored in the memory 6 to cause the apparatus to perform the following method steps in an embodiment:

In one embodiment, the memory library for storing the representative view features, which can be updated by iterative weighting, is embodied as:

Will be based on the principle of entropy minimization

The N views are fed into a classifier G, which generates KA way classification result;

multi-view three-dimensional object

wherein the content of the first and second substances,

representing multi-perspective three-dimensional objects

Is classified into a kth class; k is the total number of categories;

a representative view update formula is as follows:

view features with minimal entropy;

is the low-dimensional visual feature of the view with the minimum entropy.

It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.

The execution main bodies of the processor 5 and the memory 6 may be devices having a calculation function, such as a computer, a single chip, a microcontroller, and the like, and in the specific implementation, the execution main bodies are not limited in the embodiment of the present invention, and are selected according to the needs in the practical application.

The memory 6 and the processor 5 transmit data signals through the bus 7, which is not described in detail in the embodiment of the present invention.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.

The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.

It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer.

The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.

Reference to the literature

[1]Z.Wu,S.Song,A.Khosla,F.Yu,L.Zhang,X.Tang,and J.Xiao,3d shapenets:A deep representation for volumetric shapes,in CVPR.IEEE Computer Society,2015,pp.1912–1920.

[2]C.R.Qi,H.Su,K.Mo,and L.J.Guibas,“Pointnet:Deep learning on point sets for 3d classification and segmentation,”in CVPR.IEEE Computer Society,2017,pp.77–85.

[3]H.Su,S.Maji,E.Kalogerakis,and E.G.Learned-Miller,Multi-view convolutional neural networks for 3d shape recognition,in ICCV.IEEE Computer Society,2015,pp.945–953.

[4]Y.Feng,Z.Zhang,X.Zhao,R.Ji,and Y.Gao,GVCNN:group-view convolutional neural networks for 3d shape recognition,in CVPR.IEEE Computer Society,2018,pp.264–272.

[5]B.Gong,C.Yan,J.Bai,C.Zou,and Y.Gao,Hamming embedding sensitivity guided fusion network for 3d shape representation,IEEE Trans.Image Process.,vol.29,pp.8381–8390,2020.[Online].Available:https://doi.org/10.1109/TIP.2020.3013138

[6]S.J.Pan,I.W.Tsang,J.T.Kwok,and Q.Yang,“Domain adaptation via transfer component analysis,”IEEE Trans.Neural Networks,vol.22,no.2,pp.199–210,2011.

[7]A.Chadha and Y.Andreopoulos,“Improved techniques for adversarial discriminative domain adaptation,”IEEE Trans.Image Process.,vol.29,pp.2622–2637,2020.

[8]I.J.Goodfellow,J.Pouget-Abadie,M.Mirza,B.Xu,D.Warde-Farley,S.Ozair,A.C.Courville,and Y.Bengio,“Generative adversarial nets,”in NIPS,2014,pp.2672–2680

[9]R.Zhang,P.Isola,and A.A.Efros,“Colorful image colorization,”in ECCV(3),ser.Lecture Notes in Computer Science,vol.9907.Springer,2016,pp.649–666.

[10]D.Pathak,P.Kr..ahenb..uhl,J.Donahue,T.Darrell,and A.A.Efros,“Context encoders:Feature learning by inpainting,”in CVPR.IEEE Computer Society,2016,pp.2536–2544.

[11]S.Gidaris,P.Singh,and N.Komodakis,“Unsupervised representation learning by predicting image rotations,”in ICLR (Poster).OpenRe-view.net,2018.

[12]D.Kim,D.Cho,D.Yoo,and I.S.Kweon,“Learning image repre-sentations by completing damaged jigsaw puzzles,”in WACV.IEEE Computer Society,2018,pp.793–802.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images is characterized by comprising the following steps:

2. The method according to claim 1, wherein the obtaining of the visual characteristics of the multi-view three-dimensional object with more category differences according to the comparison learning and the obtained visual characteristics is specifically:

3. The unsupervised multi-view three-dimensional target retrieval method based on the two-dimensional image as claimed in claim 1, wherein the similarity between the anchor point and the positive sample and the similarity between the anchor point and the negative sample are respectively as follows:

4. The two-dimensional image-based unsupervised multi-view three-dimensional object retrieval method according to claim 1, wherein the memory library for storing the representative view features, which is updated by iterative weighting, is specifically:

Will be based on the principle of entropy minimization

multi-view three-dimensional object

wherein the content of the first and second substances,

representing multi-perspective three-dimensional objects

Is classified into a kth class; k is the total number of categories;

a representative view update formula is as follows:

view features with minimal entropy;

is the low-dimensional visual feature of the view with the minimum entropy.

5. An unsupervised multi-view three-dimensional object retrieval device based on two-dimensional images, the device comprising:

6. The two-dimensional image-based unsupervised multi-view three-dimensional target retrieval device according to claim 5, wherein the acquiring module comprises:

7. An unsupervised multi-view three-dimensional object retrieval device based on two-dimensional images, the device comprising:

a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-6.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-6.