CN113240012A - Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device - Google Patents

Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device Download PDF

Info

Publication number
CN113240012A
CN113240012A (application CN202110529135.7A)
Authority
CN
China
Prior art keywords
view
dimensional
domain
target
visual characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110529135.7A
Other languages
Chinese (zh)
Other versions
CN113240012B (en)
Inventor
宋丹
杨悦
赵小倩
刘安安
聂为之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110529135.7A priority Critical patent/CN113240012B/en
Publication of CN113240012A publication Critical patent/CN113240012A/en
Application granted granted Critical
Publication of CN113240012B publication Critical patent/CN113240012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised multi-view three-dimensional target retrieval method and device based on two-dimensional images, wherein the method comprises the following steps: extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets; obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets; obtaining visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features; and obtaining high-quality negative samples for the contrastive learning through an iteratively weight-updated memory bank that stores representative view features. The device comprises: a feature extraction module, a domain adversarial learning module, an acquisition module and an updating module. The invention optimizes the retrieval performance of the retrieval framework and provides negative samples of sufficiently high quality for the contrastive learning.

Description

Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
Technical Field
The invention relates to the fields of image-based multi-view three-dimensional target retrieval, self-supervised learning, contrastive learning and domain adaptation, and in particular to an unsupervised multi-view three-dimensional target retrieval method and device based on two-dimensional images.
Background
In recent years, multi-view target retrieval has gradually become a promising research topic in step with the development of the high-speed information age. It has attracted growing interest because it links the massive two-dimensional image data currently generated and propagated with the massive three-dimensional target data of the future, spanning two different modalities. While many approaches have made great progress on the multi-view three-dimensional target retrieval task, it remains challenging due to the gap between the two-dimensional and three-dimensional modalities.
The multi-view three-dimensional target retrieval task aims to search a gallery for models similar to a given query model. Generally, existing multi-view three-dimensional target retrieval methods can be divided into three categories: model-based methods [1][2], view-based methods [3][4], and fusions of the two types of features [5]. Model-based methods directly take the multi-view three-dimensional target as input to generate three-dimensional features containing its spatial and structural information. The three-dimensional representations used by these methods have three main forms: mesh, point cloud, and voxel.
To mitigate the negative effects of the domain gap, a number of domain adaptation methods have been proposed. Typical methods for reducing the difference between the two domains can be divided into two categories: distance-based metrics [6] and domain adversarial learning [7]. The first type reduces domain differences by minimizing the statistical distance between the feature distributions. The second type adopts adversarial training, originating from GAN [8], to limit the differences between the two domains.
Self-supervised learning first learns a general visual representation by constructing a relatively simple auxiliary task, called a proxy task, and then applies the learned representation to real downstream tasks such as object detection, classification and semantic segmentation. Designing an effective proxy task is crucial for the downstream task to work. Existing proxy tasks can be simply divided into two categories according to task type: restoring the input image under a preset loss [9][10], and generating pseudo labels for the input image [11][12].
Although much work has been done in the field of image-based multi-view three-dimensional target retrieval, existing research still falls short of adequately narrowing the inter-modality distance and the inter-class distance. Given this situation, the current challenges mainly include the following two aspects:
1. How to better utilize the structural information of unsupervised multi-view three-dimensional targets;
2. How to more accurately perform inter-domain alignment and inter-class alignment between the three-dimensional domain and the two-dimensional domain.
Disclosure of Invention
The invention provides an unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images. It constructs a multi-view target retrieval network framework from a visual feature learning module, an adversarial domain adaptation module, a contrastive learning module and a retrieval module; extracts multi-view renderings of the multi-view three-dimensional targets, and extracts the visual features of the two-dimensional images and the multi-view three-dimensional targets; realizes cross-domain distribution alignment by using the label information of the two-dimensional images and domain adversarial learning; and utilizes contrastive learning to enhance the representation capability of the multi-view three-dimensional target views and the separability of different multi-view three-dimensional targets, as described in detail below:
In a first aspect, an unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images includes:
extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
obtaining visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
obtaining high-quality negative samples for the contrastive learning through an iteratively weight-updated memory bank that stores representative view features.
In an embodiment, obtaining the visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features specifically comprises:
selecting a view feature of the ith multi-view three-dimensional target as an anchor, selecting another of its view features as a positive sample, and selecting a view feature of another three-dimensional target as a negative sample;
calculating the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculating the contrastive loss based on the two similarities;
combining the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
In one embodiment, the similarities between the anchor and the positive and negative samples are:
the similarity between the anchor and the positive sample is calculated as follows:
$$s\big(z_i^j, z_i^{j'}\big) = \frac{d\big(z_i^j, z_i^{j'}\big)}{\tau}$$
where $1 \le j, j' \le N$ and $j \ne j'$; $s$ is the similarity; $g(\cdot)$ denotes the mapping function that maps a feature vector to a low-dimensional space; $z_i^j = g(f_i^j)$ is the low-dimensional visual feature of the jth view of the ith sample of the target domain; $z_i^{j'}$ is the low-dimensional visual feature of the j'th view of the ith sample of the target domain; $d$ is a function computing the cosine distance; and $\tau$ is a linear scale factor used to adjust the dynamic range of the similarity;
the similarity between the anchor and the negative sample is calculated as follows:
$$s\big(z_i^j, m_{i'}\big) = \frac{d\big(z_i^j, m_{i'}\big)}{\tau}$$
where $1 \le j \le N$ and $i \ne i'$; $m_{i'}$ is the representative view feature of the i'th multi-view three-dimensional target, and $i'$ is the index of the multi-view three-dimensional target.
In one embodiment, the iteratively weight-updated memory bank for storing representative view features is specifically:
based on the principle of entropy minimization, the view with the most confident classification prediction is selected as the representative of a multi-view three-dimensional target $x_i^T$: the N views of $x_i^T$ are fed into the classifier G to generate K-way classification results;
the prediction entropy of the jth view of the multi-view three-dimensional target $x_i^T$ is calculated as:
$$E(v_i^j) = -\sum_{k=1}^{K} p_{i,j}^k \log p_{i,j}^k$$
where $p_{i,j}^k$ denotes the predicted probability that the jth view of the multi-view three-dimensional target $x_i^T$ is classified into the kth class, and $K$ is the total number of classes;
the representative view update formula is:
$$m_i \leftarrow \mu\, m_i + (1 - \mu)\, z_i^{\hat{j}}$$
where $\mu \in [0, 1]$ is the update coefficient; $m_i$ is the view feature stored in the memory bank; $\hat{j} = \arg\min_j E(v_i^j)$ indexes the view feature with minimal entropy; and $z_i^{\hat{j}} = g(f_i^{\hat{j}})$ is the low-dimensional visual feature of the minimum-entropy view.
In a second aspect, an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images comprises:
a feature extraction module, configured to extract features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
a domain adversarial learning module, configured to obtain cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
an acquisition module, configured to obtain visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
an updating module, configured to obtain high-quality negative samples for the contrastive learning through the iteratively weight-updated memory bank that stores representative view features.
Wherein the acquisition module comprises:
a selecting submodule, configured to select a view feature of the ith multi-view three-dimensional target as an anchor, select another of its view features as a positive sample, and select a view feature of another three-dimensional target as a negative sample;
a calculating submodule, configured to calculate the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculate the contrastive loss based on the two similarities;
an obtaining submodule, configured to combine the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
In a third aspect, an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images comprises: a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform the method steps of any implementation of the first aspect.
In a fourth aspect, a computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any implementation of the first aspect.
The technical scheme provided by the invention has the following beneficial effects:
1. A multi-view target retrieval network framework is constructed from a visual feature learning module, an adversarial domain adaptation module, a contrastive learning module and a retrieval module; views of the multi-view three-dimensional target model are rendered from multiple viewpoints, and the two-dimensional images used for retrieval and the rendered three-dimensional views are fed into the feature extractor for feature extraction;
2. The method realizes inter-domain alignment of the cross-domain distributions by using the supervised label information of the two-dimensional images and domain adversarial learning; contrastive learning maps the views to be retrieved so as to capture the structural information of the multi-view three-dimensional target model, which is integrated into a memory bank to optimize the retrieval performance of the retrieval framework;
3. In the contrastive learning, an iteratively weight-updated memory bank for storing representative view features provides negative samples of sufficiently high quality for the contrastive learning;
4. The method can extract key features from the two-dimensional images and the multiple views of the multi-view three-dimensional targets, and better realizes their association and retrieval through contrastive learning, domain adaptation, self-supervised learning and other techniques.
Drawings
FIG. 1 is a flow chart of an unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images;
FIG. 2 is a schematic structural diagram of an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images;
FIG. 3 is a schematic diagram of an acquisition module;
fig. 4 is another structural diagram of an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
An unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images, referring to fig. 1, comprises the following steps:
101: extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
102: obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
103: obtaining visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
104: obtaining negative samples of sufficient quality for the contrastive learning in step 103 through an iteratively weight-updated memory bank that stores representative view features.
In conclusion, the embodiment of the present invention realizes unsupervised multi-view three-dimensional target retrieval based on two-dimensional images through a visual feature learning module, an adversarial domain adaptation module, a contrastive learning module and a retrieval module, improving the retrieval precision of multi-view three-dimensional targets.
Example 2
The scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
201: Extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets.
Step 201 mainly comprises:
In the embodiment of the invention, N viewpoints are set: a virtual camera is placed every 360/N degrees around the centroid of the multi-view three-dimensional target, so that the viewpoints are uniformly distributed around the target object. Views of the multi-view three-dimensional target at different angles are captured clockwise at the chosen interval angle, and a view sequence is generated to represent each multi-view three-dimensional target.
Then, the two-dimensional images and the multi-view sets are input into the feature extractor CNN to obtain the corresponding visual features. For the multiple view features of each multi-view three-dimensional target, a pooling operation is applied to aggregate them into a compact three-dimensional descriptor.
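A minimal Python (PyTorch) sketch of this extraction-and-pooling step follows. The ResNet-18 backbone, the view count N = 12, and all function names are illustrative assumptions rather than details fixed by the method:

    # Illustrative sketch only; backbone, N_VIEWS and tensor shapes are assumptions.
    import torch
    import torchvision.models as models

    N_VIEWS = 12  # assumed number of views rendered around each 3D target

    # Shared feature extractor F for both domains (classification head removed).
    backbone = models.resnet18(weights=None)
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

    def extract_image_features(images):
        """Source domain: (B, 3, H, W) 2D images -> visual features (B, 512)."""
        return feature_extractor(images).flatten(1)

    def extract_view_features(views):
        """Target domain: (B, N, 3, H, W) view sets -> per-view features (B, N, 512)."""
        b, n, c, h, w = views.shape
        f = feature_extractor(views.reshape(b * n, c, h, w)).flatten(1)
        return f.reshape(b, n, -1)

    def aggregate_descriptor(view_features):
        """Pool over the view axis into one compact 3D descriptor per target."""
        return view_features.max(dim=1).values  # (B, 512), MVCNN-style max-pooling

Max-pooling across views is one common choice for the pooling operation mentioned above (as in MVCNN [3]); the patent does not fix the pooling type.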
The source domain $D_S$ is represented by $n_s$ labeled two-dimensional image samples:
$$D_S = \{(x_i^S, y_i^S)\}_{i=1}^{n_s} \qquad (1)$$
The target domain $D_T$ is represented by $n_t$ unsupervised multi-view three-dimensional target samples:
$$D_T = \{x_i^T\}_{i=1}^{n_t} \qquad (2)$$
where $x_i^S$ is the ith sample of the source domain, $x_i^T$ is the ith sample of the target domain, $y_i^S$ is the label of the ith sample of the source domain, $X_S$ is the source-domain sample set, $X_T$ is the target-domain sample set, and $Y_S$ is the label set of the source-domain samples.
The visual feature of a two-dimensional image $x_i^S$ is represented by:
$$f_i^S = F(x_i^S) \qquad (3)$$
where $F$ denotes the feature extractor and $f_i^S$ is the extracted visual feature of the source-domain two-dimensional image sample.
The multi-view set of a multi-view three-dimensional target is represented by:
$$x_i^T = \{v_i^1, v_i^2, \ldots, v_i^N\} \qquad (4)$$
where $N$ is the number of views of each multi-view three-dimensional target and $v_i^j$ is the jth view of the ith multi-view three-dimensional target of the target domain.
202: Obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets.
Step 202 mainly comprises:
1. Minimizing the classification error on the source domain:
The source-domain feature $f_S$ is embedded into the classifier G to obtain a K-dimensional class probability vector $p_S$.
In order to make the feature extractor F and the classifier G more discriminative for source-domain samples, the prediction is compared with the actual classification labels $Y_S$ of the source-domain samples; the class labels are represented in one-hot form.
The source-domain classification loss function is:
$$L_{CE}(X_S, Y_S) = \mathbb{E}_{(x,y) \sim D_S}\, L\big(G(F(x)), y\big) = -\mathbb{E}_{(x,y) \sim D_S} \sum_{k=1}^{K} y^k \log p^k \qquad (5)$$
where $L(\cdot)$ denotes the cross-entropy loss function; $D_S$ denotes the source domain; $(x, y)$ denotes a two-dimensional image-label pair of the source domain; $p^k$ denotes the predicted likelihood that a sample is classified into the kth class; $K$ denotes the total number of classes; $y^k$ denotes the kth component of the one-hot label of the sample; $\mathbb{E}$ denotes expectation; CE denotes cross entropy.
2. Minimizing the difference between the two domains:
The distribution difference between the source domain and the target domain is narrowed by domain adversarial training.
The source-domain feature $f_S$ and the target-domain feature $f_T$ are input into a domain discriminator D. The discriminator D learns to judge whether a feature comes from the source domain or the target domain, while the feature extractor F is trained to learn a domain-invariant feature representation that confuses D.
When this adversarial game reaches equilibrium, the distributions of the source and target domains are aligned, eliminating the domain gap.
Formally, the domain adversarial loss is calculated as:
$$L_{ADV}(X_S, X_T) = -\mathbb{E}[\log D(f_S)] - \mathbb{E}[\log(1 - D(f_T))] \qquad (6)$$
where $f_S$ denotes the source-domain features; $f_T$ denotes the target-domain features; $D(f_S)$ and $D(f_T)$ denote the discriminator outputs on the source-domain and target-domain feature inputs; $L_{ADV}$ is the adversarial loss function, with ADV standing for "adversarial".
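To make equations (5) and (6) concrete, here is a hedged sketch of the two losses; the discriminator architecture, dimensions and class count are assumptions, and in practice the adversarial min-max would be realized with a gradient reversal layer or alternating updates, which the patent does not specify:

    # Sketch of the source classification loss (5) and domain adversarial
    # loss (6); K, FEAT_DIM and the layer sizes are assumed placeholders.
    import torch
    import torch.nn as nn

    K = 10          # assumed number of classes
    FEAT_DIM = 512  # assumed feature dimension from the extractor F

    classifier = nn.Linear(FEAT_DIM, K)            # classifier G
    discriminator = nn.Sequential(                 # domain discriminator D
        nn.Linear(FEAT_DIM, 256), nn.ReLU(),
        nn.Linear(256, 1), nn.Sigmoid())

    def classification_loss(f_s, y_s):
        """Eq. (5): cross entropy of G(f_S) against integer class labels y_s
        (equivalent to the one-hot form written in the text)."""
        return nn.functional.cross_entropy(classifier(f_s), y_s)

    def adversarial_loss(f_s, f_t):
        """Eq. (6): D is pushed to score source features 1 and target features 0."""
        eps = 1e-8  # numerical stability
        return (-torch.log(discriminator(f_s) + eps).mean()
                - torch.log(1.0 - discriminator(f_t) + eps).mean())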
203: Obtaining the visual features of the multi-view three-dimensional targets with greater inter-class separation according to contrastive learning and the obtained visual features.
A three-dimensional target has multiple views. One view feature is selected as the anchor (reference), another view feature of the same three-dimensional target is selected as a positive sample, and a view feature of another three-dimensional target is selected as a negative sample. Specifically:
In order to make the similarity between view features of the same multi-view three-dimensional target far larger than the similarity between view features of different multi-view three-dimensional targets, a view of the ith multi-view three-dimensional target is selected as the anchor, and another view feature of the same target is selected as a positive sample.
The similarity between the anchor and the positive sample is calculated as follows:
$$s\big(z_i^j, z_i^{j'}\big) = \frac{d\big(z_i^j, z_i^{j'}\big)}{\tau} \qquad (7)$$
where $1 \le j, j' \le N$ and $j \ne j'$; $s$ is the similarity; $g(\cdot)$ denotes the mapping function that maps a feature vector to a low-dimensional space; $z_i^j = g(f_i^j)$ is the low-dimensional visual feature of the jth view of the ith sample of the target domain; $z_i^{j'}$ is the low-dimensional visual feature of the j'th view of the ith sample of the target domain; $d$ is a function computing the cosine distance; and $\tau$ is a linear scale factor used to adjust the dynamic range of the similarity.
The similarity between the anchor and the negative sample is calculated as follows:
$$s\big(z_i^j, m_{i'}\big) = \frac{d\big(z_i^j, m_{i'}\big)}{\tau} \qquad (8)$$
where $1 \le j \le N$ and $i \ne i'$; $m_{i'}$ is the representative view feature of the i'th multi-view three-dimensional target, and $i'$ is the index of the multi-view three-dimensional target.
Based on the similarities defined by the above two equations, the contrastive loss is calculated according to the following equation:
$$L_{CL}(X_T) = -\mathbb{E}\left[\log \frac{\exp\big(s(z_i^j, z_i^{j'})\big)}{\exp\big(s(z_i^j, z_i^{j'})\big) + \sum_{m_{i'} \in M,\, i' \ne i} \exp\big(s(z_i^j, m_{i'})\big)}\right] \qquad (9)$$
where M is the memory bank.
Combining the source classification loss $L_{CE}$, the domain adversarial loss $L_{ADV}$ and the contrastive loss $L_{CL}$, the feature extractor F, the domain discriminator D, the classifier G and the nonlinear mapping $g(\cdot)$ are jointly trained with the following total loss function for self-supervised domain adaptation:
$$L_{total} = L_{CE}(X_S, Y_S) + \lambda_1 \cdot L_{ADV}(X_S, X_T) + \lambda_2 \cdot L_{CL}(X_T) \qquad (10)$$
where $\lambda_1$ is the hyper-parameter balancing the domain adversarial loss and $\lambda_2$ is the hyper-parameter balancing the contrastive loss.
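The following sketch, continuing the assumptions above, shows one way equations (7)-(10) could be realized, with cosine similarity scaled by an assumed temperature τ and an InfoNCE-style ratio; the τ and λ values and the projection-head sizes are placeholder assumptions, not values fixed by the patent:

    # Sketch of the contrastive loss (9) and total loss (10); TAU and the
    # lambda weights are assumed values, g(.) is a small projection head.
    import torch
    import torch.nn.functional as F

    TAU = 0.07                    # assumed linear scale factor tau
    LAMBDA1, LAMBDA2 = 1.0, 0.5   # assumed balancing hyper-parameters

    projection = torch.nn.Sequential(   # nonlinear mapping g(.)
        torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))

    def contrastive_loss(anchor_f, positive_f, memory_bank, i):
        """Eq. (9): anchor/positive are raw features of two views of target i;
        negatives are the representative features of all other targets in M."""
        z_a = F.normalize(projection(anchor_f), dim=-1)
        z_p = F.normalize(projection(positive_f), dim=-1)
        pos = torch.exp((z_a * z_p).sum(-1) / TAU)               # eq. (7)
        negatives = torch.cat([memory_bank[:i], memory_bank[i + 1:]])
        neg = torch.exp(z_a @ negatives.t() / TAU).sum(-1)       # eq. (8)
        return -torch.log(pos / (pos + neg)).mean()

    # Total objective, eq. (10):
    # L_total = classification_loss(f_s, y_s) \
    #           + LAMBDA1 * adversarial_loss(f_s, f_t) \
    #           + LAMBDA2 * contrastive_loss(anchor_f, positive_f, M, i)

In this sketch the memory bank stores already-projected low-dimensional features, which matches the update rule of equation (12) below.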
204: Through the iteratively weight-updated memory bank for storing representative view features, negative samples of sufficient quality are obtained for the contrastive learning in step 203.
For the contrastive learning method described in step 203, in which the selection of negative samples plays a key role, a memory bank based on the principle of entropy minimization is designed to store and update representative view features.
Based on the principle of entropy minimization, the view with the most confident classification prediction is selected as the representative of a multi-view three-dimensional target $x_i^T$: the N views of $x_i^T$ are fed into the classifier G to generate K-way classification results. The prediction entropy of the jth view of the multi-view three-dimensional target $x_i^T$ is calculated as:
$$E(v_i^j) = -\sum_{k=1}^{K} p_{i,j}^k \log p_{i,j}^k \qquad (11)$$
where $p_{i,j}^k$ denotes the predicted probability that the jth view of the multi-view three-dimensional target $x_i^T$ is classified into the kth class, and $K$ is the total number of classes.
The view with the minimum entropy is taken as the most representative view. The memory bank M contains the features of all multi-view three-dimensional targets of the target domain and updates them iteratively using the corresponding representative views. The negative samples for the contrastive learning are randomly selected from the memory bank, and the representative view update formula is:
$$m_i \leftarrow \mu\, m_i + (1 - \mu)\, z_i^{\hat{j}} \qquad (12)$$
where $\mu \in [0, 1]$ is the update coefficient; $m_i$ is the view feature stored in the memory bank; $\hat{j} = \arg\min_j E(v_i^j)$ indexes the view feature with minimal entropy; and $z_i^{\hat{j}} = g(f_i^{\hat{j}})$ is the low-dimensional visual feature of the minimum-entropy view.
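A minimal sketch of the memory-bank maintenance of equations (11) and (12), reusing the classifier and projection head defined in the sketches above; the update coefficient value is an assumed placeholder:

    # Sketch of the entropy-based memory-bank update, eqs. (11)-(12);
    # MU is an assumed placeholder, classifier/projection come from above.
    import torch

    MU = 0.5  # assumed update coefficient mu in [0, 1]

    @torch.no_grad()
    def update_memory_bank(memory_bank, view_features, i):
        """view_features: (N, FEAT_DIM) raw features of the N views of target i."""
        probs = torch.softmax(classifier(view_features), dim=-1)   # K-way results
        entropy = -(probs * torch.log(probs + 1e-8)).sum(-1)       # eq. (11)
        j_hat = entropy.argmin()               # most confident (representative) view
        z_hat = torch.nn.functional.normalize(
            projection(view_features[j_hat]), dim=-1)              # low-dim feature
        memory_bank[i] = MU * memory_bank[i] + (1 - MU) * z_hat    # eq. (12)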
Example 3
An unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images, referring to fig. 2, comprises:
a feature extraction module 1, configured to extract features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
a domain adversarial learning module 2, configured to obtain cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
an acquisition module 3, configured to obtain visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
an updating module 4, configured to obtain high-quality negative samples for the contrastive learning through the iteratively weight-updated memory bank that stores representative view features.
In one embodiment, referring to fig. 3, the acquisition module 3 comprises:
a selecting submodule 31, configured to select a view feature of the ith multi-view three-dimensional target as an anchor, select another of its view features as a positive sample, and select a view feature of another three-dimensional target as a negative sample;
a calculating submodule 32, configured to calculate the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculate the contrastive loss based on the two similarities;
an obtaining submodule 33, configured to combine the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
It should be noted that the device description in the above embodiment corresponds to the method embodiment, and is not repeated here.
The modules and units may be executed by any device with computing capability, such as a computer, single-chip microcomputer or microcontroller; the embodiment of the present invention does not limit the specific choice, which is made according to the needs of the practical application.
Based on the same inventive concept, an embodiment of the present invention further provides an unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images. Referring to fig. 4, the device comprises:
a processor 5 and a memory 6, the memory 6 storing program instructions, the processor 5 calling the program instructions stored in the memory 6 to cause the device to perform the following method steps:
extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
obtaining visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
obtaining high-quality negative samples for the contrastive learning through an iteratively weight-updated memory bank that stores representative view features.
In an embodiment, obtaining the visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features specifically comprises:
selecting a view feature of the ith multi-view three-dimensional target as an anchor, selecting another of its view features as a positive sample, and selecting a view feature of another three-dimensional target as a negative sample;
calculating the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculating the contrastive loss based on the two similarities;
combining the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
In one embodiment, the similarities between the anchor and the positive and negative samples are:
the similarity between the anchor and the positive sample is calculated as follows:
$$s\big(z_i^j, z_i^{j'}\big) = \frac{d\big(z_i^j, z_i^{j'}\big)}{\tau}$$
where $1 \le j, j' \le N$ and $j \ne j'$; $s$ is the similarity; $g(\cdot)$ denotes the mapping function that maps a feature vector to a low-dimensional space; $z_i^j = g(f_i^j)$ is the low-dimensional visual feature of the jth view of the ith sample of the target domain; $z_i^{j'}$ is the low-dimensional visual feature of the j'th view of the ith sample of the target domain; $d$ is a function computing the cosine distance; and $\tau$ is a linear scale factor used to adjust the dynamic range of the similarity;
the similarity between the anchor and the negative sample is calculated as follows:
$$s\big(z_i^j, m_{i'}\big) = \frac{d\big(z_i^j, m_{i'}\big)}{\tau}$$
where $1 \le j \le N$ and $i \ne i'$; $m_{i'}$ is the representative view feature of the i'th multi-view three-dimensional target, and $i'$ is the index of the multi-view three-dimensional target.
In one embodiment, the iteratively weight-updated memory bank for storing representative view features is specifically:
based on the principle of entropy minimization, the view with the most confident classification prediction is selected as the representative of a multi-view three-dimensional target $x_i^T$: the N views of $x_i^T$ are fed into the classifier G to generate K-way classification results;
the prediction entropy of the jth view of the multi-view three-dimensional target $x_i^T$ is calculated as:
$$E(v_i^j) = -\sum_{k=1}^{K} p_{i,j}^k \log p_{i,j}^k$$
where $p_{i,j}^k$ denotes the predicted probability that the jth view of the multi-view three-dimensional target $x_i^T$ is classified into the kth class, and $K$ is the total number of classes;
the representative view update formula is:
$$m_i \leftarrow \mu\, m_i + (1 - \mu)\, z_i^{\hat{j}}$$
where $\mu \in [0, 1]$ is the update coefficient; $m_i$ is the view feature stored in the memory bank; $\hat{j} = \arg\min_j E(v_i^j)$ indexes the view feature with minimal entropy; and $z_i^{\hat{j}} = g(f_i^{\hat{j}})$ is the low-dimensional visual feature of the minimum-entropy view.
It should be noted that the device description in the above embodiment corresponds to the method description and is not repeated here.
The processor 5 and the memory 6 may be realized by any device with computing capability, such as a computer, single-chip microcomputer or microcontroller; the embodiment of the present invention does not limit the specific choice, which is made according to the needs of the practical application.
The memory 6 and the processor 5 transmit data signals through a bus 7, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part.
The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on, or transmitted via, a computer-readable storage medium. The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium, a semiconductor medium, or the like.
References
[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in CVPR. IEEE Computer Society, 2015, pp. 1912–1920.
[2] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in CVPR. IEEE Computer Society, 2017, pp. 77–85.
[3] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in ICCV. IEEE Computer Society, 2015, pp. 945–953.
[4] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, "GVCNN: Group-view convolutional neural networks for 3D shape recognition," in CVPR. IEEE Computer Society, 2018, pp. 264–272.
[5] B. Gong, C. Yan, J. Bai, C. Zou, and Y. Gao, "Hamming embedding sensitivity guided fusion network for 3D shape representation," IEEE Trans. Image Process., vol. 29, pp. 8381–8390, 2020. [Online]. Available: https://doi.org/10.1109/TIP.2020.3013138
[6] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
[7] A. Chadha and Y. Andreopoulos, "Improved techniques for adversarial discriminative domain adaptation," IEEE Trans. Image Process., vol. 29, pp. 2622–2637, 2020.
[8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672–2680.
[9] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in ECCV (3), ser. Lecture Notes in Computer Science, vol. 9907. Springer, 2016, pp. 649–666.
[10] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in CVPR. IEEE Computer Society, 2016, pp. 2536–2544.
[11] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," in ICLR (Poster). OpenReview.net, 2018.
[12] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, "Learning image representations by completing damaged jigsaw puzzles," in WACV. IEEE Computer Society, 2018, pp. 793–802.
In the embodiments of the present invention, unless a device model is specifically described, the models of the devices are not limited, as long as the devices can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the above embodiments of the present invention are provided for description only and do not represent the relative merits of the embodiments.
The above description is only of preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall be included in the protection scope of the invention.

Claims (8)

1. An unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images, characterized by comprising the following steps:
extracting features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
obtaining cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
obtaining visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
obtaining high-quality negative samples for the contrastive learning through an iteratively weight-updated memory bank that stores representative view features.
2. The method according to claim 1, wherein obtaining the visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features specifically comprises:
selecting a view feature of the ith multi-view three-dimensional target as an anchor, selecting another of its view features as a positive sample, and selecting a view feature of another three-dimensional target as a negative sample;
calculating the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculating the contrastive loss based on the two similarities;
combining the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
3. The unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images according to claim 1, wherein the similarities between the anchor and the positive sample and between the anchor and the negative sample are respectively:
the similarity between the anchor and the positive sample is calculated as follows:
$$s\big(z_i^j, z_i^{j'}\big) = \frac{d\big(z_i^j, z_i^{j'}\big)}{\tau}$$
wherein $1 \le j, j' \le N$ and $j \ne j'$; $s$ is the similarity; $g(\cdot)$ denotes the mapping function that maps a feature vector to a low-dimensional space; $z_i^j = g(f_i^j)$ is the low-dimensional visual feature of the jth view of the ith sample of the target domain; $z_i^{j'}$ is the low-dimensional visual feature of the j'th view of the ith sample of the target domain; $d$ is a function computing the cosine distance; and $\tau$ is a linear scale factor used to adjust the dynamic range of the similarity;
the similarity between the anchor and the negative sample is calculated as follows:
$$s\big(z_i^j, m_{i'}\big) = \frac{d\big(z_i^j, m_{i'}\big)}{\tau}$$
wherein $1 \le j \le N$ and $i \ne i'$; $m_{i'}$ is the representative view feature of the i'th multi-view three-dimensional target, and $i'$ is the index of the multi-view three-dimensional target.
4. The unsupervised multi-view three-dimensional target retrieval method based on two-dimensional images according to claim 1, wherein the iteratively weight-updated memory bank for storing representative view features is specifically:
based on the principle of entropy minimization, the view with the most confident classification prediction is selected as the representative of a multi-view three-dimensional target $x_i^T$: the N views of $x_i^T$ are fed into a classifier G to generate K-way classification results;
the prediction entropy of the jth view of the multi-view three-dimensional target $x_i^T$ is calculated as:
$$E(v_i^j) = -\sum_{k=1}^{K} p_{i,j}^k \log p_{i,j}^k$$
wherein $p_{i,j}^k$ denotes the predicted probability that the jth view of the multi-view three-dimensional target $x_i^T$ is classified into the kth class; $K$ is the total number of classes;
the representative view update formula is:
$$m_i \leftarrow \mu\, m_i + (1 - \mu)\, z_i^{\hat{j}}$$
wherein $\mu \in [0, 1]$ is the update coefficient; $m_i$ is the view feature stored in the memory bank; $\hat{j} = \arg\min_j E(v_i^j)$ indexes the view feature with minimal entropy; and $z_i^{\hat{j}} = g(f_i^{\hat{j}})$ is the low-dimensional visual feature of the minimum-entropy view.
5. An unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images, characterized by comprising:
a feature extraction module, configured to extract features of the two-dimensional image domain and the multi-view three-dimensional target domain with a feature extractor, respectively, to obtain the visual features of the two-dimensional images and the multi-view three-dimensional targets;
a domain adversarial learning module, configured to obtain cross-domain distribution-aligned visual features through domain adversarial learning, according to the visual features of the two-dimensional images, the label information of the two-dimensional images and the visual features of the multi-view three-dimensional targets;
an acquisition module, configured to obtain visual features of the multi-view three-dimensional targets with greater inter-class separation through contrastive learning and the obtained visual features;
an updating module, configured to obtain high-quality negative samples for the contrastive learning through the iteratively weight-updated memory bank that stores representative view features.
6. The unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images according to claim 5, wherein the acquisition module comprises:
a selecting submodule, configured to select a view feature of the ith multi-view three-dimensional target as an anchor, select another of its view features as a positive sample, and select a view feature of another three-dimensional target as a negative sample;
a calculating submodule, configured to calculate the similarity between the anchor and the positive sample and between the anchor and the negative sample, respectively, and calculate the contrastive loss based on the two similarities;
an obtaining submodule, configured to combine the source classification loss, the domain adversarial loss and the contrastive loss to jointly train the feature extractor, the domain discriminator, the classifier and the nonlinear mapping, obtaining the total loss function for self-supervised domain adaptation.
7. An unsupervised multi-view three-dimensional target retrieval device based on two-dimensional images, characterized by comprising:
a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform the method steps of any of claims 1-4.
8. A computer-readable storage medium, characterized in that it stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any of claims 1-4.
CN202110529135.7A 2021-05-14 2021-05-14 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device Active CN113240012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529135.7A CN113240012B (en) 2021-05-14 2021-05-14 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110529135.7A CN113240012B (en) 2021-05-14 2021-05-14 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device

Publications (2)

Publication Number Publication Date
CN113240012A true CN113240012A (en) 2021-08-10
CN113240012B CN113240012B (en) 2022-08-23

Family

ID=77134375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529135.7A Active CN113240012B (en) 2021-05-14 2021-05-14 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device

Country Status (1)

Country Link
CN (1) CN113240012B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779287A (en) * 2021-09-02 2021-12-10 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN114550098A (en) * 2022-02-28 2022-05-27 山东大学 Examination room monitoring video abnormal behavior detection method and system based on contrast learning
CN114969419A (en) * 2022-06-06 2022-08-30 金陵科技学院 Sketch-based three-dimensional model retrieval method guided by self-driven multi-view features
CN115082717A (en) * 2022-08-22 2022-09-20 成都不烦智能科技有限责任公司 Dynamic target identification and context memory cognition method and system based on visual perception
CN115640418A (en) * 2022-12-26 2023-01-24 天津师范大学 Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN117473105A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205135A (en) * 2015-09-15 2015-12-30 天津大学 3D (three-dimensional) model retrieving method based on topic model and retrieving device thereof
CN110688515A (en) * 2019-09-25 2020-01-14 北京影谱科技股份有限公司 Text image semantic conversion method and device, computing equipment and storage medium
CN111191492A (en) * 2018-11-15 2020-05-22 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and apparatus
CN112330825A (en) * 2020-11-13 2021-02-05 天津大学 Three-dimensional model retrieval method based on two-dimensional image information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205135A (en) * 2015-09-15 2015-12-30 天津大学 3D (three-dimensional) model retrieving method based on topic model and retrieving device thereof
CN111191492A (en) * 2018-11-15 2020-05-22 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and apparatus
CN110688515A (en) * 2019-09-25 2020-01-14 北京影谱科技股份有限公司 Text image semantic conversion method and device, computing equipment and storage medium
CN112330825A (en) * 2020-11-13 2021-02-05 天津大学 Three-dimensional model retrieval method based on two-dimensional image information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAN SONG ET AL.: "Monocular Image-Based 3-D Model Retrieval: A Benchmark", IEEE Transactions on Cybernetics *
ZHOU Yan et al.: "Three-dimensional shape feature extraction method based on deep learning", Computer Science *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779287A (en) * 2021-09-02 2021-12-10 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN113779287B (en) * 2021-09-02 2023-09-15 天津大学 Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN114550098A (en) * 2022-02-28 2022-05-27 山东大学 Examination room monitoring video abnormal behavior detection method and system based on contrast learning
CN114550098B (en) * 2022-02-28 2024-06-11 山东大学 Examination room monitoring video abnormal behavior detection method and system based on contrast learning
CN114969419A (en) * 2022-06-06 2022-08-30 金陵科技学院 Sketch-based three-dimensional model retrieval method guided by self-driven multi-view features
CN115082717A (en) * 2022-08-22 2022-09-20 成都不烦智能科技有限责任公司 Dynamic target identification and context memory cognition method and system based on visual perception
CN115082717B (en) * 2022-08-22 2022-11-08 成都不烦智能科技有限责任公司 Dynamic target identification and context memory cognition method and system based on visual perception
CN115640418A (en) * 2022-12-26 2023-01-24 天津师范大学 Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN117473105A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117473105B (en) * 2023-12-28 2024-04-05 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components

Also Published As

Publication number Publication date
CN113240012B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN113240012B (en) Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
CN110059198B (en) Discrete hash retrieval method of cross-modal data based on similarity maintenance
CN111627065B (en) Visual positioning method and device and storage medium
CN105912611B (en) A kind of fast image retrieval method based on CNN
CN109918537B (en) HBase-based rapid retrieval method for ship monitoring video content
CN108681746B (en) Image identification method and device, electronic equipment and computer readable medium
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
CN108132968A (en) Network text is associated with the Weakly supervised learning method of Semantic unit with image
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
Bochinski et al. Deep active learning for in situ plankton classification
Yu et al. Stratified pooling based deep convolutional neural networks for human action recognition
CN106951551B (en) Multi-index image retrieval method combining GIST characteristics
Cheng et al. A data-driven point cloud simplification framework for city-scale image-based localization
Shrivastava et al. Unsupervised domain adaptation using parallel transport on Grassmann manifold
WO2023221790A1 (en) Image encoder training method and apparatus, device, and medium
CN113515656A (en) Multi-view target identification and retrieval method and device based on incremental learning
Ghahremani et al. Towards parameter-optimized vessel re-identification based on IORnet
Gao et al. SHREC’15 Track: 3D object retrieval with multimodal views
Yu et al. Hope: Hierarchical object prototype encoding for efficient object instance search in videos
Gao et al. Efficient view-based 3-D object retrieval via hypergraph learning
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN114741549A (en) Image duplicate checking method and device based on LIRE, computer equipment and storage medium
Makadia Feature tracking for wide-baseline image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant