CN114742890A - 6D pose estimation dataset migration method based on image content and style decoupling - Google Patents

6D pose estimation dataset migration method based on image content and style decoupling

Info

Publication number
CN114742890A
Authority
CN
China
Prior art keywords
image
domain
style
content
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210261360.1A
Other languages
Chinese (zh)
Inventor
赵国英
朱梦婕
赵万青
张少博
彭进业
彭先霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN202210261360.1A priority Critical patent/CN114742890A/en
Publication of CN114742890A publication Critical patent/CN114742890A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/04 - Context-preserving transformations, e.g. by using an importance map
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 6D pose estimation dataset migration method based on decoupling of image content and style, comprising the following steps: step one, training a pseudo-paired image generation network; step two, training an autonomously designed image migration network. The method uses encoders to decouple images into content and style representations and reconstructs images from a content representation space shared with the source domain data and a style representation space shared with the target domain data; during reconstruction, the designed object structure feature extractor and keypoint attention feature extractor provide strong supervision of the object structure and refinement of the structure around keypoints. The method effectively bridges the inter-domain gap between real data and synthetic data; using decoupled representations as input effectively alleviates mode collapse, and the unlabeled target domain data is fully exploited without increasing the complexity of the 6D pose estimation algorithm.

Description

6D pose estimation dataset migration method based on image content and style decoupling
Technical Field
The invention belongs to the field of 6D pose estimation and relates to a 6D pose estimation dataset migration method based on decoupling of image content and style, which can effectively bridge the inter-domain gap between real data and synthetic data.
Background
The 6D pose estimation task aims to estimate the 6 degrees of freedom of a given object relative to the camera, namely its 3D rotation and 3D translation, and is a fundamental task in computer vision. It is widely used in many real-world applications such as robotic manipulation, augmented reality, and autonomous driving.
In recent years, with the development of deep neural networks, many 6D pose estimation algorithms based on convolutional neural networks have been proposed and achieve good performance. Convolutional neural networks are, however, strongly data-driven, so a large amount of real data with 3D pose labels is usually required for training to obtain good results. In practice, 3D pose labels for real images are extremely difficult to obtain, whereas 3D pose labels for synthetic images are easy to generate. Because of the domain gap between real data and synthetic data, the performance of a 6D pose estimation network trained on a synthetic dataset can degrade severely when it is tested on real images. How to reduce the inter-domain gap between unlabeled real data and labeled synthetic data is therefore attracting the attention of more and more researchers.
Image-to-image migration methods can be used for 6D pose estimation dataset migration and fall into paired image domain migration and unpaired image domain migration. Although paired image domain migration performs well in structure preservation and style transfer, its pairing conditions are too strict: the real and synthetic datasets used for 6D pose estimation cannot satisfy the pairing requirement. Unpaired image domain migration works well in tasks such as object detection and classification, but because no image pairs can be formed and strong supervision of the object structure is lacking, it performs poorly in pixel-level tasks such as 6D pose estimation.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a 6D pose estimation dataset migration method based on decoupling of image content and style. The method achieves inter-domain migration of unpaired images while strongly supervising the object structure, and a migration network is designed specifically for the 6D pose estimation task, so that the inter-domain gap between real data and synthetic data in a 6D pose estimation dataset is effectively bridged.
The technical scheme of the invention comprises the following steps:
A 6D pose estimation dataset migration method based on image content and style decoupling, comprising:
step one, training a pseudo-paired image generation network, namely feeding a source domain image I_s and a target domain image I_t into the pseudo-paired image generation network for training;
first, encoders are used to obtain a domain-invariant content space of cross-domain shared information and a domain-specific style space; next, the domain-invariant content spaces are exchanged while the domain-specific style spaces are kept fixed, and the resulting representation space of each domain is fed into its image generator to generate a pseudo-paired image of that domain; finally, the pseudo-paired images are decoupled and their domain-invariant content spaces are exchanged again to obtain a reconstructed source domain image Î_s and a reconstructed target domain image Î_t, so that training on unpaired data is achieved;
step two, training an autonomously designed image migration network: on the basis of the trained pseudo-paired image generation network, the image migration network uses an object structure feature extractor H^c, a style feature extractor H^s and a keypoint attention feature extractor H^k to further refine the image generator for the structure near object keypoints.
Optionally, the keypoint attention feature extractor H^k is trained with source domain images I_s; the source domain image I_s is fed into the keypoint attention feature extractor H^k to obtain a keypoint heatmap, and the heatmap is used to apply attention weighting to the structural loss of the source domain image processed by the object structure feature extractor H^c. The keypoint structure loss is defined as:

$$L_{\mathrm{keypoint}}^{2} = \left\| H^{k}(I_s) \circ \left( H^{c}(T_t) - H^{c}(I_s) \right) \right\|_{2}$$

where L denotes a loss, the subscript keypoint denotes the keypoint structure loss and the superscript 2 denotes an L2 loss; H^c is the structural feature extractor; H^k is the keypoint feature extractor; I_s is the source domain image; T_t is the migration image of the target domain; ∘ denotes the Hadamard product; and the subscript 2 of the double bars denotes the two-norm.
Optionally, the total loss function of the image migration network is:

$$L_{\mathrm{total}} = \lambda_{KL} L_{KL} + \lambda_{adv}^{\mathrm{content}} L_{adv}^{\mathrm{content}} + \lambda_{adv}^{\mathrm{domain}} L_{adv}^{\mathrm{domain}} + \lambda_{recon} L_{recon}^{1} + \lambda_{structure} L_{structure}^{2} + \lambda_{style} L_{style}^{2} + \lambda_{keypoint} L_{keypoint}^{2} + \lambda_{color} L_{color}^{1}$$

where L_total is the total loss; each λ is the weight of the loss carrying the same sub- and superscript; L_KL is the KL loss; L_adv^content is the content adversarial loss; L_adv^domain is the domain adversarial loss; L_recon^1 is the reconstruction loss (the superscript 1 indicates an L1 loss); L_structure^2 is the structural loss (the superscript 2 indicates an L2 loss); L_style^2 is the style loss; L_keypoint^2 is the keypoint structure loss; and L_color^1 is the color loss.
Optionally, step two specifically comprises:
(2.1) feeding the source domain image I_s into the source domain content encoder E_s^c to obtain the source domain image content code z_s^c, and feeding the target domain image I_t into the target domain style encoder E_t^s to obtain the target domain image style code z_t^s;
(2.2) feeding the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the migration image T_t of the source domain;
(2.3) extracting object structural features f^c with the structural feature extractor H^c: the structural feature extractor H^c uses a pre-trained VGG-19 network; the object part T_object of a migration image T is obtained with the mask image M, then T_object is fed into the pre-trained VGG-19 network and the conv4_2 layer is taken out as the object structural feature f^c. The object structural loss is defined as:

$$L_{\mathrm{structure}}^{2} = \left\| f^{c}_{T_t} - f^{c}_{I_s} \right\|_{2} + \left\| f^{c}_{T_s} - f^{c}_{I_t} \right\|_{2}$$

where f^c is the object structural feature, the subscript T_s denotes the migration image of the source domain, T_t the migration image of the target domain, I_t the target domain image and I_s the source domain image; the double bars denote a norm and their subscript 2 denotes the two-norm;
(2.4) extracting style features f^s with the style feature extractor H^s: the style feature extractor H^s uses the same pre-trained VGG-19 network as the structural feature extractor; the migration image T is fed into the pre-trained VGG-19 network, and the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers are taken out to compute Gram matrices as the style feature f^s, with layer weights 1, 0.8, 0.5, 0.3 and 0.1, respectively. The style loss is defined as:

$$L_{\mathrm{style}}^{2} = \left\| f^{s}_{T_t} - f^{s}_{I_t} \right\|_{2}$$

where f^s denotes the style feature, the subscript T_t denotes the migration image of the target domain and I_t the target domain image; the double bars denote a norm and their subscript 2 denotes the two-norm;
(2.5) extracting keypoint heatmaps of the image with the keypoint attention feature extractor H^k: the keypoint attention feature extractor H^k is trained with source domain images I_s; the source domain image I_s is fed into the keypoint attention feature extractor H^k to obtain a keypoint heatmap, and the heatmap is used to apply attention weighting to the structural loss of the source domain image processed by the object structure feature extractor H^c. The keypoint structure loss is defined as:

$$L_{\mathrm{keypoint}}^{2} = \left\| H^{k}(I_s) \circ \left( H^{c}(T_t) - H^{c}(I_s) \right) \right\|_{2}$$

where H^c is the structural feature extractor, H^k is the keypoint feature extractor, I_s is the source domain image, T_t is the migration image of the target domain, ∘ denotes the Hadamard product, and the subscript 2 of the double bars denotes the two-norm;
(2.6) since part of the inter-domain difference is caused by lighting, the lighting is decoupled when defining the color loss: ρ converts an image from the RGB color model to the LAB color model, the lightness channel is removed, and the L1 loss is applied to the remaining two channels:

$$L_{\mathrm{color}}^{1} = \left\| \rho_{ab}\!\left( I_s \circ M_s \right) - \rho_{ab}\!\left( T_t \circ M_s \right) \right\|_{1}$$

where L denotes a loss and the subscript color the color loss (the superscript 1 indicates an L1 loss); ρ_ab denotes the a and b channels of the image after conversion from the RGB color model to the LAB color model; I_s is the source domain image; M_s is the source domain mask image; T_t is the migration image of the target domain; the double bars denote a norm and their subscript 1 denotes the one-norm.
Optionally, extracting keypoint heatmaps of the image with the keypoint attention feature extractor H^k specifically comprises:
2.5.1. extracting a feature map of the input image with a feature pyramid network and ResNet101;
2.5.2. feeding the extracted features into the keypoint extractor H^k, a network consisting of 4 consecutive 3 × 3 convolutional layers, each followed by a ReLU activation; the last layer is upsampled to obtain a feature map of the same size as the input picture, and softmax is applied to the extracted features to generate a pixel-level probability map h representing the probability that each pixel is a keypoint.
Optionally, step one specifically comprises:
(1.1) feeding the source domain image I_s into the source domain style encoder E_s^s and the source domain content encoder E_s^c to obtain the source domain style code z_s^s and the source domain content code z_s^c; feeding the target domain image I_t into the target domain style encoder E_t^s and the target domain content encoder E_t^c to obtain the target domain style code z_t^s and the target domain content code z_t^c;
(1.2) feeding the source domain content code z_s^c and the target domain content code z_t^c into the content discriminator D^c, which distinguishes the content codes of the two domains;
(1.3) feeding the source domain image style code z_s^s and the target domain image content code z_t^c into the source domain image generator G_s to generate the pseudo-paired image F_s of the target domain; feeding the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the pseudo-paired image F_t of the source domain;
(1.4) feeding the pseudo-paired image F_t of the source domain into the target domain discriminator D_t, which distinguishes real images of the target domain from generated ones;
(1.5) feeding the pseudo-paired image F_s of the target domain into the source domain discriminator D_s, which distinguishes real images of the source domain from generated ones;
(1.6) feeding the style code of the pseudo-paired image F_s of the target domain and the content code of the pseudo-paired image F_t of the source domain into the source domain image generator G_s to generate the reconstructed source domain image Î_s; feeding the style code of the pseudo-paired image F_t of the source domain and the content code of the pseudo-paired image F_s of the target domain into the target domain image generator G_t to generate the reconstructed target domain image Î_t.
Optionally, the source domain content encoder E_s^c and the target domain image generator G_t share the weights of the last layer; and the target domain content encoder E_t^c and the source domain image generator G_s share the weights of the last layer.
Optionally, the method further comprises a step three of testing the network:
step 3.1, feeding the source domain image I_s into the source domain content encoder E_s^c to obtain the source domain content code z_s^c, and feeding the target domain image I_t into the target domain style encoder E_t^s to obtain the target domain style code z_t^s;
step 3.2, feeding the source domain content code z_s^c and the target domain style code z_t^s into the target domain image generator G_t to obtain the migration image T_t.
Compared with the prior art, the invention has the following advantages:
1. The invention achieves inter-domain migration of unpaired images while applying strong supervision to the object structure.
2. A migration network is designed specifically for the 6D pose estimation task, effectively bridging the inter-domain gap between real data and synthetic data in a 6D pose estimation dataset.
3. The invention uses decoupled representations as input, which effectively alleviates mode collapse and increases output diversity.
4. The invention addresses the inter-domain gap from the data generation side and fully exploits unlabeled target domain data without increasing the complexity of the 6D pose estimation algorithm.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
Fig. 1 is an overview of the 6D pose estimation dataset migration method based on image content and style decoupling according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to embodiments, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the invention:
the domain-invariant content space refers to a space composed of structural features of objects that do not vary with the image domain.
The domain-specific style space means that each image domain has its own style characteristics, and the style characteristics of different domains constitute different style spaces.
The representation space of the domain is divided into a source domain representation space and a target domain representation space. The representation space of the source domain consists of a content space of the source domain and a style space of the source domain, the content space of the source domain consists of content characteristics of the source domain, and the style space of the source domain consists of style characteristics of the source domain; the representation space of the target domain is composed of a content space of the target domain and a style space of the target domain, the content space of the target domain is composed of content features of the target domain, and the style space of the target domain is composed of style features of the target domain.
The source domain image refers to a composite image, and the target domain image refers to a real image.
The meanings of all letters and their scripts in the invention are defined uniformly as follows:
superscripts: c denotes content (structure); s denotes style; k denotes keypoint;
subscripts: s denotes the source domain; t denotes the target domain;
H generally denotes an extractor; G an image generator; E an encoder; z a code; I an original image; T a migration image; M a mask image; L a loss; λ a weight. A letter combined with its superscript and subscript denotes the corresponding item; for example, E_s^s denotes the source domain style encoder.
With reference to Fig. 1, the 6D pose estimation dataset migration method for RGB images of the invention comprises:
Step one: train the pseudo-paired image generation network: the source domain image I_s and the target domain image I_t are fed into the pseudo-paired image generation network.
First, the encoders are used to obtain, from the source domain image I_s and the target domain image I_t, a domain-invariant content space of cross-domain shared information and a domain-specific style space. Next, the domain-invariant content spaces are exchanged while the domain-specific style spaces are kept fixed, and the resulting representation space of each domain is fed into the image generator of that domain (source or target) to generate its pseudo-paired image. Finally, the pseudo-paired images are decoupled and the domain-invariant content spaces are exchanged again to reconstruct the original source domain image Î_s and target domain image Î_t, so that training on unpaired data is achieved. The method uses decoupled representations as input, which effectively alleviates mode collapse and increases output diversity; it addresses the inter-domain gap from the data generation side and fully exploits unlabeled target domain data without increasing the complexity of the 6D pose estimation algorithm. A minimal sketch of this cross-domain swap is given after this paragraph.
Step two: train the autonomously designed image migration network: on the basis of the trained pseudo-paired image generation network, the image migration network adds an object structure feature extractor H^c, a style feature extractor H^s and a keypoint attention feature extractor H^k, and the image generator is further tuned. The tuning refines the structure near the object keypoints through the keypoint structure loss of step two. The image processed by the object structure feature extractor H^c and the style feature extractor H^s is the migrated source domain image, i.e. the image obtained after migrating the source domain image.
In step one, the pseudo-paired image generation network is trained with the input 6D pose estimation data, comprising the following steps:
Step 101: feed the source domain image I_s into the source domain style encoder E_s^s and content encoder E_s^c to obtain the style code z_s^s and content code z_s^c; feed the target domain image I_t into the target domain style encoder E_t^s and content encoder E_t^c to obtain the style code z_t^s and content code z_t^c.
Step 102: feed the source domain image style code z_s^s and the target domain image content code z_t^c into the source domain image generator G_s to generate the pseudo-paired image F_s of the target domain; feed the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the pseudo-paired image F_t of the source domain.
Step 103: feed the pseudo-paired image F_s of the target domain into the source domain style encoder E_s^s and content encoder E_s^c to obtain its style code and content code; feed the pseudo-paired image F_t of the source domain into the target domain style encoder E_t^s and content encoder E_t^c to obtain its style code and content code.
Step 104: feed the style code of the pseudo-paired image F_s of the target domain and the content code of the pseudo-paired image F_t of the source domain into the source domain image generator G_s to generate the reconstructed source domain image Î_s; feed the style code of the pseudo-paired image F_t of the source domain and the content code of the pseudo-paired image F_s of the target domain into the target domain image generator G_t to generate the reconstructed target domain image Î_t.
The training of the image migration network comprises the following steps:
Step 201: feed the source domain image I_s into the source domain content encoder E_s^c to obtain the content code z_s^c; feed the target domain image I_t into the target domain style encoder E_t^s to obtain the style code z_t^s.
Step 202: feed the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the migration image T_t of the source domain.
Step 203: feed the migration image T_t of the source domain, the source domain image I_s and its mask image M_s into the pre-trained structural feature extractor H^c (because the content of the migration image T_t is the same as that of the source domain image I_s, the mask image of T_t is the source domain mask image M_s), obtaining the object structural feature f^c_{T_t} of T_t and the object structural feature f^c_{I_s} of I_s.
Step 204: feed the migration image T_t of the source domain and the target domain image I_t into the pre-trained style feature extractor H^s, obtaining the style feature f^s_{T_t} of T_t and the style feature f^s_{I_t} of I_t.
Step 205: feed the source domain image I_s into the pre-trained keypoint attention feature extractor H^k, obtaining the keypoint heatmap H^k(I_s) of the source domain image I_s.
The invention is further elucidated with reference to the drawing.
(a) Pseudo-paired image generation network. The network training comprises the following steps:
Step 301: feed the source domain image I_s into the source domain style encoder E_s^s and content encoder E_s^c to obtain the style code z_s^s and content code z_s^c; feed the target domain image I_t into the target domain style encoder E_t^s and content encoder E_t^c to obtain the style code z_t^s and content code z_t^c.
Step 302: a KL loss is applied to the style codes to encourage the style representation to be as close as possible to a prior Gaussian distribution:

$$L_{KL} = \mathbb{E}\left[ D_{KL}\left( Z^{s} \,\|\, N(0, 1) \right) \right], \qquad D_{KL}(p \,\|\, q) = \int p(z) \log \frac{p(z)}{q(z)} \, dz$$

where p denotes the true sample distribution, q the estimated sample distribution, and D_KL(p || q) the divergence between p and q; Z^s denotes a style code and Z a code.
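As a hedged illustration, the KL term can be computed in closed form when the style encoder outputs the mean and log-variance of a Gaussian, as in a VAE; the function below is a sketch under that assumption (the text does not state the exact parameterization).

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), averaged over the batch.
    # mu, logvar: tensors of shape (batch, style_dim) produced by a style encoder.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl.sum(dim=1).mean()
```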
Step 303: feed the source domain content code z_s^c and the target domain content code z_t^c into the content discriminator D^c; the content adversarial loss is:

$$L_{adv}^{\mathrm{content}} = \mathbb{E}\left[ \log D^{c}\!\left( z_{s}^{c} \right) \right] + \mathbb{E}\left[ \log\left( 1 - D^{c}\!\left( z_{t}^{c} \right) \right) \right]$$
step 304: encoding source domain image styles
Figure BDA00035502458200000719
And content encoding of target domain images
Figure BDA00035502458200000720
Feed source domain image generator GsGenerating a pseudo-paired image F of a target Domains(ii) a Encoding target domain image styles
Figure BDA00035502458200000722
And content encoding of source domain images
Figure BDA00035502458200000721
Input target domain image generator GtPseudo-paired image F of medium-generation source domaint
Step 305: feed the pseudo-paired image F_t of the source domain into the target domain discriminator D_t; the target domain adversarial loss is:

$$L_{adv}^{t} = \mathbb{E}_{I_t}\left[ \log D_{t}\left( I_t \right) \right] + \mathbb{E}\left[ \log\left( 1 - D_{t}\left( F_t \right) \right) \right]$$

Step 306: feed the pseudo-paired image F_s of the target domain into the source domain discriminator D_s; the source domain adversarial loss is:

$$L_{adv}^{s} = \mathbb{E}_{I_s}\left[ \log D_{s}\left( I_s \right) \right] + \mathbb{E}\left[ \log\left( 1 - D_{s}\left( F_s \right) \right) \right]$$

and the domain adversarial loss L_adv^domain is:

$$L_{adv}^{\mathrm{domain}} = L_{adv}^{s} + L_{adv}^{t}$$
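For illustration only, the discriminator losses above can be implemented with binary cross-entropy; the sketch below assumes discriminators that output a single logit per image, which is an assumption rather than the invention's exact formulation.

```python
import torch
import torch.nn.functional as F

def domain_adv_losses(D_t, D_s, I_t, I_s, F_t, F_s):
    # Non-saturating GAN losses on logits; ones = "real", zeros = "generated".
    def d_loss(D, real, fake):
        real_logit, fake_logit = D(real), D(fake.detach())
        return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) +
                F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    def g_loss(D, fake):
        fake_logit = D(fake)
        return F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    L_D = d_loss(D_t, I_t, F_t) + d_loss(D_s, I_s, F_s)   # discriminator update
    L_G = g_loss(D_t, F_t) + g_loss(D_s, F_s)             # generator update
    return L_D, L_G
```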
step 307: encoding a style of a pseudo-paired image of a target domain
Figure BDA0003550245820000086
Content encoding of pseudo-paired images with source domain
Figure BDA0003550245820000087
Feed source domain image generator GsTo generate a reconstructed source domain image
Figure BDA0003550245820000088
Encoding a style of a pseudo-paired image of a source domain
Figure BDA0003550245820000089
Content encoding of pseudo-paired images with target domain
Figure BDA00035502458200000810
Input target domain image generator GtTo generate a reconstructed target field image
Figure BDA00035502458200000811
The reconstruction loss is defined as:
Figure BDA0003550245820000083
step 308: the overall loss function is:
Figure BDA0003550245820000084
Ltotal: total loss; λ: weights (corresponding to the same loss function as the trailing corner markers); l isKLLoss of KL;
Figure BDA00035502458200000812
loss of domain antagonism;
Figure BDA00035502458200000813
reconstruction loss, with the loss denoted as L1 loss at subscript 1;
Figure BDA00035502458200000814
structural losses, with the lower subscript 2 indicating losses as L2 losses);
Figure BDA00035502458200000815
loss of style;
Figure BDA00035502458200000816
loss of key point structure;
Figure BDA00035502458200000817
color is lost.
(b) The autonomously designed image migration network. On the basis of the pseudo-paired image generation network, it consists of the source domain content encoder E_s^c, the target domain style encoder E_t^s and the target domain image generator G_t trained in the pseudo-paired image generation network, and adds the designed structural feature extractor H^c, style feature extractor H^s and keypoint attention feature extractor H^k. The specific steps are as follows:
step 401, source domain image IsContent encoder for delivery to source domain
Figure BDA00035502458200000820
Obtaining content coding
Figure BDA00035502458200000821
Target domain image ItStyle encoder for delivery to a target domain
Figure BDA00035502458200000822
Deriving a style code
Figure BDA00035502458200000823
Step 402, target domain image style coding
Figure BDA00035502458200000824
And content encoding of source domain images
Figure BDA00035502458200000825
Input target domain image generator GtIn generating a migration image T of a source domaint
Step 403: feed the migration image T_t of the source domain, the source domain image I_s and its mask image M_s into the pre-trained structural feature extractor H^c. H^c uses a pre-trained VGG-19 network: the object part T_object of a migration image T is obtained with the mask image M, then T_object is fed into the pre-trained VGG-19 network and the conv4_2 layer is taken out as the object structural feature f^c. The object structural loss is defined as:

$$L_{\mathrm{structure}}^{2} = \left\| f^{c}_{T_t} - f^{c}_{I_s} \right\|_{2} + \left\| f^{c}_{T_s} - f^{c}_{I_t} \right\|_{2}$$

where f is a feature and the superscript c denotes structure; the subscript T_s denotes the migration image of the source domain, T_t the migration image of the target domain, I_t the target domain image and I_s the source domain image; the double bars denote a norm and their subscript 2 denotes the two-norm.
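A minimal sketch of such a VGG-19 based structural feature extractor and loss is given below; the torchvision layer index of conv4_2 (index 21 in the standard VGG-19 feature stack) and the masking strategy are assumptions about implementation details the text does not fix.

```python
import torch
import torch.nn as nn
from torchvision import models

class StructureExtractor(nn.Module):
    """Object structural feature f^c: VGG-19 activations at conv4_2 of the masked object."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()  # or weights=... in newer torchvision
        self.slice = nn.Sequential(*list(vgg.children())[:22])  # up to conv4_2 (index 21)
        for p in self.slice.parameters():
            p.requires_grad_(False)

    def forward(self, image, mask):
        return self.slice(image * mask)  # mask out the background, keep the object part

def structure_loss(extractor, T_t, I_s, M_s):
    # L2 distance between structural features of the migrated image and its content source.
    f_T = extractor(T_t, M_s)
    f_I = extractor(I_s, M_s)
    return torch.norm(f_T - f_I, p=2)
```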
Step 404: feed the migration image T_t of the source domain and the target domain image I_t into the pre-trained style feature extractor H^s. H^s uses the same pre-trained VGG-19 network as the structural feature extractor; the migration image T is fed into it, and the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers are taken out to compute Gram matrices as the style feature f^s, with layer weights 1, 0.8, 0.5, 0.3 and 0.1, respectively. The style loss is:

$$L_{\mathrm{style}}^{2} = \left\| f^{s}_{T_t} - f^{s}_{I_t} \right\|_{2}$$

where f is a feature and the superscript s denotes style; the subscript T_t denotes the migration image of the target domain and I_t the target domain image; the double bars denote a norm and their subscript 2 denotes the two-norm.
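A sketch of the Gram-matrix style loss with the per-layer weights stated above; the torchvision layer indices for conv1_1/conv2_1/conv3_1/conv4_1/conv5_1 (0, 5, 10, 19, 28) are an assumption based on the standard VGG-19 layout.

```python
import torch
import torch.nn as nn
from torchvision import models

STYLE_LAYERS = {0: 1.0, 5: 0.8, 10: 0.5, 19: 0.3, 28: 0.1}  # conv1_1 .. conv5_1 and their weights

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

class StyleExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.vgg = models.vgg19(pretrained=True).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        grams = {}
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in STYLE_LAYERS:
                grams[idx] = gram(x)
            if idx >= max(STYLE_LAYERS):
                break
        return grams

def style_loss(extractor, T_t, I_t):
    g_T, g_I = extractor(T_t), extractor(I_t)
    return sum(w * torch.norm(g_T[i] - g_I[i], p=2) for i, w in STYLE_LAYERS.items())
```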
Step 405: feed the source domain image I_s into the pre-trained keypoint attention feature extractor H^k to obtain a keypoint heatmap, and use the heatmap to apply attention weighting to the structural loss of the generated image. The keypoint structure loss is:

$$L_{\mathrm{keypoint}}^{2} = \left\| H^{k}(I_s) \circ \left( H^{c}(T_t) - H^{c}(I_s) \right) \right\|_{2}$$

where H^c is the structural feature extractor, H^k is the keypoint feature extractor, I_s is the source domain image and T_t is the migration image of the target domain; ∘ denotes the Hadamard product, i.e. the element-wise product of matrices; the double bars denote a norm and their subscript 2 denotes the two-norm.
Step 406: since part of the inter-domain difference is caused by lighting, the lighting is decoupled when defining the color loss: ρ converts an image from the RGB color model to the LAB color model, the lightness channel is removed, and the L1 loss is applied to the remaining two channels:

$$L_{\mathrm{color}}^{1} = \left\| \rho_{ab}\!\left( I_s \circ M_s \right) - \rho_{ab}\!\left( T_t \circ M_s \right) \right\|_{1}$$

where L denotes a loss and the subscript color the color loss (the superscript 1 indicates an L1 loss, a common loss); ρ_ab denotes the a and b channels of the image after conversion from the RGB color model to the LAB color model; I_s is the source domain image; M_s is the source domain mask image; T_t is the migration image of the target domain; the double bars denote a norm and their subscript 1 denotes the one-norm.
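A sketch of the color loss using an RGB-to-LAB conversion; here the conversion is delegated to kornia.color.rgb_to_lab (an assumed dependency, any RGB-to-LAB routine would do), and the lightness channel (index 0) is dropped before the L1 comparison.

```python
import torch
import kornia.color as KC  # assumed dependency for the RGB -> LAB conversion

def color_loss(I_s, T_t, M_s):
    lab_src = KC.rgb_to_lab(I_s * M_s)   # (B, 3, H, W); channel 0 is the lightness L
    lab_mig = KC.rgb_to_lab(T_t * M_s)
    ab_src, ab_mig = lab_src[:, 1:], lab_mig[:, 1:]   # keep only the a and b channels
    return torch.nn.functional.l1_loss(ab_src, ab_mig, reduction='sum')
```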
Step 407: the total loss function of the image style migration network is:

$$L_{\mathrm{total}} = \lambda_{KL} L_{KL} + \lambda_{adv}^{\mathrm{content}} L_{adv}^{\mathrm{content}} + \lambda_{adv}^{\mathrm{domain}} L_{adv}^{\mathrm{domain}} + \lambda_{recon} L_{recon}^{1} + \lambda_{structure} L_{structure}^{2} + \lambda_{style} L_{style}^{2} + \lambda_{keypoint} L_{keypoint}^{2} + \lambda_{color} L_{color}^{1}$$

where L_total is the total loss; each λ is the weight of the loss carrying the same sub- and superscript (for example, λ_adv^content is the weight of the content adversarial loss); L_KL is the KL loss; L_adv^content is the content adversarial loss; L_adv^domain is the domain adversarial loss; L_recon^1 is the reconstruction loss (the superscript 1 indicates an L1 loss); L_structure^2 is the structural loss (the superscript 2 indicates an L2 loss); L_style^2 is the style loss; L_keypoint^2 is the keypoint structure loss; and L_color^1 is the color loss.
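A sketch of assembling the weighted total loss; the numeric weights below are placeholders, since the text does not disclose their values.

```python
# Hypothetical weights; the actual values are not disclosed in the text.
WEIGHTS = dict(kl=0.01, adv_content=1.0, adv_domain=1.0, recon=10.0,
               structure=1.0, style=1.0, keypoint=1.0, color=1.0)

def total_loss(losses, w=WEIGHTS):
    """losses: dict with the same keys as WEIGHTS, each a scalar torch tensor."""
    return sum(w[name] * value for name, value in losses.items())
```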
The heatmap extraction in step 405 proceeds specifically as follows:
Step 501: extract a feature map of the input image with a feature pyramid network and ResNet101.
Step 502: feed the extracted features into the keypoint extraction network, which consists of 4 consecutive 3 × 3 convolutional layers, each followed by a ReLU activation; the last layer is upsampled to a feature map of the same size as the input picture, and softmax is applied to generate a pixel-level probability map h representing the probability that each pixel is a keypoint.
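A sketch of such a keypoint attention head on top of a ResNet101-FPN feature map; the channel widths and the final 1 × 1 projection are assumptions, since the text only fixes the 4 × (3 × 3 conv + ReLU) structure, the upsampling and the softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointAttentionHead(nn.Module):
    """H^k head: 4 x (3x3 conv + ReLU), upsample to input size, pixel-wise softmax."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c = mid_channels
        self.convs = nn.Sequential(*layers)
        self.out = nn.Conv2d(mid_channels, 1, kernel_size=1)  # assumed projection to one heatmap

    def forward(self, feat, out_size):
        x = self.out(self.convs(feat))
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        b, _, h, w = x.shape
        return F.softmax(x.view(b, -1), dim=1).view(b, 1, h, w)  # pixel-level probability map h
```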
(c) Test network. The specific steps are as follows:
Step 601: feed the source domain image I_s into the source domain content encoder E_s^c to obtain the content code z_s^c; feed the target domain image I_t into the target domain style encoder E_t^s to obtain the style code z_t^s.
Step 602: feed the content code z_s^c and the style code z_t^s into the target domain image generator G_t to obtain the migration image T_t.
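At test time the migration therefore reduces to one encoder pass per domain and one generator pass; a minimal sketch (module names as in the earlier sketch, hence assumptions):

```python
import torch

@torch.no_grad()
def migrate(Ec_s, Es_t, G_t, I_s, I_t):
    """Render the synthetic (source) image in the style of the real (target) image."""
    z_c = Ec_s(I_s)       # content code of the source domain image
    z_s = Es_t(I_t)       # style code of the target domain image
    return G_t(z_c, z_s)  # migrated image T_t
```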
Experiments:
To demonstrate the effectiveness of the method, tests were performed on the LINEMOD real dataset and the LINEMOD-PBR synthetic dataset. First, the RGB images of the real and synthetic datasets are fed into the network to obtain the migrated RGB images of the synthetic dataset; then the synthetic images and the migrated images, each with the labels of the synthetic dataset, are fed into a 6D pose estimation network, yielding one 6D pose estimation model trained on the synthetic images and one trained on the migrated images; finally, the performance of the two models is tested on the real dataset.
Because the LINEMOD real dataset is small, containing only about one thousand pictures, it is used only for testing. The 6D pose estimation network estimates keypoints with HRNet, proposed in 2019 by the University of Science and Technology of China and Microsoft Research Asia, and then computes the object pose with a PnP algorithm. The ADD metric is evaluated on eight objects such as Cat; as shown in Table 1 below, the average ADD value of the method is about ten percentage points higher than that obtained with the LINEMOD-PBR synthetic dataset alone, which shows that the method effectively bridges the inter-domain gap between real data and synthetic data in a 6D pose estimation dataset.
TABLE 1
Object PBR Ours
Cat 0.455 0.543
Cam 0.187 0.337
Phone 0.253 0.389
Iron 0.268 0.340
Driller 0.617 0.766
Can 0.592 0.727
Glue 0.211 0.255
Duck 0.139 0.164
Mean 0.340 0.440
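For reference, the ADD values in Table 1 follow the standard average-distance metric for 6D pose evaluation: a predicted pose is counted as correct when the mean distance between the model points transformed by the predicted pose and by the ground-truth pose is below a threshold, commonly 10% of the object diameter. A sketch, with the threshold fraction as an assumption:

```python
import numpy as np

def add_correct(R_pred, t_pred, R_gt, t_gt, model_points, diameter, frac=0.1):
    """ADD test: mean point distance under predicted vs. ground-truth pose below frac * diameter."""
    pred = model_points @ R_pred.T + t_pred   # (N, 3) points under the predicted pose
    gt = model_points @ R_gt.T + t_gt         # (N, 3) points under the ground-truth pose
    return np.mean(np.linalg.norm(pred - gt, axis=1)) < frac * diameter
```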
The foregoing is a further detailed description of the invention with reference to specific preferred embodiments, and the invention is not limited to these specific details. Those skilled in the art to which the invention pertains may make several simple deductions or substitutions without departing from the spirit of the invention, and all of them shall fall within the protection scope of the invention.

Claims (8)

1. A 6D pose estimation dataset migration method based on image content and style decoupling, characterized by comprising:
step one, training a pseudo-paired image generation network, namely feeding a source domain image I_s and a target domain image I_t into the pseudo-paired image generation network for training;
first, encoders are used to obtain a domain-invariant content space of cross-domain shared information and a domain-specific style space; next, the domain-invariant content spaces are exchanged while the domain-specific style spaces are kept fixed, and the resulting representation space of each domain is fed into its image generator to generate a pseudo-paired image of that domain; finally, the pseudo-paired images are decoupled and their domain-invariant content spaces are exchanged again to obtain a reconstructed source domain image Î_s and a reconstructed target domain image Î_t, so that training on unpaired data is achieved;
step two, training an autonomously designed image migration network: on the basis of the trained pseudo-paired image generation network, the image migration network uses an object structure feature extractor H^c, a style feature extractor H^s and a keypoint attention feature extractor H^k to further refine the image generator for the structure near object keypoints.
2. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1, characterized in that the keypoint attention feature extractor H^k is trained with source domain images I_s; the source domain image I_s is fed into the keypoint attention feature extractor H^k to obtain a keypoint heatmap, and the heatmap is used to apply attention weighting to the structural loss of the source domain image processed by the object structure feature extractor H^c, the keypoint structure loss being defined as:

$$L_{\mathrm{keypoint}}^{2} = \left\| H^{k}(I_s) \circ \left( H^{c}(T_t) - H^{c}(I_s) \right) \right\|_{2}$$

where L denotes a loss, the subscript keypoint the keypoint structure loss and the superscript 2 an L2 loss; H^c is the structural feature extractor; H^k is the keypoint feature extractor; I_s is the source domain image; T_t is the migration image of the target domain; ∘ denotes the Hadamard product; and the subscript 2 of the double bars denotes the two-norm.
3. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1 or 2, characterized in that the total loss function of the image migration network is:

$$L_{\mathrm{total}} = \lambda_{KL} L_{KL} + \lambda_{adv}^{\mathrm{content}} L_{adv}^{\mathrm{content}} + \lambda_{adv}^{\mathrm{domain}} L_{adv}^{\mathrm{domain}} + \lambda_{recon} L_{recon}^{1} + \lambda_{structure} L_{structure}^{2} + \lambda_{style} L_{style}^{2} + \lambda_{keypoint} L_{keypoint}^{2} + \lambda_{color} L_{color}^{1}$$

where L_total is the total loss; each λ is the weight of the loss carrying the same sub- and superscript; L_KL is the KL loss; L_adv^content is the content adversarial loss; L_adv^domain is the domain adversarial loss; L_recon^1 is the reconstruction loss (the superscript 1 indicates an L1 loss); L_structure^2 is the structural loss (the superscript 2 indicates an L2 loss); L_style^2 is the style loss; L_keypoint^2 is the keypoint structure loss; and L_color^1 is the color loss.
4. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1 or 2, characterized in that step two specifically comprises:
(2.1) feeding the source domain image I_s into the source domain content encoder E_s^c to obtain the source domain image content code z_s^c, and feeding the target domain image I_t into the target domain style encoder E_t^s to obtain the target domain image style code z_t^s;
(2.2) feeding the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the migration image T_t of the source domain;
(2.3) extracting object structural features f^c with the structural feature extractor H^c: the structural feature extractor H^c uses a pre-trained VGG-19 network; the object part T_object of a migration image T is obtained with the mask image M, then T_object is fed into the pre-trained VGG-19 network and the conv4_2 layer is taken out as the object structural feature f^c, the object structural loss being defined as:

$$L_{\mathrm{structure}}^{2} = \left\| f^{c}_{T_t} - f^{c}_{I_s} \right\|_{2} + \left\| f^{c}_{T_s} - f^{c}_{I_t} \right\|_{2}$$

where f^c is the object structural feature, the subscript T_s denotes the migration image of the source domain, T_t the migration image of the target domain, I_t the target domain image and I_s the source domain image; the double bars denote a norm and their subscript 2 denotes the two-norm;
(2.4) extracting style features f^s with the style feature extractor H^s: the style feature extractor H^s uses the same pre-trained VGG-19 network as the structural feature extractor; the migration image T is fed into the pre-trained VGG-19 network, and the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers are taken out to compute Gram matrices as the style feature f^s, with layer weights 1, 0.8, 0.5, 0.3 and 0.1, respectively, the style loss being defined as:

$$L_{\mathrm{style}}^{2} = \left\| f^{s}_{T_t} - f^{s}_{I_t} \right\|_{2}$$

where f^s denotes the style feature, the subscript T_t denotes the migration image of the target domain and I_t the target domain image; the double bars denote a norm and their subscript 2 denotes the two-norm;
(2.5) extracting keypoint heatmaps of the image with the keypoint attention feature extractor H^k: the keypoint attention feature extractor H^k is trained with source domain images I_s; the source domain image I_s is fed into the keypoint attention feature extractor H^k to obtain a keypoint heatmap, and the heatmap is used to apply attention weighting to the structural loss of the source domain image processed by the object structure feature extractor H^c, the keypoint structure loss being defined as:

$$L_{\mathrm{keypoint}}^{2} = \left\| H^{k}(I_s) \circ \left( H^{c}(T_t) - H^{c}(I_s) \right) \right\|_{2}$$

where L denotes a loss, the subscript keypoint the keypoint structure loss and the superscript 2 an L2 loss; H^c is the structural feature extractor; H^k is the keypoint feature extractor; I_s is the source domain image; T_t is the migration image of the target domain; ∘ denotes the Hadamard product; and the subscript 2 of the double bars denotes the two-norm;
(2.6) since part of the inter-domain difference is caused by lighting, the lighting is decoupled when defining the color loss: ρ converts an image from the RGB color model to the LAB color model, the lightness channel is removed, and the L1 loss is applied to the remaining two channels:

$$L_{\mathrm{color}}^{1} = \left\| \rho_{ab}\!\left( I_s \circ M_s \right) - \rho_{ab}\!\left( T_t \circ M_s \right) \right\|_{1}$$

where L denotes a loss and the subscript color the color loss (the superscript 1 indicates an L1 loss); ρ_ab denotes the a and b channels of the image after conversion from the RGB color model to the LAB color model; I_s is the source domain image; M_s is the source domain mask image; T_t is the migration image of the target domain; the double bars denote a norm and their subscript 1 denotes the one-norm.
5. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1 or 2, characterized in that extracting keypoint heatmaps of the image with the keypoint attention feature extractor H^k specifically comprises:
2.5.1. extracting a feature map of the input image with a feature pyramid network and ResNet101;
2.5.2. feeding the extracted features into the keypoint extractor H^k, a network consisting of 4 consecutive 3 × 3 convolutional layers, each followed by a ReLU activation; the last layer is upsampled to obtain a feature map of the same size as the input picture, and softmax is applied to the extracted features to generate a pixel-level probability map h representing the probability that each pixel is a keypoint.
6. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1 or 2, characterized in that step one specifically comprises:
(1.1) feeding the source domain image I_s into the source domain style encoder E_s^s and the source domain content encoder E_s^c to obtain the source domain style code z_s^s and the source domain content code z_s^c; feeding the target domain image I_t into the target domain style encoder E_t^s and the target domain content encoder E_t^c to obtain the target domain style code z_t^s and the target domain content code z_t^c;
(1.2) feeding the source domain content code z_s^c and the target domain content code z_t^c into the content discriminator D^c, which distinguishes the content codes of the two domains;
(1.3) feeding the source domain image style code z_s^s and the target domain image content code z_t^c into the source domain image generator G_s to generate the pseudo-paired image F_s of the target domain; feeding the target domain image style code z_t^s and the source domain image content code z_s^c into the target domain image generator G_t to generate the pseudo-paired image F_t of the source domain;
(1.4) feeding the pseudo-paired image F_t of the source domain into the target domain discriminator D_t, which distinguishes real images of the target domain from generated ones;
(1.5) feeding the pseudo-paired image F_s of the target domain into the source domain discriminator D_s, which distinguishes real images of the source domain from generated ones;
(1.6) feeding the style code of the pseudo-paired image F_s of the target domain and the content code of the pseudo-paired image F_t of the source domain into the source domain image generator G_s to generate the reconstructed source domain image Î_s; feeding the style code of the pseudo-paired image F_t of the source domain and the content code of the pseudo-paired image F_s of the target domain into the target domain image generator G_t to generate the reconstructed target domain image Î_t.
7. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 6, characterized in that the source domain content encoder E_s^c and the target domain image generator G_t share the weights of the last layer; and the target domain content encoder E_t^c and the source domain image generator G_s share the weights of the last layer.
8. The 6D pose estimation dataset migration method based on image content and style decoupling according to claim 1 or 2, further comprising a step three of testing the network:
step 3.1, feeding the source domain image I_s into the source domain content encoder E_s^c to obtain the source domain content code z_s^c, and feeding the target domain image I_t into the target domain style encoder E_t^s to obtain the target domain style code z_t^s;
step 3.2, feeding the source domain content code z_s^c and the target domain style code z_t^s into the target domain image generator G_t to obtain the migration image T_t.
CN202210261360.1A 2022-03-16 2022-03-16 6D pose estimation dataset migration method based on image content and style decoupling Pending CN114742890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210261360.1A CN114742890A (en) 2022-03-16 6D pose estimation dataset migration method based on image content and style decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210261360.1A CN114742890A (en) 2022-03-16 6D pose estimation dataset migration method based on image content and style decoupling

Publications (1)

Publication Number Publication Date
CN114742890A true CN114742890A (en) 2022-07-12

Family

ID=82276292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210261360.1A Pending CN114742890A (en) 2022-03-16 6D pose estimation dataset migration method based on image content and style decoupling

Country Status (1)

Country Link
CN (1) CN114742890A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546295A (en) * 2022-08-26 2022-12-30 西北大学 Target 6D pose estimation model training method and target 6D pose estimation method
CN115546295B (en) * 2022-08-26 2023-11-07 西北大学 Target 6D pose estimation model training method and target 6D pose estimation method

Similar Documents

Publication Publication Date Title
Melekhov et al. Dgc-net: Dense geometric correspondence network
Nataraj et al. Detecting GAN generated fake images using co-occurrence matrices
Qian et al. Learning and transferring representations for image steganalysis using convolutional neural network
CN108182441B (en) Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN105224942B (en) RGB-D image classification method and system
CN104268593B (en) The face identification method of many rarefaction representations under a kind of Small Sample Size
CN111783521B (en) Pedestrian re-identification method based on low-rank prior guidance and based on domain invariant information separation
CN107301643B (en) Well-marked target detection method based on robust rarefaction representation Yu Laplce's regular terms
CN106408037A (en) Image recognition method and apparatus
CN110705591A (en) Heterogeneous transfer learning method based on optimal subspace learning
CN112883826B (en) Face cartoon generation method based on learning geometry and texture style migration
CN109325513B (en) Image classification network training method based on massive single-class images
CN110765882A (en) Video tag determination method, device, server and storage medium
CN110751271B (en) Image traceability feature characterization method based on deep neural network
CN114742890A (en) 6D pose estimation dataset migration method based on image content and style decoupling
CN113706407B (en) Infrared and visible light image fusion method based on separation characterization
CN112990340B (en) Self-learning migration method based on feature sharing
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN107184224B (en) Lung nodule diagnosis method based on bimodal extreme learning machine
CN103927533B (en) The intelligent processing method of graph text information in a kind of scanned document for earlier patents
CN112561782A (en) Method for improving reality degree of simulation picture of offshore scene
CN112330639A (en) Significance detection method for color-thermal infrared image
Li et al. Facial age estimation by deep residual decision making
CN110956599A (en) Picture processing method and device, storage medium and electronic device
CN116486495A (en) Attention and generation countermeasure network-based face image privacy protection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination