CN114742890A

CN114742890A - 6D attitude estimation data set migration method based on image content and style decoupling

Info

Publication number: CN114742890A
Application number: CN202210261360.1A
Authority: CN
Inventors: 赵国英; 朱梦婕; 赵万青; 张少博; 彭进业; 彭先霖
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2022-07-12

Abstract

The invention discloses a 6D attitude estimation data set migration method based on image content and style decoupling, which comprises the following steps: the method comprises the following steps: training a pseudo-paired image generation network; step two: and training the autonomously designed image migration network. According to the method, the image content and style representation are decoupled by using the encoder, the image is reconstructed by using the content representation space shared with the source domain data and the style representation space shared with the target domain data, and strong supervision of an object structure and refinement of a key point structure are realized by using the designed object structure feature extractor and key point attention feature extractor during reconstruction. The method can effectively make up the inter-domain gap between the real data and the synthetic data, uses the decoupling representation as input, can effectively improve the mode collapse, and does not increase the complexity of the 6D attitude estimation algorithm while fully utilizing the data of the label-free target domain.

Description

6D attitude estimation data set migration method based on image content and style decoupling

Technical Field

The invention belongs to the field of 6D attitude estimation, and relates to a 6D attitude estimation data set migration method based on image content and style decoupling, which can effectively make up for an inter-domain gap between real data and synthetic data.

Background

The 6D pose estimation task aims at estimating 6 degrees of freedom of a given object relative to the camera, including 3D rotation and 3D translation, which is a fundamental task in computer vision. It is widely applicable to many real-world tasks such as robotic manipulation, augmented reality, and autopilot.

In recent years, with the development of deep neural networks, many 6D attitude estimation algorithms based on convolutional neural networks have been proposed and achieve good performance. However, the convolutional neural network is extremely data-driven, so that a large amount of real data with 3D pose labels is often required for training to obtain a better effect. In fact, the 3D pose labeling of real images is extremely difficult to obtain, but the 3D pose labeling of composite images is easily obtained. However, due to the domain gap between the real data and the synthetic data, the performance of the 6D pose estimation network trained on the synthetic data set may be severely degraded when tested on real images. Therefore, it is gradually noticed by more and more researchers how to reduce the inter-domain gap between the non-labeled real data and the labeled synthetic data.

The image data migration method can be used for 6D posture estimation data set migration and is divided into paired image domain migration and unpaired image domain migration. Although the paired image domain migration has good effects in the aspects of structure preservation and style migration, the pairing conditions are too harsh, and the real data set and the synthetic data set of 6D attitude estimation cannot meet the pairing requirements. Although the unpaired image domain migration has good effects in the fields of target detection, classification and the like, the unpaired image domain migration does not perform well in pixel-level tasks such as 6D attitude estimation due to the fact that an image pair cannot be formed and strong supervision on an object structure is lacked.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a 6D attitude estimation data set migration method based on image content and style decoupling, the method can realize inter-domain migration of unpaired images while strongly supervising the object structure, and a migration network is independently designed for a 6D attitude estimation task, so that inter-domain gaps between real data and synthetic data in a 6D attitude estimation data set are effectively made up.

The technical scheme of the invention comprises the following steps:

A6D pose estimation dataset migration method based on image content and style decoupling, comprising:

step one, training a pseudo-pairing image generation network, namely, training a source domain image I_sAnd a target domain image I_tSending the pseudo-matching image into a generation network for training;

acquiring a domain-invariant content space and a domain-specific style space of cross-domain shared information by using an encoder; secondly, exchanging a domain-invariant content space, fixing a style space specific to the domain, and sending the representation space of the domain into an image generator to generate a pseudo-paired image of the domain; finally, decoupling the pseudo-pairing image and exchanging the domain invariant content space to obtain a reconstructed source domain image

And a reconstructed target domain image

To achieve training on unpaired data;

step two, training an autonomously designed image migration network: the image migration network is based on the trained pseudo-paired image generation network and adopts an object structure feature extractor H^cStyle feature extractor H^sAnd key point attention force feature extractor H^kThe image generator is further refined for structures near object keypoints.

Optionally, the key point attention feature extractor H^kUsing source domain images I_sTraining to obtain a source domain image I_sAttention feature extractor H for sending key points^kObtaining the heat map of the key points, and utilizing the heat map of the key points to extract the structural features of the object^cAnd performing attention weighting on the structural loss of the processed source domain image, wherein the structural loss of the key point is defined as:

l represents loss, keypoint represents key point, and 2 represents L2 loss; h^cA representation structural feature extractor; h^kA representation key point feature extractor; i is_sRepresenting a source domain image; t is a unit of_tA migration image representing a target domain; the circle in the formula represents the Hadamard product; the lower subscript 2 of the double vertical bars represents the two-norm.

Optionally, the total loss function of the image migration network is:

L_total: total loss; λ: a weight; l is_KLLoss of KL;

loss of domain antagonism;

reconstruction loss, with the loss denoted as L1 loss at subscript 1;

structural losses, with the lower subscript 2 indicating losses as L2 losses);

loss of style;

key point structureLoss;

color is lost.

Optionally, the second step specifically includes:

(2.1) rendering the source domain image I_sContent encoder for delivery to source domain

Deriving source domain image content coding

Target domain image I_tStyle encoder for delivery to a target domain

Obtaining target domain image style coding

(2.2) encoding target Domain image styles

And source domain image content encoding

Input target domain image generator G_tIn generating a migration image T of a source domain_t；

(2.3) Using the structural feature extractor H^cExtracting object structural features f^c: structural feature extractor H^cUsing the pre-trained VGG-19 network, the object part T of the migration image T is obtained by using the mask image M^objectThen T is added^objectSending the pre-trained VGG-19 network, and taking out a conv4_2 layer as an object structural feature f^cThe structural loss of an object is defined as:

f^cis a structural feature of the object, lower corner mark T_sTransition image, subscript T, representing source domain image_tMigration image representing target Domain image, subscript I_tRepresenting the target field image, subscript I_sRepresenting a source domain image; the double vertical lines represent norms, and the lower corner mark 2 of the double vertical lines represents a norm;

(2.4) extraction of features by means of the Style^sExtracting object structural features f^s: style feature extractor H^sUsing a pre-trained VGG-19 network which is the same as the structure feature extractor, firstly sending the migration image T into the pre-trained VGG-19 network, and taking out the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers to calculate a gram matrix as the style feature f^sThe weight of each layer is 1, 0.8, 0.5, 0.3 and 0.1 respectively; the style loss is defined as:

f^sindicating style characteristics, lower corner mark T_tMigration image representing target Domain image, subscript I_tRepresenting a target domain image; the double vertical lines represent norms, and the lower corner mark 2 of the double vertical lines represents a norm;

(2.5) attention feature extractor H using key points^kExtracting heat maps of key points of the image: key point attention feature extractor H^kUsing source domain images I_sTraining to obtain the source domain image I_sAttention feature extractor H for feeding key points^kObtaining the heat map of the key points, and utilizing the heat map of the key points to extract the structural features of the object_cAnd performing attention weighting on the structural loss of the processed source domain image, wherein the structural loss of the key point is defined as:

l represents loss, keypoint represents key point, and 2 represents L2 loss; h^cTo representA structural feature extractor; h^kA representation key point feature extractor; I.C. A_sRepresenting a source domain image; t is_tA migration image representing a target domain; the circle in the formula represents the Hadamard product; the lower corner mark 2 of the double vertical lines represents a two-norm;

(2.6) since the inter-domain differences are partly caused by light, decoupling the light when defining the color loss, representing the image from the RGB color model to the LAB color model by ρ, applying the L1 loss to the other two channels after removing the light channel:

l represents loss, color represents color loss, 1 represents L1 loss; ρ represents the image from the RGB color model to the LAB color model; i is_sRepresenting a source domain image; m is a group of_sRepresenting a source domain mask image; t is_tA migration image representing a target domain; the double vertical lines represent norms and the subscript 1 to the double vertical lines represents a norm.

Optionally, the feature extractor H using the key point attention force^kThe heatmap for extracting the key points of the image is specifically as follows:

2.5.1, extracting a feature map of the input image by using a feature pyramid network and ResNet 101;

2.5.2 inputting the extracted features into a Key Point extractor H^kThe network comprises 4 continuous 3 x 3 convolution layers, each layer is followed by a Relu serving as an activation function, the last layer is up-sampled to obtain a feature map with the same size as the picture, and the extracted features are used for generating a pixel-level probability distribution map h by utilizing softmax to represent the probability that the pixel point is the key point.

Optionally, the first step specifically includes:

(1.1) source domain image I_sFeed-to-source-domain style encoder

And source domain content encoder

Obtaining source-domain style coding

And source content encoding

Target domain image I_tFeed target domain style encoder

And a target domain content encoder

Obtaining target domain style coding

And target domain content encoding

(1.2) encoding the Source Domain content

And target domain content encoding

Feed content discriminator D_cContent encoding distinguishing two domains;

(1.3) Source Domain image Style coding

And content encoding of target domain images

Feed source domain image generator G_sGenerating a pseudo-paired image F of a target Domain_s(ii) a Imaging the target domainStyle coding

And content encoding of source domain images

Input target domain image generator G_tPseudo-paired image F of medium-generation source domain_t；

(1.4) pseudo-paired image F of Source Domain_tDestination domain identifier D_tDistinguishing a real image and a generated image of a target domain;

(1.5) pseudo-paired image F of target Domain_sDestination domain identifier D_sDistinguishing a real image of a source domain from a generated image;

(1.6) pseudo-paired image F of target Domain_sStyle coding of

Pseudo-paired image F with source domain_tContent encoding of

Feed source domain image generator G_sTo generate a reconstructed source domain image

Pseudo-paired image F of source domain_tStyle coding of

Pseudo-paired image F with target domain_sContent encoding of

Input target domain image generator G_tTo generate a reconstructed target field image

Optionally, the source content encoder

And target domain image generator G_tThe last layer of (2) shares the weight;

the target domain content encoder

And source domain image generator G_sThe last layer of (2) shares the weight.

Optionally, the method further includes a third step of testing the network:

step 3.1-Source Domain image I_sInput source domain content encoder

Deriving source content coding

Target domain image I_tInput target domain style encoder

Obtaining target domain style coding

Step 3.2 encoding the Source Domain content

And target domain style coding

Input target domain image generator G_tObtaining a migration image T_t。

Compared with the prior art, the invention has the following advantages:

1. the invention provides a method for realizing inter-domain migration of unpaired images while carrying out strong supervision on object structures.

2. A migration network is independently designed for the 6D attitude estimation task, and the inter-domain gap between real data and synthetic data in the 6D attitude estimation data set is effectively made up.

3. The invention uses decoupling representation as input, can effectively improve mode collapse and increase output diversity.

4. The invention solves the problem of inter-domain gaps from the aspect of data generation, and does not increase the complexity of a 6D attitude estimation algorithm while fully utilizing the data of the label-free target domain.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

fig. 1 shows an overall overview of the image content and style decoupling based 6D pose estimation dataset migration method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to embodiments, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the invention:

the domain-invariant content space refers to a space composed of structural features of objects that do not vary with the image domain.

The domain-specific style space means that each image domain has its own style characteristics, and the style characteristics of different domains constitute different style spaces.

The representation space of the domain is divided into a source domain representation space and a target domain representation space. The representation space of the source domain consists of a content space of the source domain and a style space of the source domain, the content space of the source domain consists of content characteristics of the source domain, and the style space of the source domain consists of style characteristics of the source domain; the representation space of the target domain is composed of a content space of the target domain and a style space of the target domain, the content space of the target domain is composed of content features of the target domain, and the style space of the target domain is composed of style features of the target domain.

The source domain image refers to a composite image, and the target domain image refers to a real image.

All the letter meanings related to the corner marks in the invention are uniformly defined as follows:

marking the upper corner: c represents a structure; s represents a style; k represents a key point;

lower corner mark: s represents a source domain; t represents a target domain;

h generally denotes an extractor; g generally denotes an image generator; e generally denotes an encoder; z generally represents a code; i generally represents the original image; t generally represents a migration image; m represents a mask image; l represents a loss; λ represents a weight; the letters, in combination with the upper and lower superscript meanings, denote corresponding content, e.g.

Denoted as source domain style encoder.

With reference to fig. 1, the method for migrating a 6D pose estimation data set for RGB images of the present invention includes:

the method comprises the following steps: training a pseudo-pairing image generation network: source domain image I_sAnd a target domain image I_tSending the pseudo-matching image into a pseudo-matching image generation network;

first acquired with the encoder (source domain image I)_sAnd a target domain image I_t) A domain-invariant content space and a domain-specific style space that share information across domains; and secondly, exchanging the domain-invariant content space and fixing the domain-specific style space, and sending the representation space of the domain into respective image generators (of the source domain and the target domain) to generate a pseudo-paired image of the domain. Finally, decoupling the pseudo-matching image and exchanging the space of the unchanged content of the domain to reconstruct the original source domain image

And target domain image

To enable training of unpaired data. The method uses decoupling representation as input, can effectively improve mode collapse and increase output diversity. The method solves the problem of inter-domain gaps from the aspect of data generation, and does not increase the complexity of a 6D attitude estimation algorithm while fully utilizing the data of the label-free target domain.

Step two: training an autonomously designed image migration network: the image migration network is designed with an object structure feature extractor H on the basis of a trained pseudo-paired image generation network^cStyle feature extractor H^sAnd key point attention feature extractor H^kThe image generator is further tuned. And the tuning content is to refine the structures near the key points of the object through the key point structure loss in the step two. Wherein, the object structure characteristic extractor H^cAnd style feature extractor H^sAnd the image after the source domain image processing is the image after the source domain image migration.

In the first step, the pseudo-pairing image generation network is trained by using input 6D attitude estimation data, and the method comprises the following steps:

step 101: source domain image I_sFeed source domain style encoder

And a content encoder

Obtaining a style code

And content coding

Target domain image I_tStyle encoder for delivery to a target domain

And a content encoder

Obtaining a style code

And content coding

Step 102: encoding source domain image styles

And content encoding of target domain images

Input source domain image generator G_sGenerating a pseudo-paired image F of a target Domain_s(ii) a Encoding target domain image styles

Content coding of source domain images

Input target domain image generator G_tPseudo-paired image F of medium-generation source domain_t。

Step 103: pseudo-paired image F of target domain_sFeed source domain style encoder

And a content encoder

Deriving a style code

And content coding

Pseudo-pairing image F of source domain_tStyle encoder for delivery to a target domain

And a content encoder

Deriving a style code

And content coding

Step 104: pseudo-paired image F of target domain_sStyle coding of

Pseudo-paired image F with source domain_tContent encoding of

Input source domain image generator G_sTo generate a reconstructed source domain image

Pseudo-pairing image F of source domain_tStyle coding of

Pseudo-paired image F with target domain_sContent coding of

Input target domain image generator G_tTo generate a reconstructed target domain image

The training of the image migration network comprises the following steps:

step 201: source domain image I_sContent encoder for delivery to source domain

Obtaining content coding

Target domain image I_tStyle encoder for delivery to a target domain

Obtaining a style code

Step 202: encoding target domain image styles

Content coding of source domain images

Input target domain image generator G_tIn generating a migration image T of a source domain_t。

Step 203: transferring image T of source domain_tSource domain image I_sAnd its mask image M_s(because of the migration of the source domain image T_tContent and target domain image I_sSame, therefore the migration image T of the source domain_tThe mask image is the source field image I_sMask image M_s) Sent into a pre-trained structural feature extractor H^cTo obtain an image T_tStructural features of the object

And source domain image I_sStructural features of the object

Step 204: transferring image T of source domain_tAnd a target domain image I_tSent into a style feature extractor H which is trained in advance^sObtaining an image T_tStyle characteristics of

And a target domain image I_tStyle characteristics of

Step 205: source domain image I_sSending the pre-trained key point attention feature extractor H^kObtaining a source domain image I_sHeat map of

The invention is further elucidated with reference to the drawing.

a) The method is a pseudo-pairing image generation network, and the network training comprises the following steps:

step 301: source domain image I_sFeed-to-source-domain style encoder

And a content encoder

Obtaining a style code

And content coding

Target domain image I_tStyle encoder for delivery to a target domain

And a content encoder

Deriving a style code

And content coding

Step 302: applying KL penalties to style coding encourages style representations to be as close as possible to the previous gaussian distribution:

L_KL＝E[D_KL((Z^s)||N(0,1))]；

wherein:

p represents the true sample distribution, q represents the estimated sample distribution, D_KL(p | | q) represents the distance between p and q; z^sRepresenting a style code and Z representing a code.

Step 303: encoding source content

And target domain content encoding

Feed content discriminator D^cThe content antagonism losses are:

step 304: encoding source domain image styles

And content encoding of target domain images

Feed source domain image generator G_sGenerating a pseudo-paired image F of a target Domain_s(ii) a Encoding target domain image styles

And content encoding of source domain images

Step 305: pseudo-paired image F of source domain_tDestination domain identifier D_tThe target domain resistance loss is:

step 306: pseudo-paired image F of target domain_sFeed source domain discriminator D_sThe loss of source domain antagonism is, the loss of domain antagonism is

Comprises the following steps:

step 307: encoding a style of a pseudo-paired image of a target domain

Content encoding of pseudo-paired images with source domain

Encoding a style of a pseudo-paired image of a source domain

Content encoding of pseudo-paired images with target domain

The reconstruction loss is defined as:

step 308: the overall loss function is:

L_total: total loss; λ: weights (corresponding to the same loss function as the trailing corner markers); l is_KLLoss of KL;

loss of domain antagonism;

reconstruction loss, with the loss denoted as L1 loss at subscript 1;

structural losses, with the lower subscript 2 indicating losses as L2 losses);

loss of style;

loss of key point structure;

color is lost.

(b) The image migration network is designed autonomously, and a source content encoder trained in a pseudo-paired image generation network is used on the basis of the pseudo-paired image generation network

Target domain style encoder

And a target domain image generator G_tAnd (4) forming. Designs a structural feature extractor H^cStyle feature extractor H^sAnd key point attention force feature extractor H^k. The method comprises the following specific steps:

step 401, source domain image I_sContent encoder for delivery to source domain

Obtaining content coding

Target domain image I_tStyle encoder for delivery to a target domain

Deriving a style code

Step 402, target domain image style coding

And content encoding of source domain images

Step 402, transferring the image T of the source domain_tSource domain image I_sAnd its mask image M_sSent into a pre-trained structural feature extractor H^cIn (1). H^cUsing the pre-trained VGG-19 network, the object part T of the migration image T is obtained by using the mask image M^objectThen T is added^objectSending into a pre-trained VGG-19 network, and taking out a conv4_2 layer as an object structural feature f^cThe structural loss of an object is defined as:

f is a feature, the upper corner mark c indicates the structure, the lower corner mark T_sTransition image, subscript T, representing source domain image_tMigration image representing target Domain image, subscript I_tRepresenting the target field image, subscript I_sRepresenting a source domain image; the double vertical lines represent norms, and the lower corner mark 2 of the double vertical lines represents a norm;

step 403, transferring the image T of the source domain_tAnd a target domain image I_tSent into a style feature extractor H which is trained in advance^sIn (1). H^sUsing a pre-trained VGG-19 network which is the same as the structure feature extractor, firstly sending the migration image T into the pre-trained VGG-19 network, and taking out the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers to calculate a gram matrix as the style feature f^sThe weights of each layer are 1, 0.8, 0.5, 0.3 and 0.1, respectively. The style loss is:

f is a feature, the upper corner mark s indicates the style, the lower corner mark T_tMigration image representing target Domain image, subscript I_tRepresenting a target domain image; the double vertical lines represent norms, and the lower corner mark 2 of the double vertical lines represents a norm;

step 404, the source domain image I_sSource domain image I for input_sPre-trained key point attention feature extractor H^kIn the above, obtaining a heat map of the key points, and performing attention weighting on the structural loss of the generated image by using the obtained heat map, the structural loss of the key points is:

in the formula: h^cA representation structural feature extractor; h^kA representation key point feature extractor; i is_sRepresenting a source domainAn image; t is_tA migration image representing a target domain; the circles in the formula represent the Hadamard products, i.e., the products of the corresponding elements of the matrix; the double vertical lines represent norms and the subscript 2 of the double vertical lines represents a two-norm.

Step 405, because the inter-domain difference part is caused by light, we decouple the light when defining the color loss, represent the image from the RGB color model to the LAB color model by rho, and apply the L1 loss to the other two channels after removing the light channel:

l represents loss, color represents color loss, 1 represents L1 loss (a common loss); ρ represents the image from the RGB color model to the LAB color model; i is_sRepresenting a source domain image; m_sRepresenting a source domain mask image; t is a unit of_tA migration image representing a target domain; the double vertical lines represent the norm, and the subscript 1 of the double vertical lines represents a norm.

Step 406, the overall loss function of the image style migration network is:

l total: total loss; λ: weights (corresponding to the same loss function as the back corner markers); such as

λ weight, weight of content versus resistance loss; l is a radical of an alcohol_KLLoss of KL;

loss of domain antagonism;

reconstruction loss (lower corner mark 1 tableLoss shown as L1 loss);

structural losses (lower subscript 1 denotes loss as L2 loss);

loss of style;

loss of key point structure;

color loss.

The extraction step in step 404 is specifically as follows:

and 501, extracting a feature map of the input image by using the feature pyramid network and the ResNet 101.

Step 502, inputting the extracted features into a key point extraction network, wherein the network comprises 4 continuous 3 x 3 convolution layers, each layer is followed by a Relu serving as an activation function, the last layer is subjected to up-sampling to obtain a feature map with the same size as the picture, and the extracted features are used for generating a pixel-level probability distribution map h by utilizing softmax to represent the probability that the pixel points are key points;

(c) the method is a test network and comprises the following specific steps:

step 601, the source domain image I is processed_sInput source content encoder

Obtaining content coding

Target domain image I_tInput target domain style encoder

Deriving a style code

Step 602, encoding content

And style coding

Input target domain image generator G_tObtaining a migration image T_t。

Experimental part:

to demonstrate the effectiveness of the method, tests were performed on the LINEMOD real dataset and the LINEMOD-PBR synthetic dataset. Firstly, inputting RGB images of a real data set and a synthetic data set into a network to obtain an RGB image after the synthetic data set is migrated; then the synthetic image, the migrated image and the label of the synthetic data set are respectively sent into a 6D attitude estimation network to obtain a 6D attitude estimation network model of the synthetic image and a 6D attitude estimation network model of the migrated image; and finally, respectively testing the performances of the two 6D posture estimation network models on the real data set.

Because the data volume of the LINEMOD real data set is small, only one thousand pictures are trained for testing. The 6D posture estimation network uses HRNet estimation key points proposed by the research institute of science and Microsoft Asia in 2019, and then calculates the object posture by one PnP. ADD indexes are tested on eight objects such as Cat, and as shown in the following table 1, the average ADD value of the method is ten percent higher than that of a LINEMOD-PBR synthetic data set, which shows that the method effectively makes up for the inter-domain gap between real data and synthetic data in a 6D attitude estimation data set.

TABLE 1

Model	PBR	Our
			Cat	0.455	0.543
Cam	0.187	0.337
			Phone	0.253	0.389
Iron	0.268	0.340
			Driller	0.617	0.766
Can	0.592	0.727
			Glue	0.211	0.255
Duck	0.139	0.164
			Mean	0.340	0.440

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A6D attitude estimation data set migration method based on image content and style decoupling is characterized by comprising the following steps:

acquiring a domain-invariant content space and a domain-specific style space of cross-domain shared information by using an encoder; secondly, exchanging a domain-invariant content space, fixing a domain-specific style space, and sending the domain representation space into an image generator to generate a domain pseudo-pairing image; finally, decoupling the pseudo-paired images and exchanging the domain-invariant content space to obtain the reconstructed source domain image

And a reconstructed target domain image

To enable training of unpaired data;

step two, training an autonomously designed image migration network: the image migration network is based on the trained pseudo-paired image generation network and adopts an object structure feature extractor H^cStyle feature extractor H^sAnd key point attention feature extractor H^kThe image generator is further refined for structures near object keypoints.

2. The decoupled 6D pose based on image content and style of claim 1The method for migrating the estimation data set is characterized in that the key point attention feature extractor H^kUsing source domain images I_sTraining to obtain the source domain image I_sAttention feature extractor H for feeding key points^kObtaining the heat map of the key points, and utilizing the heat map of the key points to extract the structural features of the object^cAnd performing attention weighting on the structural loss of the processed source domain image, wherein the structural loss of the key point is defined as:

l represents loss, keypoint represents key point, and 2 represents L2 loss; h^cA representation structural feature extractor; h^kA representation key point feature extractor; i is_sRepresenting a source domain image; t is_tA migration image representing a target domain; the circle in the formula represents the Hadamard product; the double vertical bar subscript 2 represents the two-norm.

3. The image content and style decoupling based 6D pose estimation data set migration method according to claim 1 or 2, wherein the total loss function of the image migration network is:

L_total: total loss; λ: a weight; l is_KLLoss of KL;

loss of domain antagonism;

reconstruction loss, with the loss denoted as L1 loss at subscript 1;

structural losses, with the lower subscript 2 indicating losses as L2 losses);

loss of style;

loss of key point structure;

color loss.

4. The image content and style decoupling based 6D pose estimation data set migration method of claim 1 or 2, wherein said step two specifically comprises:

(2.1) source domain image I_sContent encoder for delivery to source domain

Deriving source domain image content coding

Target domain image I_tStyle encoder for delivery to a target domain

Obtaining target domain image style coding

(2.2) encoding target Domain image styles

And source domain image content encoding

Input target domain image generator G_tMigration of medium generation source domainsImage T_t；

(2.4) extraction of features by means of a Style extractor H^sExtracting object structural features f^s: style feature extractor H^sUsing a pre-trained VGG-19 network which is the same as the structure feature extractor, firstly sending the migration image T into the pre-trained VGG-19 network, and taking out the conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 layers to calculate a gram matrix as the style feature f^sThe weight of each layer is 1, 0.8, 0.5, 0.3 and 0.1 respectively; the style loss is defined as:

(2.5) attention feature extractor H using key points^kExtracting heat maps of key points of the image: key point attention feature elevatorGet ware H^kUsing source domain images I_sTraining to obtain the source domain image I_sAttention feature extractor H for feeding key points^kObtaining the heat map of the key points, and utilizing the heat map of the key points to extract the structural features of the object_cAnd performing attention weighting on the structural loss of the processed source domain image, wherein the structural loss of the key point is defined as:

l represents loss, keypoint represents key point, and 2 represents L2 loss; h^cA representation structural feature extractor; h^kA representation key point feature extractor; i is_sRepresenting a source domain image; t is_tA migration image representing a target domain; the circle in the formula represents the Hadamard product; the lower corner mark 2 of the double vertical lines represents a two-norm;

l represents loss, color represents color loss, and 1 represents L1 loss; ρ represents the image from the RGB color model to the LAB color model; i is_sRepresenting a source domain image; m_sRepresenting a source domain mask image; t is_tA migration image representing a target domain; the double vertical lines represent the norm, and the subscript 1 of the double vertical lines represents a norm.

5. The image content and style decoupling based 6D pose estimation data set migration method according to claim 1 or 2, wherein said utilizing is relatedKey point attention force feature extractor H^kThe heatmap for extracting key points of the image is specifically as follows:

6. The image content and style decoupling based 6D pose estimation data set migration method of claim 1 or 2, wherein said step one specifically comprises:

(1.1) source domain image I_sFeed-to-source-domain style encoder

And source domain content encoder

Deriving source-domain style coding

And source content encoding

Target domain image I_tFeed target domain style encoder

And a target domain content encoder

Obtaining target domain style coding

And target domain content encoding

(1.2) encoding the Source Domain content

And target domain content encoding

Feed content discriminator D_cContent encoding distinguishing two domains;

(1.3) Source Domain image Style coding

And content encoding of target domain images

Content coding of source domain images

(1.6) pseudo-paired image F of target Domain_sOf a genre code of Z'_s ^sAnd sourcePseudo-paired image F of domain_tContent encoding of

Pseudo-paired image F of source domain_tStyle coding of

Pseudo-paired image F with target domain_sContent encoding of

7. The image content and style decoupling based 6D pose estimation data set migration method of claim 6, wherein said source domain content encoder

And target domain image generator G_tThe last layer of (2) shares the weight;

the target domain content encoder

And source domain image generator G_sThe last layer of (2) shares the weight.

8. The image content and style decoupling based 6D pose estimation data set migration method of claim 1 or 2, further comprising the step three of testing a network:

step (ii) of3.1 merging source Domain image I_sInput source domain content encoder

Deriving source-domain content coding

Target domain image I_tInput target domain style encoder

Obtaining target domain style coding

Step 3.2 encoding the Source Domain content

And target domain style coding

Input target domain image generator G_tObtaining a migration image T_t。