CN115861614A - Method and device for automatically generating semantic segmentation graph based on down jacket image


Info

Publication number
CN115861614A
CN115861614A (application CN202211520959.9A)
Authority
CN
China
Prior art keywords
semantic segmentation
down jacket
image
jacket image
generator
Prior art date
Legal status
Pending
Application number
CN202211520959.9A
Other languages
Chinese (zh)
Inventor
汤永川
陈镇
何永兴
林城誉
孙凌云
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority: CN202211520959.9A
Publication: CN115861614A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for automatically generating a semantic segmentation graph based on a down jacket image. An inverse mapping encoder is added at the input end of the generator of a pre-trained StyleGAN2, and the w+ latent codes produced by the inverse mapping encoder from a real down jacket image replace the w latent codes obtained by random sampling in the z space. This improves the accuracy of the w+ latent codes that carry the down jacket structure information, so the generator can be reused to produce more faithful down jacket images, and the generative capacity of the generator is increased. On this basis, a pixel-level classifier is added: it performs resolution adjustment, channel adjustment and concatenation on the feature maps output by the intermediate layers of the generator, then decodes the concatenation result to produce a semantic segmentation result. Both the accuracy and the speed of semantic segmentation are thereby improved.

Description

Method and device for automatically generating semantic segmentation graph based on down jacket image
Technical Field
The invention belongs to the technical field of down jacket semantic segmentation, and particularly relates to a method and a device for automatically generating a semantic segmentation map based on a down jacket image.
Background
A down jacket is a jacket filled with down, giving it a full, rounded appearance. Duck down usually accounts for more than half of the filling, mixed with some fine feathers; the down is cleaned and sterilized at high temperature before being filled into the garment, which gives the down jacket excellent warmth retention.
The semantic segmentation map of an image contains the category information of the objects in the image and carries rich high-level knowledge. Therefore, in the design process of down jackets, image translation needs to be carried out according to semantic segmentation maps to generate various down jacket images. For example, in the document "Park T, Liu M Y, Wang T C, et al. Semantic image synthesis with spatially-adaptive normalization [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 2337-2346", the SPADE module attaches two convolution modules to the semantic segmentation map to obtain the two parameters required by spatially-adaptive instance normalization, and uses them as the pixel-level scale and shift applied after normalizing the intermediate feature map, thereby retaining the content information of the semantic segmentation map. Likewise, in the document "Zhu P, Abdal R, Qin Y, et al. SEAN: Image synthesis with semantic region-adaptive normalization [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5104-5113", SEAN extends the SPADE module, which can only extract one style for the whole image: SEAN generates the two parameters for each label and extracts one style per label, so that the region of a specific label can be edited and style-transferred. These tasks require a large number of semantic segmentation maps, but manually annotating images with corresponding semantic segmentation maps is very time-consuming; labeling an image containing 15 categories takes 5 minutes on average.
In the process of generating a semantic segmentation map, a corresponding semantic segmentation map is obtained from an input down jacket image. The current mainstream approach is to train a semantic segmentation network such as DeepLabv3, but the output of such a network depends on a high-quality training data set; otherwise the result is often unsatisfactory.
Disclosure of Invention
In view of the above, the present invention provides a method for automatically generating a semantic segmentation map based on a down jacket image, which can accurately obtain a large number of semantic segmentation maps and is convenient for subsequent applications.
In order to achieve the above object, the method for automatically generating a semantic segmentation map based on a down jacket image provided by the embodiment of the invention comprises the following steps:
acquiring a real down jacket image, and pre-training StyleGAN2 by using the real down jacket image to obtain the pre-trained StyleGAN2;
constructing a first training architecture, wherein the first training architecture comprises an inverse mapping encoder and a pre-trained generator, the inverse mapping encoder is used for encoding a real down jacket image into a w+ latent code, and the pre-trained generator obtains a generated down jacket image according to the w+ latent code;
after the inverse mapping encoder in the first training architecture is trained, generating down jacket images based on real down jacket images by using the trained inverse mapping encoder and generator, and performing semantic annotation on the generated down jacket images to obtain semantic segmentation maps, wherein a real down jacket image and its corresponding semantic segmentation map form a segmentation sample;
constructing a second training architecture, wherein the second training architecture comprises the trained inverse mapping encoder, the generator and a pixel-level classifier, and the pixel-level classifier is used for performing resolution adjustment, channel adjustment and concatenation on the feature maps output by the intermediate layers of the generator and then decoding the concatenation result to generate a semantic segmentation result;
training the pixel-level classifier in the second training architecture by using the segmentation samples, and forming a semantic segmentation model from the trained inverse mapping encoder, the generator and the pixel-level classifier;
and performing semantic segmentation on the input down jacket image by using a semantic segmentation model.
Preferably, the inverse mapping encoder comprises a partitioning submodule, an embedding submodule, a concatenation submodule, a Transformer encoding submodule and a mapping submodule, wherein the partitioning submodule is used for dividing the down jacket image into a plurality of blocks, the embedding submodule is used for performing embedded encoding on each image block, the concatenation submodule is used for concatenating the embedding vector with the positional information of the corresponding image block, the Transformer encoding submodule performs joint encoding on all concatenation results, and the mapping submodule is used for mapping the joint encoding result to obtain the w+ latent code.
Preferably, the mapping submodule comprises at least 2 MLPs.
Preferably, the embedding submodule performs embedded encoding on the image block in a linear projection manner.
Preferably, the Transformer encoding submodule comprises a plurality of units, each unit comprising a first normalization layer, a multi-head attention layer, a second normalization layer and a multilayer perceptron which are sequentially connected. The first normalization layer processes the input concatenation result; the multi-head attention layer performs multi-head attention computation on the output of the first normalization layer to obtain attention weights; the attention weights are combined with the input concatenation result and fed into the second normalization layer; the output of the second normalization layer is fed into the multilayer perceptron; and the multilayer perceptron produces the output after a mapping computation.
Preferably, the pixel-level classifier comprises an encoding part and a decoding part. The encoding part comprises a plurality of parallel branches and a concatenation operation; each branch is connected to an AdaIN module in the generator of StyleGAN2 and performs resolution adjustment and channel adjustment on the feature map output by that AdaIN module, and the concatenation operation concatenates the outputs of all branches along the channel dimension to obtain the concatenation result. The decoding part adopts a fully convolutional network with a residual structure.
Preferably, each branch comprises an upsampling operation and a convolution operation; resolution adjustment is achieved by the upsampling operation, and channel adjustment is achieved by the convolution operation.
Preferably, when the inverse mapping encoder in the first training architecture is trained, the adopted loss function Loss_transformer is:
Loss1 = ||x - G(E(x))||_2
Loss2 = ||Q(x) - Q(G(E(x)))||_2
Loss_transformer = α·Loss1 + β·Loss2
where Loss1 is the image reconstruction loss, x is the real down jacket image, E(·) is the inverse mapping encoder, G(·) is the StyleGAN2 generator, ||·||_2 is the L2 norm, Loss2 is the image perceptual similarity loss, Q(·) is a perceptual feature extractor implemented with a VGG network, and α and β are weighting parameters with α + β = 1;
and when the classifier of the pixel level in the second training framework is trained, a multi-classification cross entropy loss function is adopted.
Preferably, after semantic segmentation is performed on the input down jacket image by the semantic segmentation model, the semantic segmentation result is visually presented.
In order to achieve the above object, an embodiment of the present invention further provides an apparatus for automatically generating a semantic segmentation map based on a down jacket image, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the memory stores a semantic segmentation model constructed by the method for automatically generating a semantic segmentation map based on a down jacket image, and the processor executes the computer program to implement the following steps:
acquiring a down jacket image;
calling the semantic segmentation model to perform semantic segmentation on the down jacket image, which comprises: encoding the down jacket image with the inverse mapping encoder to obtain a w+ latent code, inputting the w+ latent code into the generator, using the generator to produce a plurality of feature maps based on the w+ latent code, and using the pixel-level classifier to compute a semantic segmentation result from the feature maps produced by the generator.
Compared with the prior art, the invention has the beneficial effects that at least:
an inverse mapping encoder is added at the input end of the generator of the pre-trained StyleGAN2, and the w+ latent codes obtained by the inverse mapping encoder from real down jacket images replace the w latent codes obtained by random sampling in the z space. This improves the accuracy of the w+ latent codes that carry the down jacket structure information, so the generator can be reused to obtain more faithful down jacket images, while the generative capacity of the generator is increased. On this basis, a pixel-level classifier is added: it performs resolution adjustment, channel adjustment and concatenation on the feature maps output by the intermediate layers of the generator, then decodes the concatenation result to generate a semantic segmentation result. Both the accuracy and the speed of semantic segmentation are thereby improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow diagram of an embodiment providing a method for automatically generating a semantic segmentation map based on a down jacket image;
FIG. 2 is a schematic structural diagram of StyleGAN2 provided in the examples;
FIG. 3 is a schematic structural diagram of a first training architecture provided by the embodiment;
FIG. 4 is a schematic structural diagram of an inverse mapping encoder according to an embodiment;
FIG. 5 is a schematic structural diagram of another inverse mapping encoder according to an embodiment;
FIG. 6 is a schematic structural diagram of a transform coding sub-module provided by an embodiment;
fig. 7 is a schematic structural diagram of a second training architecture provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
FIG. 1 is a flow diagram of an embodiment providing a method for automatically generating a semantic segmentation map based on a down jacket image. As shown in fig. 1, the method for automatically generating a semantic segmentation graph provided by the embodiment includes the following steps:
step 1, acquiring a real down jacket image, and pre-training the StyleGAN2 by using the real down jacket image to obtain the pre-trained StyleGAN2.
A total of 250,000 down jacket images were downloaded from the Internet or photographed, then cleaned and preprocessed. The core operations include: removing down jacket images containing people with the OpenPose algorithm, extracting the foreground with Photoshop batch processing, manually screening and classifying the front-view down jacket images, and padding the images to squares and resizing them to 1024 by 1024 pixels with OpenCV's resize function. In the end, 30,000 high-quality down jacket images were obtained.
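The pad-to-square and resize step described above can be sketched as follows. This is a minimal NumPy illustration; the helper names are ours, and the actual pipeline uses Photoshop batch processing and OpenCV's resize function rather than this code.

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad an H x W x C image with a constant border so H == W (content centered)."""
    h, w, c = img.shape
    side = max(h, w)
    out = np.full((side, side, c), fill, dtype=img.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C image to size x size."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

# a 300 x 200 "photo" becomes a centred 300 x 300 square, then 1024 x 1024
photo = np.zeros((300, 200, 3), dtype=np.uint8)
square = pad_to_square(photo)
final = resize_nearest(square, 1024)
```

In practice OpenCV's `cv2.resize` with bilinear or area interpolation would replace `resize_nearest`; the padding step is what keeps garments undistorted.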
As shown in fig. 2, StyleGAN2 comprises a mapping network, a generator and a discriminator, and is trained to generate realistic down jacket images from randomly generated vectors. The mapping network performs w-space mapping on a vector randomly sampled in the z space to produce a w-space latent vector; the generator produces a down jacket image from the w-space latent vector, where the input latent vector is fed into the network layers after an affine transformation (A); and the discriminator distinguishes the generated down jacket images from real down jacket images. StyleGAN2 is pre-trained with real down jacket images as samples, and the loss function Loss is:
Loss = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 - D(G(F(z))))]
where x is a real down jacket image, P_data represents the real image distribution, x~P_data means that x obeys the real image distribution P_data, z represents a random vector in the z space, F(z) is the w-space latent vector generated by the mapping network from z, G(·) is the down jacket image produced by the generator, P_z represents the random noise distribution, z~P_z means that z obeys the random noise distribution P_z, D(·) is the discrimination result of the discriminator, and E denotes expectation.
The generator contains AdaIN modules. AdaIN is a normalization layer used for fast neural style transfer, similar to a Batch Normalization (BN) layer; its purpose is to scale and shift the output of an intermediate network layer so as to realize a specific style transformation. The pre-trained StyleGAN2 can generate a down jacket image from any z-space vector.
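The AdaIN operation described above can be sketched as follows. This is a minimal NumPy illustration of per-channel instance normalization followed by a style-dependent scale and shift; in StyleGAN2 itself the scale and shift come from the affine-transformed latent vector.

```python
import numpy as np

def adain(x: np.ndarray, scale: np.ndarray, shift: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive instance normalization on a C x H x W feature map:
    normalize each channel to zero mean / unit variance, then apply a
    per-channel style scale and shift."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return scale[:, None, None] * (x - mu) / (sigma + eps) + shift[:, None, None]

# a 4-channel feature map restyled so each channel has mean 0.5 and std ~2
feat = np.random.default_rng(0).normal(size=(4, 8, 8))
styled = adain(feat, scale=np.full(4, 2.0), shift=np.full(4, 0.5))
```

After the operation every channel's statistics are dictated by the style parameters, which is exactly why AdaIN can impose a style on the intermediate features.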
And 2, constructing a first training framework, wherein the first training framework comprises an inverse mapping encoder and a pre-training generator.
In the embodiment, the original StyleGAN2 can only generate random images. In order to control the generation of a specific down jacket image, an inverse mapping encoder is added on the basis of the pre-trained StyleGAN2. Specifically, as shown in fig. 3, the mapping network and the discriminator of StyleGAN2 are removed, and an inverse mapping encoder is added at the input end of the pre-trained generator to form the first training architecture. In the first training architecture, the inverse mapping encoder encodes a real down jacket image into a w+ latent code, and the pre-trained generator obtains the generated down jacket image from the w+ latent code.
In the embodiment, as shown in fig. 4, the inverse mapping encoder comprises a partitioning submodule, an embedding submodule, a concatenation submodule, a Transformer encoding submodule and a mapping submodule. The partitioning submodule divides the down jacket image into a plurality of blocks; the embedding submodule performs embedded encoding on each image block; the concatenation submodule concatenates the embedding vector with the positional information of the corresponding image block; the Transformer encoding submodule performs joint encoding on all concatenation results; and the mapping submodule maps the joint encoding result to obtain the w+ latent code. The embedding submodule performs embedded encoding on the image blocks by linear projection. The mapping submodule may employ at least 2 MLPs, e.g., 3 MLPs.
In one embodiment, the inverse mapping encoder uses a Vision Transformer to encode the image and obtain the w+ latent code. As shown in fig. 5, the partitioning submodule first divides the original down jacket image into 9 equal image blocks. These blocks cannot be fed to the Transformer encoding submodule directly; they must first be embedded and concatenated. Each image block is embedded by the embedding submodule through linear projection of flattened patches, and the concatenation submodule combines the embedding with the position of the image block in the original down jacket image; the results are fed to the Transformer encoding submodule in order. The Transformer encoding submodule consists of 5 Transformer blocks and jointly encodes the input concatenation results; the output is fed into a mapping submodule consisting of 3 MLP layers, whose mapping computation yields the w+ latent code. This w+ latent code replaces the w latent code obtained by random sampling in the z space, and training improves the accuracy with which the w+ latent code captures the down jacket structure information.
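The patch partitioning and linear-projection embedding described above can be sketched as follows, assuming a 3 x 3 grid that yields the 9 image blocks mentioned in the text. The projection matrix here is random for illustration; in the actual encoder it is a learned parameter, and positional information is attached before the tokens enter the Transformer blocks.

```python
import numpy as np

def partition_patches(img: np.ndarray, grid: int = 3) -> np.ndarray:
    """Split a square H x W x C image into grid*grid equal flattened patches."""
    h, w, c = img.shape
    ph, pw = h // grid, w // grid
    return (img.reshape(grid, ph, grid, pw, c)
               .transpose(0, 2, 1, 3, 4)        # (row-block, col-block, ph, pw, c)
               .reshape(grid * grid, ph * pw * c))

def embed_patches(patches: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Linear projection of flattened patches (the ViT-style embedding)."""
    return patches @ proj

rng = np.random.default_rng(1)
img = rng.normal(size=(96, 96, 3))
patches = partition_patches(img)                          # 9 patches of 32*32*3 = 3072 values
tokens = embed_patches(patches, rng.normal(size=(3072, 64)))  # 9 tokens of dimension 64
```

Each row of `tokens` is one patch embedding, the sequence that the 5 Transformer blocks then jointly encode.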
In the embodiment, as shown in fig. 6, the Transformer encoding submodule comprises a plurality of units, each unit comprising a first normalization layer (Norm), a multi-head attention layer (Multi-Head Attention), a second normalization layer (Norm) and a multilayer perceptron (MLP) connected in sequence. The first normalization layer processes the input concatenation result; the multi-head attention layer performs multi-head attention computation on the output of the first normalization layer to obtain attention weights; the attention weights are combined with the input concatenation result and fed into the second normalization layer; the output of the second normalization layer is fed into the multilayer perceptron; and the multilayer perceptron produces the output after a mapping computation.
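A single unit of this kind can be sketched as follows. This is a simplified, single-head NumPy illustration of the Norm, attention and residual pattern described above; a real multi-head attention layer additionally uses learned query/key/value projections and several parallel heads.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-token normalization (the Norm layers of the unit)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Scaled dot-product attention, the core of a multi-head attention layer."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))   # (tokens, tokens) attention weights
    return weights @ v, weights

rng = np.random.default_rng(2)
tokens = rng.normal(size=(9, 64))           # 9 patch tokens entering the unit
normed = layer_norm(tokens)                 # first Norm layer
attended, w = attention(normed, normed, normed)
out = tokens + attended                     # combine attention output with the input
```

The second Norm layer and the MLP would then process `out` in the same residual fashion.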
And 3, constructing a segmentation sample after training the inverse mapping encoder in the first training framework.
In implementation, the loss function Loss_transformer adopted when training the inverse mapping encoder in the first training architecture is:
Loss1 = ||x - G(E(x))||_2
Loss2 = ||Q(x) - Q(G(E(x)))||_2
Loss_transformer = α·Loss1 + β·Loss2
where Loss1 is the image reconstruction loss, x is the real down jacket image, E(·) is the inverse mapping encoder, G(·) is the StyleGAN2 generator, ||·||_2 is the L2 norm, Loss2 is the image perceptual similarity loss, Q(·) is a perceptual feature extractor implemented with a VGG network, and α and β are weighting parameters with α + β = 1.
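The combined loss above can be sketched numerically as follows, with placeholder arrays standing in for the real image x, its reconstruction G(E(x)), and the VGG features that Q(·) would produce.

```python
import numpy as np

def l2(a: np.ndarray, b: np.ndarray) -> float:
    """L2 norm of the difference, as used by both Loss1 and Loss2."""
    return float(np.linalg.norm(a - b))

def loss_transformer(x, gx, qx, qgx, alpha: float = 0.5, beta: float = 0.5) -> float:
    """alpha * reconstruction loss + beta * perceptual loss, with alpha + beta = 1.
    x  : real image             gx  : G(E(x)), the reconstruction
    qx : Q(x) features          qgx : Q(G(E(x))) features"""
    assert abs(alpha + beta - 1.0) < 1e-9
    loss1 = l2(x, gx)      # image reconstruction loss
    loss2 = l2(qx, qgx)    # perceptual similarity loss
    return alpha * loss1 + beta * loss2

# toy values: ||x - gx||_2 = 4 and ||qx - qgx||_2 = 6
x = np.zeros((4, 4)); gx = np.ones((4, 4))
qx = np.zeros(9); qgx = np.full(9, 2.0)
total = loss_transformer(x, gx, qx, qgx)   # 0.5*4 + 0.5*6 = 5.0
```

In the actual method `qx` and `qgx` would be VGG feature maps and the gradient would flow only into the encoder E, since the generator stays frozen.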
After training, the inverse mapping encoder can produce w+ latent codes containing more accurate down jacket structure information. Generated down jacket images are then obtained from real down jacket images with the trained inverse mapping encoder and generator, and semantic annotation is performed on the generated down jacket images to obtain semantic segmentation maps; a real down jacket image and its corresponding semantic segmentation map form a segmentation sample.
In the embodiment, a small number of the down jacket images produced by the generator are manually annotated into 20 categories: background, left front panel, right front panel, left side panel, right side panel, left sleeve, right sleeve, front center, hood, collar, left shoulder panel, right shoulder panel, hem, placket, waistband, cuffs, pockets, logo, lining, and other decorations. The category labels are numbered 1 to 20, and the semantic segmentation map is formed through this semantic annotation. The generated semantic segmentation maps and the real down jacket images form the segmentation samples used to train the pixel-level classifier.
And 4, constructing a second training framework, wherein the second training framework comprises a trained inverse mapping encoder and generator and a pixel-level classifier.
In the embodiment, the second training architecture is constructed by adding a pixel-level classifier on the basis of the first training architecture. The second training architecture comprises the trained inverse mapping encoder, the generator and the newly added pixel-level classifier. The inverse mapping encoder generates a w+ latent code from the input real down jacket image; the generator computes from the w+ latent code and outputs the feature map of each AdaIN layer; and the pixel-level classifier performs resolution adjustment, channel adjustment and concatenation on the feature maps output by the AdaIN layers, then decodes the concatenation result to generate the semantic segmentation result.
In the embodiment, as shown in fig. 7, the pixel-level classifier comprises an encoding part and a decoding part. The encoding part comprises a plurality of parallel branches and a concatenation operation; each branch is connected to an AdaIN module in the generator of StyleGAN2 and performs resolution adjustment and channel adjustment on the feature map output by that AdaIN module, and the concatenation operation concatenates the outputs of all branches along the channel dimension to obtain the concatenation result. The decoding part adopts a fully convolutional network with a residual structure.
Specifically, as shown in fig. 7, each branch comprises an upsampling operation (Up sample), which implements the resolution adjustment, and a convolution operation (conv), which implements the channel adjustment. The upsampling operation adjusts the input feature map to the same resolution as the generator's input image, and the convolution operation reduces the number of channels to half of the original. If the input image has a resolution of 512 by 512, then 16 feature maps are obtained after the upsampling and convolution operations; connecting them along the channel dimension yields a 512 by 512 by 3008 feature tensor, in which each pixel is represented by a 3008-dimensional vector. The 3008-dimensional vector of each pixel is then sent to the decoding part, a fully convolutional network whose residual connections link its intermediate layers. It outputs a 512 by 512 by 20 segmentation map from the 3008-dimensional vectors, representing the probability that each pixel belongs to each of the 20 labels; the label with the highest probability is the label of that pixel in the down jacket image.
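The per-branch resolution/channel adjustment and the channel-wise concatenation can be sketched as follows. Nearest-neighbour upsampling and a random matrix stand in for the learned upsampling and 1 x 1 convolution, and only two intermediate feature maps are shown instead of the 16 AdaIN outputs used by the real classifier.

```python
import numpy as np

def upsample_nearest(feat: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a C x H x W feature map to C x size x size."""
    c, h, w = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[:, rows][:, :, cols]

def halve_channels(feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Stand-in for the 1x1 convolution that reduces the channel count to half."""
    c, h, w = feat.shape
    return (weight @ feat.reshape(c, -1)).reshape(c // 2, h, w)

rng = np.random.default_rng(3)
# two intermediate feature maps at different resolutions / channel counts
feats = [rng.normal(size=(64, 16, 16)), rng.normal(size=(32, 32, 32))]
branches = []
for f in feats:
    up = upsample_nearest(f, 64)   # bring every branch to the target resolution
    branches.append(halve_channels(up, rng.normal(size=(f.shape[0] // 2, f.shape[0]))))
stacked = np.concatenate(branches, axis=0)   # concatenate along the channel axis
```

After concatenation every spatial position carries one long per-pixel vector, which is what the residual fully convolutional decoder classifies.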
And 5, training the classifier at the pixel level in the second training framework by using the segmentation samples, and then constructing a semantic segmentation model.
In the embodiment, the second training architecture is trained with the small number of constructed segmentation samples. During training only the pixel-level classifier is optimized, so that it classifies each pixel in the image; the loss function used is the multi-class cross-entropy loss. Thus, with only a small amount of manual work, a working pixel-level classifier can be trained for generating the semantic segmentation map corresponding to a down jacket image.
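The multi-class cross-entropy loss used here can be sketched per pixel as follows; the logits and labels are toy values, with each row standing for one pixel's 20-way (here 4-way) classifier scores.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Multi-class cross entropy averaged over pixels.
    logits : (pixels, classes) raw classifier scores
    labels : (pixels,) integer class labels"""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# three pixels, four classes; uniform logits give a loss of ln(4)
logits = np.zeros((3, 4))
labels = np.array([0, 1, 3])
loss = cross_entropy(logits, labels)
```

In a framework such as PyTorch this would be a single library call over the 512 x 512 x 20 output, but the computation per pixel is exactly the one shown.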
After training, the trained inverse mapping encoder, the generator and the pixel-level classifier form the semantic segmentation model, which can generate the corresponding semantic segmentation map from an input image. Moreover, if the inverse mapping encoder is removed, randomly sampling z together with the generator and the pixel-level classifier forms a model that can generate an endless stream of down jacket images with corresponding segmentation maps.
And 6, performing semantic segmentation on the input down jacket image by using a semantic segmentation model.
In the embodiment, when the semantic segmentation model performs semantic segmentation on an input down jacket image, the inverse mapping encoder first encodes the down jacket image to obtain a w+ latent code; the w+ latent code is input into the generator, which produces a plurality of feature maps based on it; and the pixel-level classifier computes the semantic segmentation result from the feature maps produced by the generator.
In the embodiment, a semantic annotation tool is also designed. After the semantic segmentation model performs semantic segmentation on the input down jacket image, the semantic annotation tool visually presents the semantic segmentation result. In the visual presentation, pixels of the same semantic category are shown in one color and different semantic categories are distinguished by different colors, so semantic annotation is carried out automatically; for wrongly annotated regions, the brush provided by the annotation tool can be used for manual correction, which greatly reduces the workload of annotators.
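The per-category colorization behind this visual presentation can be sketched as follows; the palette here is arbitrary for illustration, whereas the annotation tool would use one fixed color per category (20 garment categories plus background).

```python
import numpy as np

def colorize(label_map: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map each integer label in an H x W label map to an RGB color,
    so every semantic category is rendered in one consistent color."""
    return palette[label_map]

# one RGB color per label (index 0 = background, 1..20 = garment categories)
rng = np.random.default_rng(4)
palette = rng.integers(0, 256, size=(21, 3), dtype=np.uint8)
labels = np.array([[0, 1],
                   [1, 20]])
rgb = colorize(labels, palette)   # a 2 x 2 x 3 color image
```

NumPy's integer array indexing does the whole lookup in one step, so a full 512 x 512 label map colorizes without an explicit loop.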
Based on the same inventive concept, an embodiment further provides an apparatus for automatically generating a semantic segmentation map based on a down jacket image, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the memory stores a semantic segmentation model constructed by the method for automatically generating a semantic segmentation map based on a down jacket image, and the processor executes the computer program to implement the following steps:
step 1, acquiring a down jacket image;
step 2, calling the semantic segmentation model to perform semantic segmentation on the down jacket image, which comprises: encoding the down jacket image with the inverse mapping encoder to obtain a w+ latent code, inputting the w+ latent code into the generator, using the generator to produce a plurality of feature maps based on the w+ latent code, and using the pixel-level classifier to compute a semantic segmentation result from the feature maps produced by the generator.
Compared with a conventionally trained semantic segmentation network, the method and the device for automatically generating a semantic segmentation map based on a down jacket image provided by the embodiments obtain higher-quality semantic segmentation maps by placing an inverse mapping encoder in front of StyleGAN2 and attaching a pixel-level classifier. When training the classifier, only a small amount of manual work is needed to train the pixel-level classifier successfully, and the trained semantic segmentation model can generate the semantic segmentation map corresponding to an input down jacket image for subsequent learning of other tasks. Moreover, if the inverse mapping encoder module is removed, randomly sampling z together with the generator and the pixel-level classifier forms a model that can generate an endless stream of down jacket images with corresponding segmentation maps.
In practical applications, the memory may be a local volatile memory such as RAM, a non-volatile memory such as ROM, FLASH, a floppy disk or a mechanical hard disk, or remote cloud storage. The processor may be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a graphics processing unit (GPU) or a neural network processor (NPU); that is, the steps of the method for automatically generating the semantic segmentation map based on the down jacket image may be implemented by the processor.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and do not limit the invention; any modifications, additions, equivalents, and the like made within the scope of the principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for automatically generating a semantic segmentation map based on a down jacket image, characterized by comprising the following steps:
acquiring real down jacket images, and pre-training StyleGAN2 with the real down jacket images to obtain a pre-trained StyleGAN2;
constructing a first training framework, wherein the first training framework comprises an inverse mapping encoder and the pre-trained generator, the inverse mapping encoder encodes a real down jacket image into a w+ space latent code, and the pre-trained generator obtains a generated down jacket image from the w+ space latent code;
after training the inverse mapping encoder in the first training framework, generating down jacket images from real down jacket images with the trained inverse mapping encoder and the generator, and semantically annotating the generated down jacket images to obtain semantic segmentation maps, wherein each real down jacket image and its corresponding semantic segmentation map form a segmentation sample;
constructing a second training framework, wherein the second training framework comprises the trained inverse mapping encoder, the generator, and a pixel-level classifier, and the pixel-level classifier performs resolution adjustment, channel adjustment, and splicing on the feature maps output by the intermediate layers of the generator and then decodes the splicing result to generate a semantic segmentation result;
training the pixel-level classifier in the second training framework with the segmentation samples, wherein the trained inverse mapping encoder, the generator, and the trained pixel-level classifier form a semantic segmentation model;
and performing semantic segmentation on an input down jacket image by using the semantic segmentation model.
2. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 1, wherein the inverse mapping encoder comprises a partitioning sub-module, an embedding sub-module, a splicing sub-module, a Transformer encoding sub-module, and a mapping sub-module, wherein the partitioning sub-module divides the down jacket image into a plurality of image blocks, the embedding sub-module embeds and encodes each image block, the splicing sub-module splices each embedded encoding vector with the pixels at the corresponding position of its image block, the Transformer encoding sub-module jointly encodes all splicing results, and the mapping sub-module maps the joint encoding result to obtain the w+ space latent code.
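For illustration only, the partitioning, embedding, and splicing sub-modules of claim 2 can be sketched as below. The patch size, channel count, and embedding dimension are assumptions, not values from the patent.

```python
# Hypothetical sketch: split the image into non-overlapping patches
# (partitioning), linearly project each patch (embedding), and concatenate
# each embedding with its raw patch pixels (splicing).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=512):
        super().__init__()
        self.patch = patch
        # Embedding sub-module: linear projection of flattened patches
        self.proj = nn.Linear(patch * patch * in_ch, dim)

    def forward(self, x):                                    # x: (N, C, H, W)
        n, c, h, w = x.shape
        p = self.patch
        # Partitioning sub-module: non-overlapping p x p blocks
        patches = x.unfold(2, p, p).unfold(3, p, p)          # (N, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(n, -1, c * p * p)
        emb = self.proj(patches)                             # (N, L, dim)
        # Splicing sub-module: embedding vector + raw patch pixels
        return torch.cat([emb, patches], dim=-1)             # (N, L, dim + c*p*p)
```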
3. The method for automatically generating semantic segmentation maps based on down jacket images according to claim 2, characterized in that the mapping sub-module comprises at least 2 MLPs.
4. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 2, wherein the embedding sub-module embeds and encodes the image blocks by linear projection.
5. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 1, wherein the Transformer encoding sub-module comprises a plurality of units, each unit comprising a first normalization layer, a multi-head attention layer, a second normalization layer, and a multi-layer perceptron connected in sequence; after the first normalization layer processes the input splicing result, the multi-head attention layer performs multi-head attention calculation on the output of the first normalization layer to obtain attention weights; the attention weights and the input splicing result are weighted and summed and then input into the second normalization layer, whose output is input into the multi-layer perceptron, and the multi-layer perceptron outputs the result after mapping calculation.
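The unit described in claim 5 follows the standard pre-norm Transformer block; a non-normative sketch is given below, with the embedding dimension, head count, and MLP width chosen for illustration.

```python
# Hypothetical sketch of one Transformer encoding unit: first norm ->
# multi-head attention -> residual with the input -> second norm -> MLP
# -> residual, as described in claim 5.
import torch
import torch.nn as nn

class TransformerUnit(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                  # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                  # second normalization layer
        self.mlp = nn.Sequential(                       # multi-layer perceptron
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                               # x: (N, L, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                # multi-head attention
        x = x + attn_out                                # residual with input
        return x + self.mlp(self.norm2(x))              # MLP branch + residual
```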
6. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 1, wherein the pixel-level classifier comprises an encoding part and a decoding part; the encoding part comprises a plurality of parallel branches and a splicing operation, each branch being connected to an AdaIN module in the generator of StyleGAN2 and performing resolution adjustment and channel adjustment on the feature map output by the AdaIN module; the splicing operation splices the outputs of the branches along the channel dimension to obtain a splicing result; and the decoding part adopts a fully convolutional network with a residual structure.
7. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 6, wherein each branch comprises an up-sampling operation for realizing the resolution adjustment and a convolution operation for realizing the channel adjustment.
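As a non-normative illustration of claims 6 and 7, the encoding part of the pixel-level classifier can be sketched as below: each branch upsamples one intermediate feature map to a common resolution and adjusts its channel count with a 1x1 convolution, and the branch outputs are spliced along the channel dimension. The channel counts and target resolution are assumptions.

```python
# Hypothetical sketch of the branch-and-splice encoding part of the
# pixel-level classifier (decoding part omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, in_channels, out_ch=32, size=256):
        super().__init__()
        self.size = size
        # one 1x1 convolution per branch for channel adjustment
        self.convs = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)

    def forward(self, feats):                           # list of (N, C_i, H_i, W_i)
        outs = []
        for conv, f in zip(self.convs, feats):
            f = F.interpolate(f, size=(self.size, self.size),
                              mode='bilinear', align_corners=False)  # resolution
            outs.append(conv(f))                        # channel adjustment
        return torch.cat(outs, dim=1)                   # splice along channels
```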
8. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 1, wherein the loss function Loss_transformer adopted when training the inverse mapping encoder in the first training framework is:

Loss1 = ||x - G(E(x))||_2

Loss2 = ||Q(x) - Q(G(E(x)))||_2

Loss_transformer = α·Loss1 + β·Loss2

wherein Loss1 denotes the image reconstruction loss, x denotes the real down jacket image, E(·) denotes the inverse mapping encoder, G(·) denotes the generator of StyleGAN2, ||·||_2 denotes the L2 norm, Loss2 denotes the image perceptual similarity loss, Q(·) denotes the perceptual feature extractor, for which a VGG16 network is adopted, and α and β are adjusting parameters with α + β = 1;

and a multi-class cross-entropy loss function is adopted when training the pixel-level classifier in the second training framework.
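A minimal sketch of the encoder training loss of claim 8 is given below, using a mean-squared form of the two L2 terms for convenience (the patent writes plain L2 norms). Here `encoder`, `generator`, and `perceptual` stand for E, G, and Q; their interfaces are assumptions.

```python
# Hypothetical sketch: weighted sum of reconstruction and perceptual terms,
# with alpha + beta = 1 as in claim 8.
import torch

def transformer_loss(x, encoder, generator, perceptual, alpha=0.5, beta=0.5):
    x_hat = generator(encoder(x))                       # G(E(x))
    loss1 = torch.mean((x - x_hat) ** 2)                # reconstruction term
    loss2 = torch.mean((perceptual(x) - perceptual(x_hat)) ** 2)  # perceptual term
    return alpha * loss1 + beta * loss2
```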
9. The method for automatically generating the semantic segmentation map based on the down jacket image according to claim 1, wherein after performing semantic segmentation on the input down jacket image by using the semantic segmentation model, the semantic segmentation result is presented visually.
10. An apparatus for automatically generating a semantic segmentation map based on a down jacket image, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the memory stores a semantic segmentation model constructed by the method for automatically generating a semantic segmentation map based on a down jacket image according to any one of claims 1 to 9, and the processor, when executing the computer program, implements the following steps:
acquiring a down jacket image;
calling the semantic segmentation model to perform semantic segmentation on the down jacket image, which comprises: encoding the down jacket image with the inverse mapping encoder to obtain a w+ space latent code, inputting the w+ space latent code into the generator, generating a plurality of feature maps from the w+ space code vector with the generator, and computing a semantic segmentation result from the feature maps generated by the generator with the pixel-level classifier.
CN202211520959.9A 2022-11-29 2022-11-29 Method and device for automatically generating semantic segmentation graph based on down jacket image Pending CN115861614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211520959.9A CN115861614A (en) 2022-11-29 2022-11-29 Method and device for automatically generating semantic segmentation graph based on down jacket image

Publications (1)

Publication Number Publication Date
CN115861614A true CN115861614A (en) 2023-03-28

Family

ID=85668366

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117340280A (en) * 2023-12-05 2024-01-05 成都斐正能达科技有限责任公司 LPBF additive manufacturing process monitoring method
CN117340280B (en) * 2023-12-05 2024-02-13 成都斐正能达科技有限责任公司 LPBF additive manufacturing process monitoring method
CN117409208A (en) * 2023-12-14 2024-01-16 武汉纺织大学 Real-time clothing image semantic segmentation method and system
CN117409208B (en) * 2023-12-14 2024-03-08 武汉纺织大学 Real-time clothing image semantic segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination