CN115082295B - Image editing method and device based on self-attention mechanism - Google Patents
- Publication number: CN115082295B (application number CN202210715523.9A)
- Authority: CN (China)
- Prior art keywords: image, image editing, information, editing information, network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
- G06T 15/205 — 3D [Three Dimensional] image rendering; geometric effects; perspective computation; image-based rendering
- G06T 9/002 — Image coding using neural networks
- G06T 2207/20084 — Indexing scheme for image analysis or image enhancement; special algorithmic details; artificial neural networks [ANN]
Abstract
The invention discloses a fashion image editing method and device based on a self-attention mechanism. The method comprises the following steps: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels to generate refined image editing results of different levels, and predicting masks corresponding to target images; extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level; point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated. The device comprises a processor and a memory. The invention improves the quality and accuracy of the generated image.
Description
Technical Field
The present invention relates to the field of image generation in computer vision, and in particular, to an image editing method and apparatus based on a self-attention mechanism.
Background
With the rapid development and growing popularity of the internet, fashion image editing technology has been widely applied in many fields. For example, virtual fitting not only enhances the consumer experience and changes the traditional shopping mode, but also helps to reduce sales costs [1]. Likewise, gesture-guided human body image generation technology has many potential applications in movie production, online shopping, pedestrian re-recognition, and other fields. Face editing and fashion editing help inject new vitality into the fashion field and improve the consumer experience. Deep learning technology is used to generate realistic fashion images and plays an important role in fashion design, marketing, and intelligent industrial development.
In recent years, much research has focused on extracting feature information of global images by applying convolutional neural networks to the original image and the image transformation information, and on realizing deformation or editing of images by estimating mapping relations between features, e.g., APS [2]. However, a convolutional neural network can only attend to information near the convolution kernel and cannot fuse information far away from it. In a fashion image generation task, not only the relations and influences among global information, but also the relations between channel information and image information must be considered, so the original information is often destroyed when the image is edited.
In addition, previous work often estimated the mapping relations between features based on thin-plate spline transformations or appearance flow transformations, e.g., ClothFlow [3]. However, thin-plate spline transformations cannot accurately handle large geometric deformations, and appearance flow transformations, owing to their high degrees of freedom and lack of proper regularization, often cause severe deformations during image transformation, producing significant texture artifacts. Moreover, conventional appearance flow transformations and thin-plate spline transformations cannot generate information that does not exist in the original image, resulting in failure to effectively complete the image editing task.
Disclosure of Invention
The invention provides an image editing method and device based on a self-attention mechanism. Taking an original image and image editing information as input data, the method extracts high-level feature information of each, estimates a multi-level appearance flow transformation matrix from the transformation and mapping relations between the high-level features, and generates a series of coarse target edit images using the multi-level appearance flow transformation matrix. On this basis, a self-attention mechanism captures the relations among local information, optimizes the coarse target edit images, and generates the final fashion editing image, improving the quality and accuracy of the generated images. Details are described below:
in a first aspect, a method of editing an image based on a self-attention mechanism, the method comprising:
extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated.
Wherein, the original image and the image editing information are as follows: for a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Further, the convolutional neural network is:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein the two feature extraction networks extract features from the original image and the image editing information respectively; each feature extraction network comprises a downsampling operation and two residual networks, each downsampling operation comprises one convolution layer, a data normalization process, and an activation function, and each residual network comprises two convolution layers, two data normalization processes, and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers.
Wherein the appearance stream transformation matrix comprises: the system comprises a coordinate transformation matrix and a pixel deviation matrix, wherein the coordinate transformation matrix rearranges pixels in an original image and is used for bending and transforming the original image; the pixel deviation matrix compensates pixels after coordinate transformation and is used for generating editing information which is not in an original image.
Further, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which can be regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder part of the FlowNet Cor network extracts the features of the original image and the image editing information through three convolution layers respectively, and then traverses the image blocks in the two features to perform correlation calculation, wherein the center coordinate is (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
Wherein, the attention weight matrix is:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
In a second aspect, an image editing apparatus based on a self-attention mechanism comprises: a processor and a memory, wherein the memory stores program instructions and the processor calls the program instructions stored in the memory to cause the apparatus to perform the method steps of any implementation of the first aspect.
In a third aspect, a computer readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method steps of any of the first aspects.
The technical scheme provided by the invention has the beneficial effects that:
1. By performing feature extraction and relevance fusion on the original image and the image editing information, the invention finds an accurate feature mapping relation, effectively captures the connections among the local information of the image using self-attention, calculates an attention weight matrix, and uses the image editing information as prior information for constraint. Unlike traditional self-attention, which uses only a single image block, the invention adopts two independent image blocks to connect channel attention and spatial attention in series and performs affine transformation on the extracted channel and spatial features, so that the network understands the spatial and channel feature information of the image more accurately, its ability to capture long-distance dependencies is enhanced, and the accuracy of the generated image is improved.
2. By estimating appearance flow transformation matrices at multiple levels, the method preserves and transfers feature mapping relations at different scales and avoids severe deformation of the original image; the invention adopts a coordinate transformation matrix and a pixel compensation matrix, which enriches the image information generated by the appearance transformation; and the coarse image is rendered and refined through the cyclic neural network, optimizing the quality of the generated image.
Therefore, the method and the device can effectively estimate the feature mapping relation of the original image and the image editing information, capture the association between the local information of the image and improve the quality and the accuracy of the generated fashion editing image.
Drawings
FIG. 1 is a flow chart of an image editing method based on a self-attention mechanism;
FIG. 2 is a schematic diagram of an image editing method based on a self-attention mechanism, taking gesture transformation as an example;
FIG. 3 is a schematic diagram of an appearance flow transformation of an image editing method based on a self-attention mechanism, taking a gesture transformation as an example;
FIG. 4 is a schematic diagram of a refinement network of an image editing method based on a self-attention mechanism, taking gesture transformation as an example;
FIG. 5 is a schematic diagram of a self-attention image generation network of a self-attention mechanism based image editing method, taking gesture transformation as an example;
FIG. 6 is a schematic structural diagram of an image editing apparatus based on a self-attention mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
An image editing method based on a self-attention mechanism, see fig. 1, the method comprising the steps of:
step 101: inputting an original image and image editing information;
step 102: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
step 103: generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes to generate a series of rough image editing results with different sizes;
step 104: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining rough image editing results of different levels, predicting masks corresponding to target images, and storing and transmitting semantic mapping results of different feature levels;
step 105: utilizing self-attention to capture the relations among local information, traversing channel image blocks and spatial image blocks to calculate the attention weight matrix of the current level, generating the image editing result of the current level, and decoding it through a convolutional neural network to generate the final fashion editing image.
In summary, the embodiment of the invention improves the quality and accuracy of fashion image editing through the steps 101-105, and meets the personalized requirements in practical application.
Example 2
The scheme of example 1 is further described below in conjunction with specific formulas and examples, as described below:
201: inputting an original image and image editing information;
The original image v and the image editing information p differ across the different fashion image editing tasks. For a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Depending on the input information, the method realizes these four fashion image editing tasks with a single model architecture.
202: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
for the input original image and image editing information, resNet-based is used [4] The convolutional neural network of the architecture design extracts characteristic information pairs of different levels. Specifically, two multi-scale feature extraction networks are constructed, each network extracting features from the original image and the image editing information respectively, each feature extraction network comprising a downsampling operation and two residual networks, each downsampling operation comprising a layer of convolution, a data normalization process and an activation function, each residual network comprising two layers of convolution, two times of data normalization process and two activation functions. The convolution kernel size is 3×3, the activation function selects a ReLU activation function, and the ReLU function is ReLU (x) =max (x, 0), where max is a maximum function. The two multi-scale feature extraction networks respectively generate three feature matrixes with 256 channels and different sizes to form multi-level feature information pairs. The obtained characteristic information pair of the multi-layer level is { { { c 1 ,p 1 },{c 2 ,p 2 },{c 3 ,p 3 }},c i ,p i ∈R H×W×C Wherein c i Characteristic information representing the ith layer extracted from the original image, p i And the characteristic information of the ith layer extracted from the image editing information is represented, H, W and C respectively represent the height, width and channel number of view characteristics, and R is a real number set.
203: generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes to generate a series of rough image editing results with different sizes; for each level, the appearance flow transformation matrix of the current level is calculated using the redesigned appearance flow transformation estimation network with the feature information pairs and the appearance flow transformation matrix generated by the previous level as inputs. Taking the second layer as an example:
f_2 = F({c_2, p_2}, f_1)   (1)
wherein {c_2, p_2} is the feature information pair of the second level, f_1 represents the computed appearance flow transformation matrix of the first level, f_2 represents the computed appearance flow transformation matrix of the second level, and F represents the appearance flow transformation estimation network of the current level.
Specifically, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple [5] network and two FlowNetCorr [5] networks; compared with a single FlowNetSimple or FlowNetCorr network, the stacked network can effectively prevent overfitting. Both networks can be seen as encoder-decoder architectures. The encoder section of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and then extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function. The encoder part of the FlowNetCorr network first extracts features of the original image and the image editing information through three convolution layers respectively, and then traverses image blocks in the two features to perform correlation calculation; for an image block with center coordinates (x_1, x_2), the correlation is calculated as:
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨f_1(x_1 + o), f_2(x_2 + o)⟩
wherein f_1 and f_2 represent the features of the original image and the image editing information respectively, and k represents the size of the image block; the correlation at center coordinates (x_1, x_2), obtained by summing the dot products of the two feature vectors at corresponding positions within the current image block, is used for subsequent decoding. The appearance flow transformation estimation network of this embodiment is also improved for the small-displacement case: the 7×7 and 5×5 convolution kernels in the encoder part are replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
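To illustrate the correlation calculation above, the following is a naive PyTorch sketch of a FlowNetCorr-style local correlation volume. The function name local_correlation and the parameters max_disp and k are assumptions for exposition, and the loop-based form favors clarity over speed.

```python
import torch
import torch.nn.functional as F

def local_correlation(f1, f2, max_disp=4, k=1):
    """For every displacement within max_disp, sum the dot products of
    the (2k+1)x(2k+1) patches of f1 and the displaced f2, i.e. the
    correlation c(x1, x2) described above.
    f1, f2: (B, C, H, W) features of the original image and the image
    editing information."""
    b, c, h, w = f1.shape
    pad = max_disp
    f2p = F.pad(f2, (pad, pad, pad, pad))
    out = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = f2p[:, :, pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            corr = (f1 * shifted).sum(dim=1, keepdim=True)  # dot over channels
            # sum over the (2k+1)x(2k+1) patch around each position
            corr = F.avg_pool2d(corr, 2 * k + 1, stride=1,
                                padding=k) * (2 * k + 1) ** 2
            out.append(corr)
    return torch.cat(out, dim=1)  # (B, (2*max_disp+1)**2, H, W)
```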
The appearance flow transformation matrix f_i produced at each stage is used to transform the original image, resized to the corresponding scale, generating a series of coarse image editing results w_i of different sizes for subsequent image generation. Unlike the original appearance flow transformation matrix based on linear coordinate transformation, the appearance flow transformation matrix of this embodiment comprises not only a coordinate transformation matrix but also a pixel deviation matrix. A fashion editing task often requires generating information that never appeared in the original image, and if the coarse edit image were generated by coordinate transformation alone, serious distortion of the image would easily result. Through the pixel deviation matrix, pixels can be compensated after the coordinate transformation to generate editing information that is not in the original image.
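A hedged sketch of how the two-part appearance flow transformation could be applied: the coordinate transformation is realized here via grid_sample over a dense offset field, and the pixel deviation matrix is then added to compensate pixels that coordinate rearrangement alone cannot produce. The function and argument names are illustrative assumptions; the patent does not pin down this exact parameterization.

```python
import torch
import torch.nn.functional as F

def apply_appearance_flow(image, coord_flow, pixel_dev):
    """Warp `image` with a per-pixel coordinate transformation and then
    add a pixel deviation map.
    image:      (B, C, H, W)
    coord_flow: (B, 2, H, W) sampling offsets in pixels (x, y)
    pixel_dev:  (B, C, H, W) additive pixel compensation"""
    b, _, h, w = image.shape
    # base sampling grid in normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)
    base = base.expand(b, -1, -1, -1).to(image.device)
    # convert pixel offsets to normalized offsets, then warp
    offs = coord_flow.permute(0, 2, 3, 1).clone()
    offs[..., 0] *= 2.0 / max(w - 1, 1)
    offs[..., 1] *= 2.0 / max(h - 1, 1)
    warped = F.grid_sample(image, base + offs, align_corners=True)
    # deviation matrix supplies content absent from the original image
    return warped + pixel_dev
```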
204: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining rough image editing results of different levels, predicting masks corresponding to target images, and storing and transmitting semantic mapping results of different feature levels;
For the already generated coarse image editing results, this embodiment uses a recurrent residual convolutional neural network, R2U-Net [6], to render and refine them. Compared with an ordinary convolutional neural network, the recurrent residual convolutional neural network uses residual blocks instead of the traditional convolution layer and activation function during encoding and decoding, which effectively increases the achievable network depth, and accumulating features layer by layer with recurrent residual convolution facilitates feature extraction. The rendering process is guided by the image editing information, generating a series of edited images u_i at different levels and their corresponding masks m_i, namely:
u_i, m_i = R(w_i, p)   (2)
The generated mask eliminates redundant information in the rendered image and retains the necessary information of the original image:
v_i = m_i ⊙ u_i + (1 − m_i) ⊙ v̂_i   (3)
wherein ⊙ indicates element-wise multiplication, v̂_i represents the image obtained by sampling the original image to the corresponding size, and v_i represents the generated refined target edit image.
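Equation (3) amounts to a one-line mask composition; a minimal sketch, assuming all tensors share the same (B, C, H, W) shape and the mask lies in [0, 1]:

```python
def compose_refined(u_i, m_i, v_hat_i):
    """Equation (3): keep the rendered edit u_i where the mask is on and
    the (resized) original image v_hat_i where it is off."""
    return m_i * u_i + (1.0 - m_i) * v_hat_i
```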
205: capturing the connections among local information using self-attention, traversing image blocks to generate the image editing result of the current level, and decoding it through a convolutional neural network to generate the final fashion editing image.
The self-attention mechanism takes the features of the refined image and the image editing information and the mask of the corresponding level as input, and performs feature relevance combination so as to calculate an attention weight matrix of the image for capturing key information in the image editing information. Specifically:
Image blocks f_s(x, y) and f_t(x, y) are selected by traversal from the features f_s of the refined image editing result and the features f_t of the image editing information respectively. Unlike the single n×n image block of a conventional self-attention mechanism, this embodiment employs two independent n×n image blocks traversing along the spatial dimension and the channel dimension, connecting channel attention and spatial attention in series, which effectively strengthens the connection of the attention weight information across channels and space.
Here f_s(x, y) and f_t(x, y) represent the feature vectors of the refined image editing result and of the image editing information at coordinates (x, y), and M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector k(x, y):
k(x, y) = M(f_s(x, y), f_t(x, y))   (4)
The kernel vectors k(x, y) of all the generated image blocks form the attention weight matrix. The self-attention result for the image block of the editing-information features f_t at (x, y) is computed by dot multiplication, and a global average pooling operation then yields the image editing result p(x, y) at coordinates (x, y), namely:
p(x, y) = Pooling(k(x, y) ⊙ f_t(x, y))   (5)
Compared with directly using f_t(x, y) as the current-level feature of the image editing information for decoding, the feature p(x, y) obtained with the attention mechanism lets the model focus on the important information in the image editing information and fully learn and absorb it.
Traversing all image blocks generates the features p_attn of the current level of image editing information; the generated mask m then eliminates redundant information in the generated features and retains the necessary information of the original image editing information:
p_out = m ⊙ p_attn + (1 − m) ⊙ f_t   (6)
and gradually using the self-attention module and the decoder for different layers to obtain a final image editing result.
In summary, the embodiment of the invention improves the quality and accuracy of fashion image editing through the steps 201 to 205, and meets the personalized requirements in practical application.
Example 3
An image editing apparatus based on a self-attention mechanism, the apparatus comprising: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in embodiment 1: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated.
Wherein, the original image and the image editing information are as follows: for a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Further, the convolutional neural network is:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein the two feature extraction networks extract features from the original image and the image editing information respectively; each feature extraction network comprises a downsampling operation and two residual networks, each downsampling operation comprises one convolution layer, a data normalization process, and an activation function, and each residual network comprises two convolution layers, two data normalization processes, and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers.
Wherein the appearance stream transformation matrix comprises: the coordinate transformation matrix rearranges pixels in the original image and is used for bending and transforming the original image; the pixel deviation matrix compensates the pixels after the coordinate transformation and is used for generating editing information which is not in the original image.
Further, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which can be regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder portion of the FlowNetCor network is first divided by three convolutional layersFeatures of the original image and the image editing information are extracted respectively, then image blocks in the two features are traversed to perform correlation calculation, and center coordinates are (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
Wherein, the attention weight matrix is:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
It should be noted that, the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein in detail.
The processor 1 and the memory 2 may be carried on any device with computing functions, such as a computer, a single-chip microcomputer, or a microcontroller; in specific implementation the execution bodies are not limited and are selected according to the needs of the practical application.
Data signals are transmitted between the memory 2 and the processor 1 via the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, the embodiment of the present invention also provides a computer readable storage medium, where the storage medium includes a stored program, and when the program runs, the device where the storage medium is controlled to execute the method steps in the above embodiment.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the readable storage medium descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions which, when loaded and executed on a computer, produce in whole or in part the flows or functions according to the embodiments of the invention.
The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. Computer readable storage media can be any available media that can be accessed by a computer or data storage devices, such as servers, data centers, etc., that contain an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, or the like.
References
[1] Ge Y, Song Y, Zhang R, et al. Parser-free virtual try-on via distilling appearance flows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 8485-8493.
[2] Huang S, Xiong H, Cheng Z Q, et al. Generating person images with appearance-aware pose stylizer[J]. arXiv preprint arXiv:2007.09077, 2020.
[3] Han X, Hu X, Huang W, et al. ClothFlow: A flow-based model for clothed person generation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10471-10480.
[4] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[5] Ilg E, Mayer N, Saikia T, et al. FlowNet 2.0: Evolution of optical flow estimation with deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2462-2470.
[6] Alom M Z, Hasan M, Yakopcic C, et al. Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation[J]. arXiv preprint arXiv:1802.06955, 2018.
The embodiments of the invention do not limit the models of the devices involved, as long as the devices can complete the above functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (4)
1. A method of image editing based on a self-attention mechanism, the method comprising:
extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final editing image is generated;
wherein, the original image and the image editing information are:
for a virtual fitting task, the original image is a character image, and the image editing information is a clothing picture to be replaced; for a gesture-guided character image editing task, the original image is a character image, and the image editing information is a target human gesture; for a face editing task, the original image is a face image, and the image editing information is a semantic segmentation map edited by a user; for a fashion editing task, the original image is a person image, and the image editing information is a sketch edited by a user;
the convolutional neural network is as follows:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein each feature extraction network respectively extracts features from an original image and image editing information, each feature extraction network comprises a downsampling operation and two residual error networks, each downsampling operation comprises a layer of convolution, a data normalization process and an activation function, and each residual error network comprises two layers of convolution, two times of data normalization process and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers;
the appearance flow transformation matrix comprises: the system comprises a coordinate transformation matrix and a pixel deviation matrix, wherein the coordinate transformation matrix rearranges pixels in an original image and is used for bending and transforming the original image; the pixel deviation matrix compensates pixels after coordinate transformation and is used for generating editing information which is not in an original image;
the attention weight matrix is as follows:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
2. The image editing method based on a self-attention mechanism according to claim 1, wherein each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which is regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder part of the FlowNet Cor network extracts the features of the original image and the image editing information through three convolution layers respectively, and then traverses the image blocks in the two features to perform correlation calculation, wherein the center coordinate is (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
3. An image editing apparatus based on a self-attention mechanism, the apparatus comprising: a processor and a memory, wherein the memory stores program instructions and the processor calls the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-2.
4. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210715523.9A CN115082295B (en) | 2022-06-23 | 2022-06-23 | Image editing method and device based on self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115082295A CN115082295A (en) | 2022-09-20 |
CN115082295B true CN115082295B (en) | 2024-04-02 |
Family
ID=83253829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210715523.9A Active CN115082295B (en) | 2022-06-23 | 2022-06-23 | Image editing method and device based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082295B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113570685A (en) * | 2021-01-27 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Image processing method and device, electronic device and storage medium |
CN114639161A (en) * | 2022-02-21 | 2022-06-17 | 深圳市海清视讯科技有限公司 | Training method of multitask model and virtual fitting method of clothes |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018128741A1 (en) * | 2017-01-06 | 2018-07-12 | Board Of Regents, The University Of Texas System | Segmenting generic foreground objects in images and videos |
Non-Patent Citations (5)

Title |
---|
"FlowNet 2.0: Evolution of optical flow estimation with deep networks"; Eddy Ilg et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017-11-09; pp. 2462-2470 * |
"FP-VTON: Feature-preserving virtual try-on network based on an attention mechanism" (FP-VTON:基于注意力机制的特征保持虚拟试衣网络); 谭泽霖 et al.; https://kns.cnki.net/kcms/detail/11.2127.TP.20210706.1000.004.html; 2021-07-06; pp. 1-16 * |
"ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on"; Gaurav Kuppa et al.; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops; 2021-12-31; pp. 191-200 * |
"Research on image completion algorithms based on an attention mechanism" (基于注意力机制的图像补全算法研究); 张晓峰; China Masters' Theses Full-text Database (Information Science and Technology); 2022-03-15 * |
"Single-image rain removal based on a multi-channel multi-scale convolutional neural network" (基于多通道多尺度卷积神经网络的单幅图像去雨方法); 柳长源, 王琪, 毕晓君; Journal of Electronics & Information Technology; 2020-09-15 (09); pp. 224-231 * |
Also Published As
Publication number | Publication date |
---|---|
CN115082295A (en) | 2022-09-20 |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant