CN115082295B - Image editing method and device based on self-attention mechanism - Google Patents
- Publication number: CN115082295B (application number CN202210715523.9A)
- Authority: CN (China)
- Prior art keywords: image, image editing, information, editing information, network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 3/04 — Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
- G06T 15/205 — 3D [Three Dimensional] image rendering; geometric effects; perspective computation; image-based rendering
- G06T 9/002 — Image coding using neural networks
- G06T 2207/20084 — Indexing scheme for image analysis or image enhancement; special algorithmic details; artificial neural networks [ANN]
Abstract
The invention discloses a fashion image editing method and device based on a self-attention mechanism. The method comprises the following steps: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels to generate refined image editing results of different levels, and predicting masks corresponding to target images; extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level; point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated. The device comprises a processor and a memory. The invention improves the quality and accuracy of the generated image.
Description
Technical Field
The present invention relates to the field of image generation in computer vision, and in particular, to an image editing method and apparatus based on a self-attention mechanism.
Background
With the rapid development and growing popularity of the internet, fashion image editing technology has been widely applied in many fields. For example, virtual fitting not only enhances the consumer experience and changes the traditional shopping mode, but also helps to reduce sales costs [1]. Likewise, gesture-guided human body image generation technology has many potential applications in movie production, online shopping, pedestrian re-recognition, and other fields. Face editing and fashion editing help inject new vitality into the fashion field and improve the consumer experience. Deep learning technology is used to generate realistic fashion images and plays an important role in fashion design, marketing, and intelligent industrial development.
In recent years, much research has focused on extracting feature information of global images by applying convolutional neural networks to the original image and the image transformation information, and on realizing deformation or editing of images by estimating mapping relations between features, e.g., APS [2]. However, a convolutional neural network can only attend to information near the convolution kernel and cannot fuse information far away from it. In a fashion image generation task, not only the relations and influences among global information, but also the relations between channel information and image information must be considered, so the original information is often destroyed when the image is edited.
In addition, previous work often estimated the mapping relations between features based on thin-plate spline transformations or appearance flow transformations, e.g., ClothFlow [3]. However, thin-plate spline transformations cannot accurately handle large geometric deformations, and appearance flow transformations, owing to their high degrees of freedom and lack of proper regularization, often cause severe deformations during image transformation, producing significant texture artifacts. Moreover, conventional appearance flow transformations and thin-plate spline transformations cannot generate information that does not exist in the original image, resulting in failure to effectively complete the image editing task.
Disclosure of Invention
The invention provides an image editing method and device based on a self-attention mechanism. Taking an original image and image editing information as input data, the method extracts high-level feature information of each, estimates a multi-level appearance flow transformation matrix from the transformation and mapping relations between the high-level features, and generates a series of coarse target edit images using the multi-level appearance flow transformation matrix. On this basis, a self-attention mechanism captures the relations among local information, optimizes the coarse target edit images, and generates the final fashion editing image, improving the quality and accuracy of the generated images. Details are described below:
in a first aspect, a method of editing an image based on a self-attention mechanism, the method comprising:
extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated.
Wherein, the original image and the image editing information are as follows: for a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Further, the convolutional neural network is:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein the two feature extraction networks extract features from the original image and the image editing information respectively; each feature extraction network comprises a downsampling operation and two residual networks, each downsampling operation comprises one convolution layer, a data normalization process, and an activation function, and each residual network comprises two convolution layers, two data normalization processes, and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers.
Wherein the appearance stream transformation matrix comprises: the system comprises a coordinate transformation matrix and a pixel deviation matrix, wherein the coordinate transformation matrix rearranges pixels in an original image and is used for bending and transforming the original image; the pixel deviation matrix compensates pixels after coordinate transformation and is used for generating editing information which is not in an original image.
Further, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which can be regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder part of the FlowNet Cor network extracts the features of the original image and the image editing information through three convolution layers respectively, and then traverses the image blocks in the two features to perform correlation calculation, wherein the center coordinate is (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
Wherein, the attention weight matrix is:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
In a second aspect, an image editing apparatus based on a self-attention mechanism comprises: a processor and a memory, wherein the memory stores program instructions and the processor calls the program instructions stored in the memory to cause the apparatus to perform the method steps of any implementation of the first aspect.
In a third aspect, a computer readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method steps of any of the first aspects.
The technical scheme provided by the invention has the beneficial effects that:
1. By performing feature extraction and relevance fusion on the original image and the image editing information, the invention finds an accurate feature mapping relation, effectively captures the connections among the local information of the image using self-attention, calculates an attention weight matrix, and uses the image editing information as prior information for constraint. Unlike traditional self-attention, which uses only a single image block, the invention adopts two independent image blocks to connect channel attention and spatial attention in series and performs affine transformation on the extracted channel and spatial features, so that the network understands the spatial and channel feature information of the image more accurately, its ability to capture long-distance dependencies is enhanced, and the accuracy of the generated image is improved.
2. By estimating appearance flow transformation matrices at multiple levels, the method preserves and transfers feature mapping relations at different scales and avoids severe deformation of the original image; the invention adopts a coordinate transformation matrix and a pixel compensation matrix, which enriches the image information generated by the appearance transformation; and the coarse image is rendered and refined through the cyclic neural network, optimizing the quality of the generated image.
Therefore, the method and the device can effectively estimate the feature mapping relation of the original image and the image editing information, capture the association between the local information of the image and improve the quality and the accuracy of the generated fashion editing image.
Drawings
FIG. 1 is a flow chart of an image editing method based on a self-attention mechanism;
FIG. 2 is a schematic diagram of an image editing method based on a self-attention mechanism, taking gesture transformation as an example;
FIG. 3 is a schematic diagram of an appearance flow transformation of an image editing method based on a self-attention mechanism, taking a gesture transformation as an example;
FIG. 4 is a schematic diagram of a refinement network of an image editing method based on a self-attention mechanism, taking gesture transformation as an example;
FIG. 5 is a schematic diagram of a self-attention image generation network of a self-attention mechanism based image editing method, taking gesture transformation as an example;
FIG. 6 is a schematic structural diagram of an image editing apparatus based on a self-attention mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
An image editing method based on a self-attention mechanism, see fig. 1, the method comprising the steps of:
step 101: inputting an original image and image editing information;
step 102: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
step 103: generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes to generate a series of rough image editing results with different sizes;
step 104: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining rough image editing results of different levels, predicting masks corresponding to target images, and storing and transmitting semantic mapping results of different feature levels;
step 105: utilizing self-attention to capture the relations among local information, traversing channel image blocks and spatial image blocks to calculate the attention weight matrix of the current level, generating the image editing result of the current level, and decoding it through a convolutional neural network to generate the final fashion editing image.
In summary, the embodiment of the invention improves the quality and accuracy of fashion image editing through the steps 101-105, and meets the personalized requirements in practical application.
Example 2
The scheme of example 1 is further described below in conjunction with specific formulas and examples, as described below:
201: inputting an original image and image editing information;
The original image v and the image editing information p differ across the different fashion image editing tasks. For a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Depending on the input information, the method realizes these four fashion image editing tasks with a single model architecture.
202: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
for the input original image and image editing information, resNet-based is used [4] The convolutional neural network of the architecture design extracts characteristic information pairs of different levels. Specifically, two multi-scale feature extraction networks are constructed, each network extracting features from the original image and the image editing information respectively, each feature extraction network comprising a downsampling operation and two residual networks, each downsampling operation comprising a layer of convolution, a data normalization process and an activation function, each residual network comprising two layers of convolution, two times of data normalization process and two activation functions. The convolution kernel size is 3×3, the activation function selects a ReLU activation function, and the ReLU function is ReLU (x) =max (x, 0), where max is a maximum function. The two multi-scale feature extraction networks respectively generate three feature matrixes with 256 channels and different sizes to form multi-level feature information pairs. The obtained characteristic information pair of the multi-layer level is { { { c 1 ,p 1 },{c 2 ,p 2 },{c 3 ,p 3 }},c i ,p i ∈R H×W×C Wherein c i Characteristic information representing the ith layer extracted from the original image, p i And the characteristic information of the ith layer extracted from the image editing information is represented, H, W and C respectively represent the height, width and channel number of view characteristics, and R is a real number set.
203: generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes to generate a series of rough image editing results with different sizes; for each level, the appearance flow transformation matrix of the current level is calculated using the redesigned appearance flow transformation estimation network with the feature information pairs and the appearance flow transformation matrix generated by the previous level as inputs. Taking the second layer as an example:
f_2 = F({c_2, p_2}, f_1)   (1)
wherein {c_2, p_2} is the feature information pair of the second level, f_1 represents the computed appearance flow transformation matrix of the first level, f_2 represents the computed appearance flow transformation matrix of the second level, and F represents the appearance flow transformation estimation network of the current level.
Specifically, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple [5] network and two FlowNetCorr [5] networks; compared with a single FlowNetSimple or FlowNetCorr network, the stacked network can effectively prevent overfitting. Both networks can be seen as encoder-decoder architectures. The encoder section of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and then extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function. The encoder part of the FlowNetCorr network first extracts features of the original image and the image editing information through three convolution layers respectively, and then traverses image blocks in the two features to perform correlation calculation; for an image block with center coordinates (x_1, x_2), the correlation is calculated as:
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨f_1(x_1 + o), f_2(x_2 + o)⟩
wherein f_1 and f_2 represent the features of the original image and the image editing information respectively, and k represents the size of the image block; the correlation at center coordinates (x_1, x_2), obtained by summing the dot products of the two feature vectors at corresponding positions within the current image block, is used for subsequent decoding. The appearance flow transformation estimation network of this embodiment is also improved for the small-displacement case: the 7×7 and 5×5 convolution kernels in the encoder part are replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
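To illustrate the correlation calculation above, the following is a naive PyTorch sketch of a FlowNetCorr-style local correlation volume. The function name local_correlation and the parameters max_disp and k are assumptions for exposition, and the loop-based form favors clarity over speed.

```python
import torch
import torch.nn.functional as F

def local_correlation(f1, f2, max_disp=4, k=1):
    """For every displacement within max_disp, sum the dot products of
    the (2k+1)x(2k+1) patches of f1 and the displaced f2, i.e. the
    correlation c(x1, x2) described above.
    f1, f2: (B, C, H, W) features of the original image and the image
    editing information."""
    b, c, h, w = f1.shape
    pad = max_disp
    f2p = F.pad(f2, (pad, pad, pad, pad))
    out = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = f2p[:, :, pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            corr = (f1 * shifted).sum(dim=1, keepdim=True)  # dot over channels
            # sum over the (2k+1)x(2k+1) patch around each position
            corr = F.avg_pool2d(corr, 2 * k + 1, stride=1,
                                padding=k) * (2 * k + 1) ** 2
            out.append(corr)
    return torch.cat(out, dim=1)  # (B, (2*max_disp+1)**2, H, W)
```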
The appearance flow transformation matrix f_i produced at each stage is used to transform the original image, resized to the corresponding scale, generating a series of coarse image editing results w_i of different sizes for subsequent image generation. Unlike the original appearance flow transformation matrix based on linear coordinate transformation, the appearance flow transformation matrix of this embodiment comprises not only a coordinate transformation matrix but also a pixel deviation matrix. A fashion editing task often requires generating information that never appeared in the original image, and if the coarse edit image were generated by coordinate transformation alone, serious distortion of the image would easily result. Through the pixel deviation matrix, pixels can be compensated after the coordinate transformation to generate editing information that is not in the original image.
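A hedged sketch of how the two-part appearance flow transformation could be applied: the coordinate transformation is realized here via grid_sample over a dense offset field, and the pixel deviation matrix is then added to compensate pixels that coordinate rearrangement alone cannot produce. The function and argument names are illustrative assumptions; the patent does not pin down this exact parameterization.

```python
import torch
import torch.nn.functional as F

def apply_appearance_flow(image, coord_flow, pixel_dev):
    """Warp `image` with a per-pixel coordinate transformation and then
    add a pixel deviation map.
    image:      (B, C, H, W)
    coord_flow: (B, 2, H, W) sampling offsets in pixels (x, y)
    pixel_dev:  (B, C, H, W) additive pixel compensation"""
    b, _, h, w = image.shape
    # base sampling grid in normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)
    base = base.expand(b, -1, -1, -1).to(image.device)
    # convert pixel offsets to normalized offsets, then warp
    offs = coord_flow.permute(0, 2, 3, 1).clone()
    offs[..., 0] *= 2.0 / max(w - 1, 1)
    offs[..., 1] *= 2.0 / max(h - 1, 1)
    warped = F.grid_sample(image, base + offs, align_corners=True)
    # deviation matrix supplies content absent from the original image
    return warped + pixel_dev
```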
204: extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining rough image editing results of different levels, predicting masks corresponding to target images, and storing and transmitting semantic mapping results of different feature levels;
For the already generated coarse image editing results, this embodiment uses a recurrent residual convolutional neural network, R2U-Net [6], to render and refine them. Compared with an ordinary convolutional neural network, the recurrent residual convolutional neural network uses residual blocks instead of the traditional convolution layer and activation function during encoding and decoding, which effectively increases the achievable network depth, and accumulating features layer by layer with recurrent residual convolution facilitates feature extraction. The rendering process is guided by the image editing information, generating a series of edited images u_i at different levels and their corresponding masks m_i, namely:
u_i, m_i = R(w_i, p)   (2)
The generated mask eliminates redundant information in the rendered image and retains the necessary information of the original image:
v_i = m_i ⊙ u_i + (1 − m_i) ⊙ v̂_i   (3)
wherein ⊙ indicates element-wise multiplication, v̂_i represents the image obtained by sampling the original image to the corresponding size, and v_i represents the generated refined target edit image.
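Equation (3) amounts to a one-line mask composition; a minimal sketch, assuming all tensors share the same (B, C, H, W) shape and the mask lies in [0, 1]:

```python
def compose_refined(u_i, m_i, v_hat_i):
    """Equation (3): keep the rendered edit u_i where the mask is on and
    the (resized) original image v_hat_i where it is off."""
    return m_i * u_i + (1.0 - m_i) * v_hat_i
```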
205: capturing the connections among local information using self-attention, traversing image blocks to generate the image editing result of the current level, and decoding it through a convolutional neural network to generate the final fashion editing image.
The self-attention mechanism takes the features of the refined image and the image editing information and the mask of the corresponding level as input, and performs feature relevance combination so as to calculate an attention weight matrix of the image for capturing key information in the image editing information. Specifically:
Image blocks f_s(x, y) and f_t(x, y) are selected by traversal from the features f_s of the refined image editing result and the features f_t of the image editing information respectively. Unlike the single n×n image block of a conventional self-attention mechanism, this embodiment employs two independent n×n image blocks traversing along the spatial dimension and the channel dimension, connecting channel attention and spatial attention in series, which effectively strengthens the connection of the attention weight information across channels and space.
Here f_s(x, y) and f_t(x, y) represent the feature vectors of the refined image editing result and of the image editing information at coordinates (x, y), and M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector k(x, y):
k(x, y) = M(f_s(x, y), f_t(x, y))   (4)
The kernel vectors k(x, y) of all the generated image blocks form the attention weight matrix. The self-attention result for the image block of the editing-information features f_t at (x, y) is computed by dot multiplication, and a global average pooling operation then yields the image editing result p(x, y) at coordinates (x, y), namely:
p(x, y) = Pooling(k(x, y) ⊙ f_t(x, y))   (5)
Compared with directly using f_t(x, y) as the current-level feature of the image editing information for decoding, the feature p(x, y) obtained with the attention mechanism lets the model focus on the important information in the image editing information and fully learn and absorb it.
Traversing all image blocks generates the features p_attn of the current level of image editing information; the generated mask m then eliminates redundant information in the generated features and retains the necessary information of the original image editing information:
p_out = m ⊙ p_attn + (1 − m) ⊙ f_t   (6)
and gradually using the self-attention module and the decoder for different layers to obtain a final image editing result.
In summary, the embodiment of the invention improves the quality and accuracy of fashion image editing through the steps 201 to 205, and meets the personalized requirements in practical application.
Example 3
An image editing apparatus based on a self-attention mechanism, the apparatus comprising: a processor 1 and a memory 2, the memory 2 having stored therein program instructions, the processor 1 calling the program instructions stored in the memory 2 to cause the apparatus to perform the following method steps in embodiment 1: extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final fashion editing image is generated.
Wherein, the original image and the image editing information are as follows: for a virtual fitting task, the original image is a person image and the image editing information is the clothing picture to be replaced; for a gesture-guided person image editing task, the original image is a person image and the image editing information is the target human gesture; for a face editing task, the original image is a face image and the image editing information is a semantic segmentation map edited by the user; for a fashion editing task, the original image is a person image and the image editing information is a sketch edited by the user.
Further, the convolutional neural network is:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein the two feature extraction networks extract features from the original image and the image editing information respectively; each feature extraction network comprises a downsampling operation and two residual networks, each downsampling operation comprises one convolution layer, a data normalization process, and an activation function, and each residual network comprises two convolution layers, two data normalization processes, and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers.
Wherein the appearance stream transformation matrix comprises: the coordinate transformation matrix rearranges pixels in the original image and is used for bending and transforming the original image; the pixel deviation matrix compensates the pixels after the coordinate transformation and is used for generating editing information which is not in the original image.
Further, each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which can be regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder portion of the FlowNetCor network is first divided by three convolutional layersFeatures of the original image and the image editing information are extracted respectively, then image blocks in the two features are traversed to perform correlation calculation, and center coordinates are (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
Wherein, the attention weight matrix is:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
It should be noted that, the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein in detail.
The processor 1 and the memory 2 may be carried on any device with computing functions, such as a computer, a single-chip microcomputer, or a microcontroller; in specific implementation the execution bodies are not limited and are selected according to the needs of the practical application.
Data signals are transmitted between the memory 2 and the processor 1 via the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, the embodiment of the present invention also provides a computer readable storage medium, where the storage medium includes a stored program, and when the program runs, the device where the storage medium is controlled to execute the method steps in the above embodiment.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the readable storage medium descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions which, when loaded and executed on a computer, produce in whole or in part the flows or functions according to the embodiments of the invention.
The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. Computer readable storage media can be any available media that can be accessed by a computer or data storage devices, such as servers, data centers, etc., that contain an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, or the like.
References
[1] Ge Y, Song Y, Zhang R, et al. Parser-free virtual try-on via distilling appearance flows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 8485-8493.
[2] Huang S, Xiong H, Cheng Z Q, et al. Generating person images with appearance-aware pose stylizer[J]. arXiv preprint arXiv:2007.09077, 2020.
[3] Han X, Hu X, Huang W, et al. ClothFlow: A flow-based model for clothed person generation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10471-10480.
[4] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[5] Ilg E, Mayer N, Saikia T, et al. FlowNet 2.0: Evolution of optical flow estimation with deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2462-2470.
[6] Alom M Z, Hasan M, Yakopcic C, et al. Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation[J]. arXiv preprint arXiv:1802.06955, 2018.
The embodiments of the invention do not limit the models of the devices involved, as long as the devices can complete the above functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (4)
1. A method of image editing based on a self-attention mechanism, the method comprising:
extracting characteristic information of an original image and image editing information by using a convolutional neural network and generating a multi-level characteristic information pair;
generating a multi-level appearance transformation matrix by estimating transformation and mapping relation between characteristic information pairs, and converting or bending original images with different sizes by using the appearance transformation matrix to generate a series of rough image editing results with different sizes;
extracting features of image editing information by using a cyclic convolutional neural network, rendering and refining coarse image editing results of different levels, generating refined image editing results of different levels, and predicting masks corresponding to target images;
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing channel image blocks and spatial image blocks selected from each to calculate the attention weight matrix of the current level;
point-multiplying the attention weight matrix with the features of the image editing information obtained from the previous level to generate the features of the current level of image editing information, and decoding these features through a convolutional neural network until the final editing image is generated;
wherein, the original image and the image editing information are:
for a virtual fitting task, the original image is a character image, and the image editing information is a clothing picture to be replaced; for a gesture-guided character image editing task, the original image is a character image, and the image editing information is a target human gesture; for a face editing task, the original image is a face image, and the image editing information is a semantic segmentation map edited by a user; for a fashion editing task, the original image is a person image, and the image editing information is a sketch edited by a user;
the convolutional neural network is as follows:
constructing two multi-scale feature extraction networks based on a ResNet architecture, wherein each feature extraction network respectively extracts features from an original image and image editing information, each feature extraction network comprises a downsampling operation and two residual error networks, each downsampling operation comprises a layer of convolution, a data normalization process and an activation function, and each residual error network comprises two layers of convolution, two times of data normalization process and two activation functions;
the two multi-scale feature extraction networks respectively generate three feature matrices with 256 channels and different sizes, forming the multi-level feature information pairs {{c_1, p_1}, {c_2, p_2}, {c_3, p_3}}, c_i, p_i ∈ R^{H×W×C}, wherein c_i represents the i-th level feature information extracted from the original image, p_i represents the i-th level feature information extracted from the image editing information, H, W, and C respectively represent the height, width, and number of channels of the view features, and R is the set of real numbers;
the appearance flow transformation matrix comprises: the system comprises a coordinate transformation matrix and a pixel deviation matrix, wherein the coordinate transformation matrix rearranges pixels in an original image and is used for bending and transforming the original image; the pixel deviation matrix compensates pixels after coordinate transformation and is used for generating editing information which is not in an original image;
the attention weight matrix is as follows:
extracting features of the refined image editing result and of the image editing information through an encoder, and traversing the corresponding image blocks of each, the kernel vector corresponding to a coordinate (x, y) being:
k(x, y) = M(f_s(x, y), f_t(x, y))
wherein f_s and f_t represent the features of the refined image editing result and of the image editing information respectively, and f_s(x, y) and f_t(x, y) represent their feature vectors at coordinates (x, y); M represents a fully connected layer that adopts a softmax layer as the activation function and outputs a one-dimensional vector representing the importance of each point in the image block at the current coordinates, i.e., the kernel vector; the kernel vectors of all coordinates are spliced to obtain the current attention weight matrix;
and carrying out dot multiplication and average pooling on the attention weight matrix and the characteristics of the image editing information obtained from the previous level, and generating the characteristics of the image editing information of the current level for subsequent decoding.
2. The image editing method based on a self-attention mechanism according to claim 1, wherein each level of the appearance flow transformation estimation network is formed by stacking one FlowNetSimple network and two FlowNetCorr networks, each of which is regarded as an encoder-decoder architecture;
the encoder part of the FlowNetSimple network stacks the original image and the image editing information together along the channel dimension and extracts features using a series of nine convolution layers, six of which have a stride of 2, each followed by a nonlinear ReLU activation function;
the encoder part of the FlowNet Cor network extracts the features of the original image and the image editing information through three convolution layers respectively, and then traverses the image blocks in the two features to perform correlation calculation, wherein the center coordinate is (x 1 ,x 2 ) The correlation calculation formula of the image block of (a) is as follows:
wherein f 1 And f 2 Features representing the original image and the image editing information respectively, k represents the size of the image block, and the center coordinate is obtained by calculating the sum of dot products of two feature vectors at different positions in the current image block to obtain the center coordinate (x) 1 ,x 2 ) Is used for subsequent decoding;
in the appearance flow estimation network formed by stacking one FlowNetSimple network and two FlowNetCorr networks, the 7×7 and 5×5 convolution kernels in the encoder module are all replaced with multi-layer 3×3 convolution kernels to increase resolution for small displacements.
3. An image editing apparatus based on a self-attention mechanism, the apparatus comprising: a processor and a memory, wherein the memory stores program instructions and the processor calls the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-2.
4. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of any of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210715523.9A CN115082295B (en) | 2022-06-23 | 2022-06-23 | Image editing method and device based on self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115082295A CN115082295A (en) | 2022-09-20 |
CN115082295B true CN115082295B (en) | 2024-04-02 |
Family
ID=83253829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210715523.9A Active CN115082295B (en) | 2022-06-23 | 2022-06-23 | Image editing method and device based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082295B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113570685A (en) * | 2021-01-27 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Image processing method and device, electronic device and storage medium |
CN114639161A (en) * | 2022-02-21 | 2022-06-17 | 深圳市海清视讯科技有限公司 | Training method of multitask model and virtual fitting method of clothes |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018128741A1 (en) * | 2017-01-06 | 2018-07-12 | Board Of Regents, The University Of Texas System | Segmenting generic foreground objects in images and videos |
Non-Patent Citations (5)

Title |
---|
"FlowNet 2.0: Evolution of optical flow estimation with deep networks"; Eddy Ilg et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017-11-09; pp. 2462-2470 * |
"FP-VTON: Feature-preserving virtual try-on network based on an attention mechanism" (FP-VTON:基于注意力机制的特征保持虚拟试衣网络); 谭泽霖 et al.; https://kns.cnki.net/kcms/detail/11.2127.TP.20210706.1000.004.html; 2021-07-06; pp. 1-16 * |
"ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on"; Gaurav Kuppa et al.; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops; 2021-12-31; pp. 191-200 * |
"Research on image completion algorithms based on an attention mechanism" (基于注意力机制的图像补全算法研究); 张晓峰; China Masters' Theses Full-text Database (Information Science and Technology); 2022-03-15 * |
"Single-image rain removal based on a multi-channel multi-scale convolutional neural network" (基于多通道多尺度卷积神经网络的单幅图像去雨方法); 柳长源, 王琪, 毕晓君; Journal of Electronics & Information Technology; 2020-09-15 (09); pp. 224-231 * |
Also Published As
Publication number | Publication date |
---|---|
CN115082295A (en) | 2022-09-20 |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant