US20220028031A1 - Image processing method and apparatus, device, and storage medium - Google Patents
- Publication number
- US20220028031A1
- Authority
- US
- United States
- Prior art keywords
- image
- encoding
- expression
- input image
- encoding result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T3/0012—Context preserving transformation, e.g. by using an importance map
- G06T3/04
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06K9/00268
- G06K9/00302
- G06K9/00979
- G06K9/6256
- G06N3/045—Combinations of networks
- G06N3/0454
- G06N3/047—Probabilistic or stochastic networks
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06T11/00—2D [Two Dimensional] image generation
- G06T9/00—Image coding
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V10/95—Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
- G06V40/168—Feature extraction; Face representation
- G06V40/174—Facial expression recognition
Definitions
- the present disclosure relates to the field of computer vision technologies in artificial intelligence technologies, and in particular, to an image processing method and apparatus, a device, and a storage medium.
- Facial expression editing is to adjust an expression in a face image to obtain another image. For example, if the expression in an original image is smiling, the expression in the target image obtained after the facial expression editing may be crying.
- In the related technology, the expression transformation capability may be limited.
- Embodiments of the present disclosure provide an image processing method and apparatus, a device, and a storage medium, which can generate an output image with a large expression difference from an input image, thereby improving an expression transformation capability.
- the technical solutions are as follows.
- the present disclosure provides an image processing method, applied to a computer device, and the method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- the present disclosure provides an image processing apparatus, and the apparatus includes a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- the present disclosure provides a non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
- the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image
- an output image is generated according to the encoding result of the input image and the encoding result of the expression image
- the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- FIG. 1 is a schematic flowchart of an image processing method according to one or more embodiments of the present disclosure
- FIG. 2 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure
- FIG. 3 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure
- FIG. 4 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure.
- FIG. 5 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure.
- FIG. 6 is a schematic structural block diagram of a computer device according to one or more embodiments of the present disclosure.
- AI Artificial intelligence
- AI is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive an environment, acquire knowledge, and use knowledge to obtain an optimal result.
- AI is a comprehensive technology in computer science and attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
- AI is to study the design principles and implementation methods of various intelligent machines, to enable the machines to have the functions of perception, reasoning, and decision-making.
- the AI technology is a comprehensive discipline, and relates to a wide range of fields including both hardware-level technologies and software-level technologies.
- the basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, the big data processing technology, the operating/interaction system, and electromechanical integration.
- AI software technologies include several directions such as the computer vision (CV) technology, the speech processing technology, the natural language processing technology, and machine learning/deep learning.
- CV computer vision
- Computer vision (CV) technology is a science that studies how to enable a machine to “see”, and to be specific, to implement machine vision such as recognition, tracking, measurement, and the like of a target by using a camera and a computer in replacement of human eyes, and to perform further graphic processing by using a computer to generate an image more suitable for human eyes to observe or more suitable for transmission to and detection by an instrument.
- machine vision studies related theories and technologies and attempts to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
- the computer vision technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, and map construction, and further include biometric feature recognition technologies such as common face recognition and fingerprint recognition.
- Machine learning is a multi-field interdisciplinary subject, involving multiple disciplines such as the probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory.
- the machine learning specializes in studying how a computer simulates or implements a human learning behavior to acquire new knowledge or skills, and reorganize an existing knowledge structure to continuously improve its own performance.
- the machine learning is the core of AI, is a basic way to make the computer intelligent, and is applied to various fields of AI.
- Machine learning and deep learning generally involve technologies such as the artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
- the artificial intelligence technology is studied and applied in a plurality of fields such as the smart home, smart wearable device, virtual assistant, smart speaker, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicle, robot, smart medical care, and smart customer service. It is believed that with the development of technologies, the AI technology will be applied to more fields, and play an increasingly important role.
- An expression may be encoded by using spatial transformation, that is, the spatial transformation is performed on an original image to obtain a target image. Because an expression feature relies on spatial transformation to be encoded into the target image, pixel units not appearing in the original image cannot be generated. For example, if there is no teeth in the original image, there will be no teeth in the target image, so that the target image with a large expression difference from the original image cannot be generated, and an expression transformation capability is limited.
- Solutions provided by the embodiments of the present disclosure involve the computer vision technology of artificial intelligence, and provide an image processing method.
- An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
- the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image
- an output image is generated according to the encoding result of the input image and the encoding result of the expression image
- the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving the expression transformation capability.
- steps of the method may be performed by a computer device, which may be any electronic device with processing and storage capabilities, such as a mobile phone, a tablet computer, a game device, a multimedia playback device, an electronic photo frame, a wearable device, and a personal computer (PC), and may also be a server.
- the term “computer device” is employed herein interchangeably with the term “computing device.”
- the steps are performed by a computer device, which, however, does not constitute a limitation.
- FIG. 1 is a flowchart of an image processing method provided by an embodiment of the present disclosure. The method may include the following steps (101 to 104).
- Step 101 Encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
- the input image is a human face image, that is, an image containing a human face.
- Multichannel encoding is performed on the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image.
- the encoding tensor set includes n encoding tensors
- the attention map set includes n attention maps
- n is an integer greater than 1.
- the value of n may be preset according to actual requirements, for example, the value of n is set to 8.
- the input image may be encoded by using a first encoder to obtain the encoding tensor set and the attention map set of the input image.
- the first encoder is configured to extract an image feature of the input image and encode the image feature to obtain the two sets.
- the first encoder is an imaginative encoder.
- An imagination module is embedded in the first encoder, which enables the first encoder to generate a plurality of encoding tensors of the input image.
- a plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining pixel units that are relatively more diversified.
- the attention mechanism is embedded to obtain intuitive understanding of an expression encoded by each subunit through visual observation.
- the attention mechanism is a pixel-based information enhancement mechanism for a specific target in a feature map, modeled in deep learning on the selective attention of human vision; that is, the attention mechanism can enhance target information in the feature map. After the feature map is processed based on the attention mechanism, the target information in the feature map is enhanced.
- the attention-enhanced feature map therefore strengthens pixel-level information of the target.
- the input image is a 256×256×3 image, where 256×256 represents the resolution of the input image, and 3 represents the three RGB channels.
- the obtained encoding tensor set includes eight 256×256×3 encoding tensors
- the obtained attention map set includes eight 256×256×1 attention maps, where the eight encoding tensors are in one-to-one correspondence to the eight attention maps.
- the first encoder may use a U-Net structure.
- the U-Net is an image segmentation model based on Convolutional Neural Network (CNN), including a convolution layer, a max pooling layer (downsampling), a deconvolution layer (upsampling), and a Rectified Linear Unit (ReLU) layer.
- CNN Convolutional Neural Network
- the first encoder may also use other network architectures. This is not limited in the embodiments of the present disclosure.
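The shape bookkeeping of Step 101 in the 256×256×3 example above can be sketched as follows. This is only an illustration of the output shapes; in the real model the values are produced by the learned U-Net-style first encoder, and the variable names are hypothetical:

```python
# Shape bookkeeping for Step 101 with n = 8 (toy illustration only; the
# actual tensors come from the trained first encoder).
H, W, C, n = 256, 256, 3, 8

encoding_tensors = [(H, W, C) for _ in range(n)]  # eight 256x256x3 tensors e_i
attention_maps = [(H, W, 1) for _ in range(n)]    # eight 256x256x1 maps a_i

# The i-th encoding tensor corresponds one-to-one to the i-th attention map.
pairs = list(zip(encoding_tensors, attention_maps))
print(len(pairs))  # 8
print(pairs[0])    # ((256, 256, 3), (256, 256, 1))
```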
- Step 102 Obtain an encoding result of the input image according to the encoding tensor set and the attention map set, where the encoding result of the input image records an identity feature of a human face in the input image.
- the identity feature refers to feature information used for distinguishing faces of different people.
- the encoding result of the input image in addition to the identity feature of the human face in the input image, the encoding result of the input image further records an appearance feature of the human face in the input image, so that a final generated output image has the same identity feature and appearance feature as the input image.
- the appearance feature refers to feature information used for reflecting an external appearance attribute of a human face.
- each encoding tensor is multiplied element-wise by its corresponding attention map to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
- the encoding result E_S(x) of the input image x is expressed as: E_S(x) = {a_i ⊙ e_i | i = 1, 2, . . . , n}, where
- e_i represents an i-th encoding tensor in the encoding tensor set E, a_i represents an i-th attention map in the attention map set, and ⊙ denotes element-wise multiplication, the attention map being broadcast across the channels.
- the encoding result of the input image obtained after the operation includes eight 256×256×3 processed encoding tensors.
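On tiny toy arrays, the multiplication of Step 102 can be sketched in plain Python. Here 2×2 "images" with two channels stand in for the 256×256×3 tensors, and `attend` is a hypothetical helper name, not a function from the disclosure:

```python
# Sketch of Step 102: each encoding tensor e_i is weighted element-wise by
# its attention map a_i, with the single-channel map broadcast over channels.
def attend(e, a):
    # e: H x W x C encoding tensor, a: H x W attention map with values in [0, 1]
    return [[[a[y][x] * e[y][x][c] for c in range(len(e[y][x]))]
             for x in range(len(e[y]))] for y in range(len(e))]

e1 = [[[1.0, 2.0], [3.0, 4.0]],
      [[5.0, 6.0], [7.0, 8.0]]]   # 2x2 "image" with 2 channels
a1 = [[1.0, 0.5],
      [0.0, 1.0]]                 # attention: keep, halve, drop, keep

processed = attend(e1, a1)
print(processed[0][1])  # [1.5, 2.0]  (pixel weighted by 0.5)
print(processed[1][0])  # [0.0, 0.0]  (pixel suppressed by attention 0.0)
```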
- Step 103 Encode an expression image to obtain an encoding result of the expression image, where the encoding result of the expression image records an expression feature of a human face in the expression image.
- the expression image is a face image used for providing an expression feature.
- the expression feature of the human face in the expression image is extracted by encoding the expression image, so that the final generated output image has the same identity feature and appearance feature as the input image, and has the same expression feature as the expression image. That is, the final output image is obtained by transforming the expression of the human face in the expression image into the input image and maintaining the identity feature and appearance feature of the human face in the input image.
- the encoding result of the expression image includes a displacement map set.
- the displacement map set includes n displacement maps, where an i-th displacement map is used for performing spatial transformation processing on an i-th processed encoding tensor.
- the expression image is y
- the encoding result E_T(y) of the expression image y is expressed as: E_T(y) = O = {O_i | i = 1, 2, . . . , n}, where
- O_i represents an i-th displacement map in the displacement map set O.
- the encoding result of the expression image may include eight 256×256×2 displacement maps.
- the i-th 256×256×2 displacement map is used as an example.
- the displacement map includes two 256×256 channels. The element values of a pixel at position (x, y) in the two channels are recorded as x′ and y′, indicating that the pixel at position (x, y) in the i-th processed encoding tensor is moved to (x′, y′).
- the expression image may be encoded by using a second encoder to obtain the encoding result of the expression image.
- the second encoder is configured to extract an image feature of the expression image and encode the image feature to obtain a displacement map set as the encoding result of the expression image.
- the network structures of the second encoder and the first encoder may be the same or different. This is not limited in the embodiments of the present disclosure.
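The displacement semantics above can be sketched on a toy single-channel 2×2 tensor. `apply_displacement` is a hypothetical helper, and the hard pixel move shown here is a simplification: a trained model would use differentiable (e.g. bilinear) sampling so that gradients can flow through the warp:

```python
# Toy application of one displacement map: each position (x, y) stores a
# target position (x', y') to which the pixel of the processed encoding
# tensor is moved. Out-of-range targets are simply dropped in this sketch.
def apply_displacement(tensor, disp):
    H, W = len(tensor), len(tensor[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            xp, yp = disp[y][x]          # (x', y') for the pixel at (x, y)
            if 0 <= xp < W and 0 <= yp < H:
                out[yp][xp] = tensor[y][x]
    return out

t = [[1.0, 2.0],
     [3.0, 4.0]]
# Shift every pixel one column to the right; the rightmost column falls off.
d = [[(1, 0), (2, 0)],
     [(1, 1), (2, 1)]]
print(apply_displacement(t, d))  # [[0.0, 1.0], [0.0, 3.0]]
```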
- Step 104 Generate the output image according to the encoding result of the input image and the encoding result of the expression image, where the output image has the identity feature of the input image and the expression feature of the expression image.
- the identity feature carried in the encoding result of the input image is mixed with the expression feature carried in the encoding result of the expression image, and then the output image is reconstructed, so that the output image has the identity feature of the input image and the expression feature of the expression image.
- the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain n transformed encoding tensors; and the n transformed encoding tensors are decoded to generate the output image.
- Directly splicing the expression image and the processed encoding tensors may cause the identity feature of the expression image to leak into the final output image.
- a final training objective of the second encoder is set to learn a suitable displacement map.
- the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain the transformed encoding tensor, and the transformed encoding tensor is decoded to generate the output image, so that the output image records only the identity feature of the input image, but not the identity feature of the expression image.
- the transformed encoding tensor set F is expressed as: F = {F_i | i = 1, 2, . . . , n}, where F_i is obtained by performing, with the i-th displacement map O_i, spatial transformation on the i-th processed encoding tensor a_i ⊙ e_i.
- the transformed encoding tensor set F includes n transformed encoding tensors, and finally a decoder decodes the n transformed encoding tensors to generate the output image R: R = D_R(F_1, F_2, . . . , F_n), where
- F_i represents an i-th transformed encoding tensor, and
- D_R represents decoding processing
- the step 101 to step 104 may be implemented by a trained or pre-trained facial expression editing model.
- the facial expression editing model is invoked to generate the output image based on the input image and the expression image.
- FIG. 2 is an exemplary schematic diagram of a facial expression editing model.
- the facial expression editing model includes: a first encoder 21 , a second encoder 22 , and a decoder 23 .
- the first encoder 21 is configured to encode an input image x based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image x, and to obtain an encoding result of the input image x according to the encoding tensor set and the attention map set.
- the second encoder 22 is configured to encode an expression image y to obtain an encoding result of the expression image.
- the decoder 23 is configured to generate an output image R according to the encoding result of the input image x and the encoding result of the expression image y.
- the input image x is a girl with a smiling expression
- the expression image y is a boy with a sad expression.
- the final generated output image R has the identity feature of the input image x and the expression feature of the expression image y, that is, the output image R is a girl with a sad expression.
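The overall data flow of FIG. 2 can be sketched with scalar stand-ins. All three modules below are hypothetical placeholders for the trained networks; the arithmetic only mirrors the wiring (encode, attend, warp, decode), not any learned behavior:

```python
# Toy end-to-end sketch of the FIG. 2 pipeline on scalar "images" with n = 2.
def encode_input(x, n):
    # Stand-in for the first encoder E_S: n (encoding tensor, attention map) pairs.
    return [(x * (i + 1), 1.0 / (i + 1)) for i in range(n)]

def encode_expression(y, n):
    # Stand-in for the second encoder E_T: n scalar "displacement maps".
    return [y * 0.1 * i for i in range(n)]

def decode(tensors):
    # Stand-in for the decoder D_R: collapse transformed tensors to one value.
    return sum(tensors) / len(tensors)

def edit_expression(x, y, n=2):
    pairs = encode_input(x, n)
    disps = encode_expression(y, n)
    processed = [e * a for e, a in pairs]                    # a_i * e_i
    transformed = [p + o for p, o in zip(processed, disps)]  # toy "warp" by O_i
    return decode(transformed)                               # output image R

print(edit_expression(3.0, 2.0))  # 3.1
```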
- a video or a dynamic picture including the output image may be generated.
- the expression transformation processing described above may be performed on a plurality of input images to generate a plurality of output images accordingly, and then the plurality of output images are combined into a video or a dynamic picture.
- an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
- the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image
- an output image is generated according to the encoding result of the input image and the encoding result of the expression image
- the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- multichannel encoding is performed on the input image to generate a plurality of encoding tensors of the input image.
- a plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining pixel units that are relatively more diversified.
- the attention mechanism is embedded so that the expression encoded by each subunit can be intuitively understood through visual observation.
- the facial expression editing model described above is a model constructed based on a generative adversarial network.
- a facial expression editing model 30 includes a generator 31 and a discriminator 32 .
- the generator 31 includes a first encoder 21 , a second encoder 22 , and a decoder 23 .
- a training process of the facial expression editing model 30 is as follows.
- each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions
- the generator is configured to generate an output image corresponding to the original image according to the original image and the target image
- the discriminator is configured to determine whether each of the output image corresponding to the original image and the target image is an image generated by the generator.
- the facial expression editing model is constructed based on the generative adversarial network, and the image pair included in the training sample is two images of the same human face with different expressions.
- the generator is configured to perform the face expression transformation processing described above to generate the output image corresponding to the original image, and the discriminator uses adversarial learning to adjust and optimize parameters of the generator, so that the output image corresponding to the original image generated by the generator is as similar as possible to the target image.
- a Least Squares Generative Adversarial Network may be selected as the generative adversarial network.
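As a concrete illustration of the least squares objective, the following sketch computes the standard LSGAN discriminator and generator losses on batches of discriminator scores. The function names and array shapes are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss: push scores on real images toward 1
    # and scores on generated images toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Least-squares generator loss: push discriminator scores on generated
    # images toward 1, so the generator output looks real to the discriminator.
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

A perfectly fooled discriminator scores fakes near 1, driving the generator loss toward 0.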
- a loss function L total of the facial expression editing model is:
- L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O, where
- L_L1 represents a first-order distance loss, that is, a Manhattan distance in the pixel dimension
- L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss, and λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses.
- the weights corresponding to the three losses may be preset according to an actual situation, which is not limited in the embodiments of the present disclosure.
- A represents the attention map set
- σ(a) represents a sigmoid function of a.
- the overlapping penalty loss L O is introduced to encourage the use of different encoding tensors to encode different parts of an image.
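The weighted combination above, together with one plausible form of the overlapping penalty, can be sketched as follows. The exact formula of L_O is not given in this passage, so `overlap_penalty` below is an illustrative assumption: it applies a sigmoid to the attention maps and penalizes pixels that several maps claim at once, which matches the stated goal of encouraging different encoding tensors to encode different parts of the image:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def overlap_penalty(attention_maps):
    # attention_maps: (n, H, W) raw attention logits.
    # Assumed form: at each pixel, everything claimed beyond the single
    # strongest map counts as overlap; zero when maps do not overlap.
    s = sigmoid(attention_maps)
    return np.mean(np.sum(s, axis=0) - np.max(s, axis=0))

def total_loss(l1, lsgan, perceptual, overlap,
               w_lsgan=1.0, w_p=1.0, w_o=1.0):
    # L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O
    # The weights are preset according to the actual situation.
    return l1 + w_lsgan * lsgan + w_p * perceptual + w_o * overlap
```

With a single attention map the assumed penalty is exactly zero, since there is nothing to overlap with.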
- a generative adversarial network in a self-supervised mode is used for training the generator, and additional labeling may not be needed.
- Rules for the facial expression transformation are learned from unlabeled data, which helps to reduce the training cost of the model and improve the training efficiency of the model.
- FIG. 4 is a block diagram of an image processing apparatus provided by an embodiment of the present disclosure.
- the apparatus has a function of implementing the method example. The function may be implemented by using hardware or may be implemented by hardware executing corresponding software.
- the apparatus may be a computer device or may be disposed in a computer device.
- the apparatus 400 may include: a first encoding module 410 , a second encoding module 420 , and an image generating module 430 .
- the first encoding module 410 is configured to encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
- the encoding tensor set includes n encoding tensors
- the attention map set includes n attention maps
- n is an integer greater than 1.
- the first encoding module 410 is further configured to obtain an encoding result of the input image according to the encoding tensor set and the attention map set.
- the encoding result of the input image records an identity feature of a human face in the input image.
- the second encoding module 420 is configured to encode an expression image to obtain an encoding result of the expression image.
- the encoding result of the expression image records an expression feature of a human face in the expression image.
- the image generating module 430 is configured to generate an output image according to the encoding result of the input image and the encoding result of the expression image.
- the output image has the identity feature of the input image and the expression feature of the expression image.
- the first encoding module 410 is configured to multiply, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
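The element-wise combination performed by the first encoding module can be sketched as follows; the array shapes and function name are illustrative assumptions:

```python
import numpy as np

def apply_attention(encoding_tensors, attention_maps):
    # encoding_tensors: (n, H, W, C); attention_maps: (n, H, W, 1).
    # Each encoding tensor is multiplied element-wise by its corresponding
    # attention map; the single attention channel broadcasts over the C
    # image channels, yielding the n processed encoding tensors.
    assert encoding_tensors.shape[0] == attention_maps.shape[0]
    return encoding_tensors * attention_maps
```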
- the encoding result of the expression image includes n displacement maps.
- the image generating module 430 is configured to: perform spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and decode the n transformed encoding tensors to generate the output image.
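A minimal sketch of the per-tensor spatial transformation, assuming a displacement map that stores a (dy, dx) offset per pixel and using nearest-neighbor sampling (the disclosure does not specify the interpolation scheme, so that choice is an assumption):

```python
import numpy as np

def warp(tensor, displacement):
    # tensor: (H, W, C) processed encoding tensor.
    # displacement: (H, W, 2) per-pixel (dy, dx) offsets.
    # Each output pixel samples the input at its displaced location,
    # clipped at the image border.
    h, w, _ = tensor.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip((ys + displacement[..., 0]).round().astype(int), 0, h - 1)
    src_x = np.clip((xs + displacement[..., 1]).round().astype(int), 0, w - 1)
    return tensor[src_y, src_x]
```

Warping each of the n processed encoding tensors with its displacement map yields the n transformed encoding tensors that the decoder then turns into the output image.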
- the apparatus 400 further includes: a model invoking module 440 , configured to invoke a facial expression editing model, the facial expression editing model including: a first encoder, a second encoder, and a decoder; where the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set; the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
- the facial expression editing model is a model constructed based on a generative adversarial network.
- the facial expression editing model includes a generator and a discriminator.
- the generator includes the first encoder, the second encoder, and the decoder.
- a training process of the facial expression editing model includes: obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether each of the output image corresponding to the original image and the target image is an image generated by the generator; and using the training sample to train the facial expression editing model.
- a loss function L total of the facial expression editing model is:
- L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O, where
- L_L1 represents a first-order distance loss
- L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss
- λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses
- A represents the attention map set
- σ(a) represents a sigmoid function of a.
- the apparatus 400 further includes: an image processing module 450 , configured to generate a video or a dynamic picture including the output image.
- an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
- the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image
- an output image is generated according to the encoding result of the input image and the encoding result of the expression image
- the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- when the apparatus provided in the embodiments implements its functions, the division of the functional modules is merely used as an example for description.
- the functions may be distributed to different functional modules according to the requirements, that is, the internal structure of the device is divided into different functional modules, to implement all or some of the functions described above.
- the apparatus embodiments and the method embodiments provided in the embodiments are based on the same conception. For the specific implementation process, refer to the method embodiments; details are not described herein again.
- FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
- a computer device 600 includes a Central Processing Unit (CPU) 601 , a system memory 604 including a Random Access Memory (RAM) 602 and a Read-Only Memory (ROM) 603 , and a system bus 605 connecting the system memory 604 and the central processing unit 601 .
- the computer device 600 further includes a basic input/output (I/O) system 606 assisting in transmitting information between components in the computer, and a mass storage device 607 configured to store an operating system 613 , an application program 614 , and another program module 615 .
- the basic input/output system 606 includes a display 608 configured to display information and an input device 609 such as a mouse and a keyboard for a user to input information.
- the display 608 and the input device 609 are both connected to the CPU 601 by an input/output controller 610 connected to the system bus 605 .
- the basic input/output system 606 may further include the input/output controller 610 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, and an electronic stylus. Similarly, the input/output controller 610 further provides output to a display screen, a printer, or another type of output device.
- the mass storage device 607 is connected to the CPU 601 through a mass storage controller (not shown) connected to the system bus 605 .
- the mass storage device 607 and an associated computer readable medium provide non-volatile storage for the computer device 600 . That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
- the computer readable medium may include a computer storage medium and a communication medium.
- the computer storage medium includes volatile and non-volatile media, and removable and non-removable media implemented by using any method or technology used for storing information such as computer readable instructions, data structures, program modules, or other data.
- the computer storage medium includes a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory or another solid-state memory technology, a Compact Disc Read-Only Memory (CD-ROM), or another optical memory, a tape cartridge, a magnetic cassette, a magnetic disk memory, or another magnetic storage device.
- a person skilled in the art may learn that the computer storage medium is not limited to the several types.
- the system memory 604 and the mass storage device 607 may be collectively referred to as a memory.
- the computer device 600 may further be connected, through a network such as the Internet, to a remote computer on the network for running. That is, the computer device 600 may be connected to a network 612 by using a network interface unit 611 connected to the system bus 605 , or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 611 .
- unit in this disclosure may refer to a software unit, a hardware unit, or a combination thereof.
- a software unit (e.g., a computer program) may be developed based on a computer program code.
- a hardware unit may be implemented using processing circuitry and/or memory.
- each unit can be implemented using one or more processors, or one or more processors and memory; likewise, a processor, or a processor and memory, can be used to implement one or more units.
- each unit can be part of an overall unit that includes the functionalities of the unit.
- the memory further includes at least one instruction, at least one program, a code set, or an instruction set.
- the at least one instruction, the at least one program, the code set, or the instruction set is stored in the memory and is configured to be executed by one or more processors to implement the image processing method.
- a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set, when executed by the processor of a terminal, implementing the image processing method.
- the computer readable storage medium may include: a ROM, a RAM, a Solid State Drive (SSD), an optical disc, or the like.
- the RAM may include a Resistance Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
- a computer program product is further provided, the computer program product, when executed by the processor of a terminal, implementing the image processing method.
- “Plurality of” mentioned in the present disclosure means two or more.
- “And/or” describes an association relationship between associated objects and represents that three relationships may exist.
- a and/or B may represent the following three implementations: Only A exists, both A and B exist, and only B exists.
- the character “/” in the present disclosure generally indicates an “or” relationship between the associated objects.
- the step numbers described in the present disclosure merely exemplarily show a performing sequence of the steps. In some other embodiments, the steps may not be performed according to the number sequence. For example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed according to a sequence contrary to the sequence shown in the figure. This is not limited in the embodiments of the present disclosure.
Abstract
An image processing method is provided. The method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
Description
- This application is a continuation application of PCT Patent Application No. PCT/CN2020/117455 filed on Sep. 24, 2020, which claims priority to Chinese Patent Application No. 201911072470.8, entitled “IMAGE PROCESSING METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM” and filed on Nov. 5, 2019, all of which are incorporated herein by reference in entirety.
- The present disclosure relates to the field of computer vision technologies in artificial intelligence technologies, and in particular, to an image processing method and apparatus, a device, and a storage medium.
- Facial expression editing (also referred to as facial expression transformation) is to adjust an expression in a face image to obtain another image. For example, an expression in an original image is smile, and after the facial expression editing, an obtained expression in a target image is crying. However, in solutions provided to implement the facial expression transformation, the expression transformation capability may be limited.
- Embodiments of the present disclosure provide an image processing method and apparatus, a device, and a storage medium, which can generate an output image with a large expression difference from an input image, thereby improving an expression transformation capability. The technical solutions are as follows.
- In one aspect, the present disclosure provides an image processing method, applied to a computer device, and the method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- In another aspect, the present disclosure provides an image processing apparatus, and the apparatus includes a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- In yet another aspect, the present disclosure provides a non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
- The technical solutions provided in the embodiments of the present disclosure may bring the following beneficial effects.
- An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- Other aspects of the present disclosure can be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
- To facilitate a better understanding of technical solutions of certain embodiments of the present disclosure, accompanying drawings are described below. The accompanying drawings are illustrative of certain embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without having to exert creative efforts. When the following descriptions are made with reference to the accompanying drawings, unless otherwise indicated, same numbers in different accompanying drawings may represent same or similar elements. In addition, the accompanying drawings are not necessarily drawn to scale.
- FIG. 1 is a schematic flowchart of an image processing method according to one or more embodiments of the present disclosure;
- FIG. 2 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure;
- FIG. 3 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure;
- FIG. 4 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure;
- FIG. 5 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure; and
- FIG. 6 is a schematic structural block diagram of a computer device according to one or more embodiments of the present disclosure.
- To make objectives, technical solutions, and/or advantages of the present disclosure more comprehensible, certain embodiments of the present disclosure are further elaborated in detail with reference to the accompanying drawings. The embodiments as described are not to be construed as a limitation to the present disclosure. All other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of embodiments of the present disclosure.
- Throughout the description, and when applicable, “some embodiments” or “certain embodiments” describe subsets of all possible embodiments, but it may be understood that the “some embodiments” or “certain embodiments” may be the same subset or different subsets of all the possible embodiments, and can be combined with each other without conflict.
- In certain embodiments, the term “based on” is employed herein interchangeably with the term “according to.”
- Artificial intelligence (AI) is a theory, method, technology, and application system that use a digital computer or a machine controlled by the digital computer to simulate, extend, and expand human intelligence, perceive an environment, acquire knowledge, and use knowledge to obtain an optimal result. In other words, AI is a comprehensive technology in computer science and attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. AI is to study the design principles and implementation methods of various intelligent machines, to enable the machines to have the functions of perception, reasoning, and decision-making.
- The AI technology is a comprehensive discipline, and relates to a wide range of fields including both hardware-level technologies and software-level technologies. The basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, the big data processing technology, the operating/interaction system, and electromechanical integration. AI software technologies include several directions such as the computer vision (CV) technology, the speech processing technology, the natural language processing technology, and machine learning/deep learning.
- Computer vision (CV) technology is a science that studies how to enable a machine to “see”, and to be specific, to implement machine vision such as recognition, tracking, measurement, and the like of a target by using a camera and a computer in replacement of human eyes, and to perform further graphic processing by using a computer to generate an image more suitable for human eyes to observe or more suitable for transmission to and detection by an instrument. As a scientific discipline, computer vision studies related theories and technologies and attempts to establish an artificial intelligence system that can obtain information from images or multi-dimensional data. The computer vision technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, and map construction, and further include biometric feature recognition technologies such as common face recognition and fingerprint recognition.
- Machine learning (ML) is a multi-field interdisciplinary subject, involving multiple disciplines such as the probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. The machine learning specializes in studying how a computer simulates or implements a human learning behavior to acquire new knowledge or skills, and reorganize an existing knowledge structure to continuously improve its own performance. The machine learning is the core of AI, is a basic way to make the computer intelligent, and is applied to various fields of AI. Machine learning and deep learning generally involve technologies such as the artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
- With the research and progress of the artificial intelligence technology, the artificial intelligence technology is studied and applied in a plurality of fields such as the smart home, smart wearable device, virtual assistant, smart speaker, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicle, robot, smart medical care, and smart customer service. It is believed that with the development of technologies, the AI technology will be applied to more fields, and play an increasingly important role.
- An expression may be encoded by using spatial transformation, that is, the spatial transformation is performed on an original image to obtain a target image. Because an expression feature relies on spatial transformation to be encoded into the target image, pixel units not appearing in the original image cannot be generated. For example, if there are no teeth in the original image, there will be no teeth in the target image, so that a target image with a large expression difference from the original image cannot be generated, and the expression transformation capability is limited.
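The limitation can be seen in a toy example: a pure spatial transformation only rearranges pixels that already exist, so no warp can introduce pixel values that are absent from the original image (the 2x2 array below is purely illustrative):

```python
import numpy as np

# A tiny "image" of four distinct pixel values.
image = np.array([[1, 2], [3, 4]])

# Any spatial transformation (here, a flip of both axes) only permutes
# coordinates; the set of output values is drawn from the input values.
warped = image[[1, 0]][:, [1, 0]]
assert set(warped.ravel()) <= set(image.ravel())
```

This is why the embodiments instead generate new pixel units through the attention-based multichannel encoding rather than relying on spatial transformation alone.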
- Solutions provided by the embodiments of the present disclosure involve the computer vision technology of artificial intelligence, and provide an image processing method. An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving the expression transformation capability.
- According to the method provided by the embodiments of the present disclosure, steps of the method may be performed by a computer device, which may be any electronic device with processing and storage capabilities, such as a mobile phone, a tablet computer, a game device, a multimedia playback device, an electronic photo frame, a wearable device, and a personal computer (PC), and may also be a server. In certain embodiments, the term “computer device” is employed herein interchangeably with the term “computing device.” For ease of description, in the following method embodiments, the steps are performed by a computer device, which, however, does not constitute a limitation.
- FIG. 1 is a flowchart of an image processing method provided by an embodiment of the present disclosure. The method may include the following steps (101 to 104).
- Step 101. Encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
- In an embodiment of the present disclosure, the input image is a human face image, that is, an image containing a human face. Multichannel encoding is performed on the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image. The encoding tensor set includes n encoding tensors, the attention map set includes n attention maps, and n is an integer greater than 1. The value of n may be preset according to actual requirements; for example, the value of n may be set to 8.
- The input image may be encoded by using a first encoder to obtain the encoding tensor set and the attention map set of the input image. The first encoder is configured to extract an image feature of the input image and encode the image feature to obtain the two sets. In an embodiment of the present disclosure, the first encoder is an imaginative encoder. An imagination module is embedded in the first encoder, which enables the first encoder to generate a plurality of encoding tensors of the input image. A plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units. In addition, the attention mechanism is embedded in the imagination module, so that the expression encoded by each subunit can be intuitively understood through visual observation.
- The attention mechanism is a pixel-based information enhancement mechanism for a specific target in a feature map, modeled in deep learning on the selective focus of human vision: after the feature map is processed based on the attention mechanism, the target information in the feature map is enhanced at the pixel level.
- Assume that the input image is a 256×256×3 image, where 256×256 represents a resolution of the input image, and 3 represents three channels RGB. When n is 8, after the first encoder encodes the input image, the obtained encoding tensor set includes eight 256×256×3 encoding tensors, and the obtained attention map set includes eight 256×256×1 attention maps, where the eight encoding tensors are in one-to-one correspondence to the eight attention maps.
- In certain embodiments, the first encoder may use a U-Net structure. The U-Net is an image segmentation model based on Convolutional Neural Network (CNN), including a convolution layer, a max pooling layer (downsampling), a deconvolution layer (upsampling), and a Rectified Linear Unit (ReLU) layer. In some other embodiments, the first encoder may also use other network architectures. This is not limited in the embodiments of the present disclosure.
-
Step 102. Obtain an encoding result of the input image according to the encoding tensor set and the attention map set, where the encoding result of the input image records an identity feature of a human face in the input image. - The identity feature refers to feature information used for distinguishing faces of different people. In an embodiment of the present disclosure, in addition to the identity feature of the human face in the input image, the encoding result of the input image further records an appearance feature of the human face in the input image, so that a final generated output image has the same identity feature and appearance feature as the input image. The appearance feature refers to feature information used for reflecting an external appearance attribute of a human face.
- In certain embodiments, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map are multiplied to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
- Assume that the input image is x, the encoding tensor set of the input image is E, the attention map set is A, the number of elements in E and A are both n, and n is an integer greater than 1. The encoding result Es(x) of the input image x is expressed as:
-
E_s(x) = {e_i ⊗ a_i, 1 ≤ i ≤ n}, where - e_i represents the ith encoding tensor in the encoding tensor set E, and a_i represents the ith attention map in the attention map set A.
- Assuming that the encoding tensor set includes eight 256×256×3 encoding tensors, and the attention map set includes eight 256×256×1 attention maps, the encoding result of the input image obtained after the operation includes eight 256×256×3 processed encoding tensors.
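The per-element weighting in this step reduces to an element-wise product with broadcasting over the color channels. A minimal NumPy sketch (the random arrays are hypothetical stand-ins for real encoder outputs, and the shapes are reduced from 256×256 for brevity):

```python
import numpy as np

n, H, W = 8, 32, 32  # reduced from 256x256 for brevity

rng = np.random.default_rng(0)
# Encoding tensor set E: n tensors of shape H x W x 3
E = [rng.random((H, W, 3)) for _ in range(n)]
# Attention map set A: n single-channel maps with values in [0, 1)
A = [rng.random((H, W, 1)) for _ in range(n)]

# Encoding result Es(x) = {e_i ⊗ a_i, 1 <= i <= n}: each attention map
# is broadcast over the three channels of its corresponding tensor.
Es = [e * a for e, a in zip(E, A)]

assert len(Es) == n
assert Es[0].shape == (H, W, 3)
```

With the full-resolution example above, each of the eight 256×256×1 attention maps would broadcast over a 256×256×3 encoding tensor in exactly the same way.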
-
Step 103. Encode an expression image to obtain an encoding result of the expression image, where the encoding result of the expression image records an expression feature of a human face in the expression image. - The expression image is a face image used for providing an expression feature. In an embodiment of the present disclosure, the expression feature of the human face in the expression image is extracted by encoding the expression image, so that the final generated output image has the same identity feature and appearance feature as the input image, and has the same expression feature as the expression image. That is, the final output image is obtained by transferring the expression of the human face in the expression image onto the input image while maintaining the identity feature and appearance feature of the human face in the input image.
- The encoding result of the expression image includes a displacement map set. The displacement map set includes n displacement maps, where an ith displacement map is used for performing spatial transformation processing on an ith processed encoding tensor. Assuming that the expression image is y, the encoding result ET(y) of the expression image y is expressed as:
-
E_T(y) = {O_i, 1 ≤ i ≤ n}, where - O_i represents the ith displacement map in the displacement map set O.
- Exemplarily, the encoding result of the expression image may include eight 256×256×2 displacement maps. The ith 256×256×2 displacement map is used as an example. The displacement map includes two 256×256 displacement maps. Element values of a pixel at position (x, y) in the two 256×256 displacement maps are recorded as x′ and y′, indicating that the pixel at position (x, y) is moved to (x′, y′) in the ith processed encoding tensor.
- The expression image may be encoded by using a second encoder to obtain the encoding result of the expression image. The second encoder is configured to extract an image feature of the expression image and encode the image feature to obtain a displacement map set as the encoding result of the expression image.
- In addition, the network structures of the second encoder and the first encoder may be the same or different. This is not limited in the embodiments of the present disclosure.
-
Step 104. Generate the output image according to the encoding result of the input image and the encoding result of the expression image, where the output image has the identity feature of the input image and the expression feature of the expression image. - The identity feature carried in the encoding result of the input image is mixed with the expression feature carried in the encoding result of the expression image, and then the output image is reconstructed, so that the output image has the identity feature of the input image and the expression feature of the expression image.
- In certain embodiments, for each group of a corresponding processed encoding tensor and a corresponding displacement map, the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain n transformed encoding tensors; and the n transformed encoding tensors are decoded to generate the output image. Directly splicing the expression image and the processed encoding tensors may cause the identity feature of the expression image to leak into the final output image. Considering this, in an embodiment of the present disclosure, a final training objective of the second encoder is set to learn a suitable displacement map. The displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain the transformed encoding tensor, and the transformed encoding tensor is decoded to generate the output image, so that the output image records only the identity feature of the input image, but not the identity feature of the expression image.
- In certain embodiments, the transformed encoding tensor set F is expressed as:
-
F = ST(E_s(x), O), where - ST represents the spatial transformation processing. After the spatial transformation processing, the transformed encoding tensor set F includes n transformed encoding tensors, and finally a decoder decodes the n transformed encoding tensors to generate the output image R:
-
R = D_R({F_i, 1 ≤ i ≤ n}), where - F_i represents the ith transformed encoding tensor, and D_R represents the decoding processing.
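The displacement-map convention described above can be sketched as a simple warp. This is one nearest-neighbour reading of ST, not the patented implementation: the patent does not fix a warping kernel, and learned models typically use a differentiable (e.g. bilinear) sampler instead.

```python
import numpy as np

def spatial_transform(tensor, disp):
    """Warp an (H, W, C) tensor with an (H, W, 2) displacement map.

    disp[y, x] = (x_new, y_new): the pixel at (x, y) is moved to
    (x_new, y_new), rounded to the nearest integer position and
    clipped to the image bounds.
    """
    H, W, _ = tensor.shape
    out = np.zeros_like(tensor)
    for y in range(H):
        for x in range(W):
            xn = int(round(min(max(disp[y, x, 0], 0), W - 1)))
            yn = int(round(min(max(disp[y, x, 1], 0), H - 1)))
            out[yn, xn] = tensor[y, x]
    return out

# An identity displacement map (each pixel mapped to itself)
# leaves the tensor unchanged.
H = W = 8
xs, ys = np.meshgrid(np.arange(W), np.arange(H))
identity = np.stack([xs, ys], axis=-1).astype(float)
t = np.random.default_rng(1).random((H, W, 3))
assert np.allclose(spatial_transform(t, identity), t)
```

The n transformed tensors F_i would then be fed to the decoder D_R; the decoder itself is a learned network and is not sketched here.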
- In an exemplary embodiment, step 101 to step 104 may be implemented by a trained or pre-trained facial expression editing model. The facial expression editing model is invoked to generate the output image based on the input image and the expression image. FIG. 2 is an exemplary schematic diagram of a facial expression editing model. The facial expression editing model includes: a first encoder 21, a second encoder 22, and a decoder 23. The first encoder 21 is configured to encode an input image x based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image x, and to obtain an encoding result of the input image x according to the encoding tensor set and the attention map set. The second encoder 22 is configured to encode an expression image y to obtain an encoding result of the expression image y. The decoder 23 is configured to generate an output image R according to the encoding result of the input image x and the encoding result of the expression image y. For example, as shown in FIG. 2, the input image x is a girl with a smiling expression, and the expression image y is a boy with a sad expression. After the expression transformation processing, the final generated output image R has the identity feature of the input image x and the expression feature of the expression image y, that is, the output image R is a girl with a sad expression. - In certain embodiments, after the expression transformation is performed on the input image to generate the output image, a video or a dynamic picture including the output image may be generated. For example, the expression transformation processing described above may be performed on a plurality of input images to generate a plurality of output images accordingly, and then the plurality of output images are combined into a video or a dynamic picture.
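The composition of the three modules can be sketched end to end. Every network below is a hypothetical stand-in (random encoding tensors, identity displacement maps, mean-pooling decoding); only the data flow of steps 101 to 104 is illustrated, not the patented architecture:

```python
import numpy as np

n = 8
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def first_encoder(x):
    """Stand-in for the first encoder 21: n encoding tensors + n attention logits."""
    H, W, _ = x.shape
    rng = np.random.default_rng(0)
    return ([rng.random((H, W, 3)) for _ in range(n)],
            [rng.standard_normal((H, W, 1)) for _ in range(n)])

def second_encoder(y):
    """Stand-in for the second encoder 22: n identity displacement maps."""
    H, W, _ = y.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    return [np.stack([xs, ys], axis=-1).astype(float)] * n

def decoder(tensors):
    """Stand-in for the decoder 23: averages the n transformed tensors."""
    return np.mean(tensors, axis=0)

def edit_expression(x, y):
    E, A = first_encoder(x)                       # step 101
    Es = [e * sigmoid(a) for e, a in zip(E, A)]   # step 102: e_i ⊗ a_i
    O = second_encoder(y)                         # step 103
    # step 104: ST(Es, O) then decode; identity displacement maps make
    # the warp a no-op in this sketch, so it is skipped.
    F = Es
    return decoder(F)

R = edit_expression(np.zeros((16, 16, 3)), np.zeros((16, 16, 3)))
assert R.shape == (16, 16, 3)
```

In a real model the three stand-ins would be trained networks, and the spatial transformation would apply the learned displacement maps rather than identities.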
- In summary, in the technical solutions provided in the embodiments of the present disclosure, an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image. The encoding result of the input image records an identity feature of a human face in the input image, and the encoding result of the expression image records an expression feature of a human face in the expression image. Because an output image is generated according to these two encoding results, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- In addition, multichannel encoding is performed on the input image to generate a plurality of encoding tensors of the input image. A plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units. Moreover, the attention mechanism is embedded so that the expression encoded by each subunit can be intuitively understood through visual observation.
- In an exemplary embodiment, the facial expression editing model described above is a model constructed based on a generative adversarial network. As shown in
FIG. 3, a facial expression editing model 30 includes a generator 31 and a discriminator 32. The generator 31 includes a first encoder 21, a second encoder 22, and a decoder 23. - A training process of the facial expression editing model 30 is as follows. - 1. Obtain at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face with different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine which of the output image corresponding to the original image and the target image is the image generated by the generator.
- 2. Use the training sample to train the facial expression editing model.
- In an embodiment of the present disclosure, the facial expression editing model is constructed based on the generative adversarial network, and the image pair included in the training sample is two images of the same human face with different expressions. The generator is configured to perform the face expression transformation processing described above to generate the output image corresponding to the original image, and the discriminator uses adversarial learning to adjust and optimize parameters of the generator, so that the output image corresponding to the original image generated by the generator is as similar as possible to the target image. In certain embodiments, a Least Squares Generative Adversarial Network (LSGAN) may be selected as the generative adversarial network.
- In certain embodiments, a loss function Ltotal of the facial expression editing model is:
-
L_total = L_L1 + λ_LSGAN · L_LSGAN + λ_P · L_P + λ_O · L_O, where - L_L1 represents a first-order distance loss, that is, the Manhattan distance in the pixel dimension; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; and λ_LSGAN, λ_P, and λ_O respectively represent the weights corresponding to the three losses. The weights corresponding to the three losses may be preset according to an actual situation, which is not limited in the embodiments of the present disclosure. In the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i. In an embodiment of the present disclosure, to make full use of the channel width, the overlapping penalty loss L_O is introduced to encourage different encoding tensors to encode different parts of an image. - In summary, in the technical solutions provided in the embodiments of the present disclosure, a generative adversarial network in a self-supervised mode is used for training the generator, and additional labeling may not be needed. Rules for the facial expression transformation are learned from unlabeled data, which helps to reduce the training cost of the model and improve the training efficiency of the model.
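Under one reading of the formula above, the total loss and the overlapping penalty can be sketched as follows. The per-pixel reduction in L_O and the weight values are assumptions (the patent leaves the weights to be preset); the individual loss terms are passed in as already-computed scalars.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def overlap_penalty(attention_logits):
    """L_O = sum_i sigma(a_i) - 1, averaged over pixels (assumed reduction).

    Pixels claimed by several attention maps push the per-pixel sum above 1
    and are penalized, encouraging different encoding tensors to cover
    different parts of the image.
    """
    stacked = sigmoid(np.stack(attention_logits, axis=0))  # (n, H, W, 1)
    return float(np.mean(stacked.sum(axis=0) - 1.0))

def total_loss(l_l1, l_lsgan, l_p, l_o,
               w_lsgan=1.0, w_p=1.0, w_o=1.0):  # hypothetical weights
    # L_total = L_L1 + lambda_LSGAN*L_LSGAN + lambda_P*L_P + lambda_O*L_O
    return l_l1 + w_lsgan * l_lsgan + w_p * l_p + w_o * l_o

# All-zero logits: sigmoid = 0.5 each, so 8 maps sum to 4.0 per pixel,
# giving a penalty of 3.0.
maps = [np.zeros((4, 4, 1)) for _ in range(8)]
assert abs(overlap_penalty(maps) - 3.0) < 1e-9
```

In training, l_l1, l_lsgan, and l_p would come from the reconstruction, the discriminator, and a pretrained perceptual network respectively.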
- The following describes apparatus embodiments of the present disclosure, which can be used for performing the method embodiments of the present disclosure. For details not disclosed in the apparatus embodiments of the present disclosure, refer to the method embodiments of the present disclosure.
-
FIG. 4 is a block diagram of an image processing apparatus provided by an embodiment of the present disclosure. The apparatus has a function of implementing the method example. The function may be implemented by using hardware or may be implemented by hardware executing corresponding software. The apparatus may be a computer device or may be disposed in a computer device. The apparatus 400 may include: a first encoding module 410, a second encoding module 420, and an image generating module 430. - The
first encoding module 410 is configured to encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image. The encoding tensor set includes n encoding tensors, the attention map set includes n attention maps, and n is an integer greater than 1. - The
first encoding module 410 is further configured to obtain an encoding result of the input image according to the encoding tensor set and the attention map set. The encoding result of the input image records an identity feature of a human face in the input image. - The
second encoding module 420 is configured to encode an expression image to obtain an encoding result of the expression image. The encoding result of the expression image records an expression feature of a human face in the expression image. - The
image generating module 430 is configured to generate an output image according to the encoding result of the input image and the encoding result of the expression image. The output image has the identity feature of the input image and the expression feature of the expression image. - In an exemplary embodiment, the
first encoding module 410 is configured to multiply, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors. - In an exemplary embodiment, the encoding result of the expression image includes n displacement maps.
- The
image generating module 430 is configured to: perform spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and decode the n transformed encoding tensors to generate the output image. - In an exemplary embodiment, as shown in
FIG. 5, the apparatus 400 further includes: a model invoking module 440, configured to invoke a facial expression editing model, the facial expression editing model including: a first encoder, a second encoder, and a decoder; where the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set; the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
- A training process of the facial expression editing model includes: obtaining at least one training sample, each of the training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are the image generated by the generator; and using the training sample to train the facial expression editing model.
- In an exemplary embodiment, a loss function Ltotal of the facial expression editing model is:
-
L_total = L_L1 + λ_LSGAN · L_LSGAN + λ_P · L_P + λ_O · L_O, where - L_L1 represents a first-order distance loss, L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss, λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses, and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i. - In an exemplary embodiment, as shown in
FIG. 5, the apparatus 400 further includes: an image processing module 450, configured to generate a video or a dynamic picture including the output image. - In summary, in the technical solutions provided in the embodiments of the present disclosure, an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image. The encoding result of the input image records an identity feature of a human face in the input image, and the encoding result of the expression image records an expression feature of a human face in the expression image. Because an output image is generated according to these two encoding results, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
- When the apparatus provided in the embodiments implements its functions, the division into the functional modules described above is merely an example for illustration. In a practical implementation, the functions may be allocated to different functional modules according to requirements; that is, the internal structure of the device may be divided into different functional modules to implement all or some of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the embodiments belong to the same conception. For the specific implementation process, refer to the method embodiments; details are not described herein again.
-
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. - A
computer device 600 includes a Central Processing Unit (CPU) 601, a system memory 604 including a Random Access Memory (RAM) 602 and a Read-Only Memory (ROM) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601. The computer device 600 further includes a basic input/output (I/O) system 606 assisting in transmitting information between components in the computer, and a mass storage device 607 configured to store an operating system 613, an application program 614, and another program module 615. - The basic input/output system 606 includes a display 608 configured to display information and an input device 609 such as a mouse and a keyboard for a user to input information. The display 608 and the input device 609 are both connected to the CPU 601 by an input/output controller 610 connected to the system bus 605. The basic input/output system 606 may further include the input/output controller 610 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, and an electronic stylus. Similarly, the input/output controller 610 further provides an output to a display screen, a printer, or another type of output device. - The
mass storage device 607 is connected to the CPU 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and an associated computer readable medium provide non-volatile storage for the computer device 600. That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive. - Without loss of generality, the computer readable medium may include a computer storage medium and a communication medium. The computer storage medium includes volatile and non-volatile media, and removable and non-removable media implemented by using any method or technology used for storing information such as computer readable instructions, data structures, program modules, or other data. The computer storage medium includes a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory or another solid-state memory technology, a Compact Disc Read-Only Memory (CD-ROM) or another optical memory, a tape cartridge, a magnetic cassette, a magnetic disk memory, or another magnetic storage device. A person skilled in the art may learn that the computer storage medium is not limited to the several types. The
system memory 604 and the mass storage device 607 may be collectively referred to as a memory.
computer device 600 may further be connected, through a network such as the Internet, to a remote computer on the network for running. That is, the computer device 600 may be connected to a network 612 by using a network interface unit 611 connected to the system bus 605, or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 611. - The term unit (and other similar terms such as subunit, module, submodule, etc.) in this disclosure may refer to a software unit, a hardware unit, or a combination thereof. A software unit (e.g., computer program) may be developed using a computer programming language. A hardware unit may be implemented using processing circuitry and/or memory. Each unit can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more units. Moreover, each unit can be part of an overall unit that includes the functionalities of the unit.
- The memory further includes at least one instruction, at least one program, a code set, or an instruction set. The at least one instruction, the at least one program, the code set, or the instruction set is stored in the memory and is configured to be executed by one or more processors to implement the image processing method.
- In an exemplary embodiment, a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set, when executed by the processor of a terminal, implementing the image processing method.
- In certain embodiments, the computer readable storage medium may include: a ROM, a RAM, a Solid State Drive (SSD), an optical disc, or the like. The RAM may include a Resistance Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
- In an exemplary embodiment, a computer program product is further provided, the computer program product, when executed by the processor of a terminal, implementing the image processing method.
- “Plurality of” mentioned in the present disclosure means two or more. “And/or” describes an association relationship between associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists. The character “/” in the present disclosure generally indicates an “or” relationship between the associated objects. In addition, the step numbers described in the present disclosure merely exemplarily show a performing sequence of the steps. In some other embodiments, the steps may not be performed according to the number sequence. For example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed according to a sequence contrary to the sequence shown in the figure. This is not limited in the embodiments of the present disclosure.
- The descriptions are merely exemplary embodiments of the present disclosure, but are not intended to limit the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure all fall within the protection scope of the present disclosure.
Claims (20)
1. An image processing method, applied to a computer device, the method comprising:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
2. The method according to claim 1 , wherein obtaining the encoding result of the input image comprises:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
3. The method according to claim 2 , wherein the encoding result of the expression image includes n displacement maps, and generating the output image comprises:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
4. The method according to claim 1 , further comprising:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
5. The method according to claim 4 , wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model comprises:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face with different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine which of the output image corresponding to the original image and the target image is the image generated by the generator; and
using the training sample to train the facial expression editing model.
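One adversarial step on an (original, target) pair, under the least-squares GAN objective named in claim 6, can be sketched as follows. The toy scalar discriminator and the real/fake targets of 1 and 0 are illustrative assumptions, not the patented training procedure.

```python
import numpy as np

def discriminator(img, w):
    # Toy scalar discriminator score in (-1, 1); illustrative only.
    return float(np.tanh(np.sum(img * w)))

def adversarial_step(original, target, generator, w):
    """One least-squares GAN step on an (original, target) image pair:
    the generator maps the original toward the target's expression, and
    the discriminator is trained to score real targets as 1 and
    generated images as 0."""
    fake = generator(original, target)
    d_real = discriminator(target, w)
    d_fake = discriminator(fake, w)
    d_loss = (d_real - 1.0) ** 2 + d_fake ** 2  # discriminator objective
    g_loss = (d_fake - 1.0) ** 2                # generator objective
    return d_loss, g_loss
```

In full training these two losses would be minimized alternately with respect to the discriminator and generator parameters.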
6. The method according to claim 4 , wherein a loss function L_total of the facial expression editing model is:

L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O, where

L_L1 represents a first-order distance loss; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses; and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a) − 1, a ∈ A, A represents the attention map set, and σ(a) represents a sigmoid function of a.
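The weighted combination of the four losses in claim 6 can be sketched numerically as below. The weight values are assumptions, the LSGAN and perceptual terms are taken as precomputed scalars, and the overlapping penalty is implemented as a per-pixel average of (sum over maps of σ(a)) − 1, which is one plausible reading of the claim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def total_loss(output, target, lsgan_loss, perceptual_loss, attention_maps,
               lam_lsgan=1.0, lam_p=10.0, lam_o=0.1):
    """L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O (weights assumed)."""
    l1 = np.mean(np.abs(output - target))  # first-order (L1) distance loss
    # Overlapping penalty: the sigmoid-activated attention maps should
    # roughly partition each pixel, so their sum minus 1 is penalized.
    overlap = np.mean(np.sum([sigmoid(a) for a in attention_maps], axis=0) - 1.0)
    return l1 + lam_lsgan * lsgan_loss + lam_p * perceptual_loss + lam_o * overlap
```

With a perfect reconstruction, zero adversarial and perceptual terms, and two all-zero attention maps (each σ(0) = 0.5, summing to 1 per pixel), every term vanishes.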
7. The method according to claim 1 , further comprising:
generating a video or a dynamic picture including the output image.
8. An image processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
9. The apparatus according to claim 8 , wherein obtaining the encoding result of the input image includes:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
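The per-group multiplication recited in claim 9 is a straightforward element-wise product with broadcasting; a minimal sketch (function name assumed) follows.

```python
import numpy as np

def apply_attention(encoding_tensors, attention_maps):
    """Multiply each (C, H, W) encoding tensor by its (1, H, W) attention
    map; the single-channel map broadcasts across the C channels."""
    return [e * a for e, a in zip(encoding_tensors, attention_maps)]
```

The n results are the "processed encoding tensors" forming the encoding result of the input image.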
10. The apparatus according to claim 9 , wherein the encoding result of the expression image includes n displacement maps; and generating the output image includes:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
11. The apparatus according to claim 8 , wherein the processor is further configured to execute the computer program instructions and perform:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
12. The apparatus according to claim 11 , wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model includes:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face having different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and
using the training sample to train the facial expression editing model.
13. The apparatus according to claim 8 , wherein the processor is further configured to execute the computer program instructions and perform:
generating a video or a dynamic picture including the output image.
14. The apparatus according to claim 8 , wherein a loss function L_total of the facial expression editing model is:

L_total = L_L1 + λ_LSGAN·L_LSGAN + λ_P·L_P + λ_O·L_O, where

L_L1 represents a first-order distance loss; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses; and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a) − 1, a ∈ A, A represents the attention map set, and σ(a) represents a sigmoid function of a.
15. A non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
16. The non-transitory computer-readable storage medium according to claim 15 , wherein obtaining the encoding result of the input image includes:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
17. The non-transitory computer-readable storage medium according to claim 16 , wherein the encoding result of the expression image includes n displacement maps, and generating the output image includes:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
18. The non-transitory computer-readable storage medium according to claim 15 , wherein the computer program instructions are executable by the at least one processor to further perform:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
19. The non-transitory computer-readable storage medium according to claim 15 , wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model includes:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face having different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and
using the training sample to train the facial expression editing model.
20. The non-transitory computer-readable storage medium according to claim 15 , wherein the computer program instructions are executable by the at least one processor to further perform:
generating a video or a dynamic picture including the output image.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911072470.8 | 2019-11-05 | ||
CN201911072470.8A CN110796111B (en) | 2019-11-05 | 2019-11-05 | Image processing method, device, equipment and storage medium |
PCT/CN2020/117455 WO2021088556A1 (en) | 2019-11-05 | 2020-09-24 | Image processing method and apparatus, device, and storage medium |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/117455 Continuation WO2021088556A1 (en) | 2019-11-05 | 2020-09-24 | Image processing method and apparatus, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220028031A1 (en) | 2022-01-27 |
Family
ID=69442779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/497,883 Pending US20220028031A1 (en) | 2019-11-05 | 2021-10-08 | Image processing method and apparatus, device, and storage medium |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220028031A1 (en) |
CN (1) | CN110796111B (en) |
WO (1) | WO2021088556A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113706388A (en) * | 2021-09-24 | 2021-11-26 | 上海壁仞智能科技有限公司 | Image super-resolution reconstruction method and device |
US11526971B2 (en) * | 2020-06-01 | 2022-12-13 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method for translating image and method for training image translation model |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796111B (en) * | 2019-11-05 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and storage medium |
CN111401216B (en) * | 2020-03-12 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Image processing method, model training method, image processing device, model training device, computer equipment and storage medium |
CN113538604B (en) * | 2020-04-21 | 2024-03-19 | 中移(成都)信息通信科技有限公司 | Image generation method, device, equipment and medium |
CN111553267B (en) * | 2020-04-27 | 2023-12-01 | 腾讯科技(深圳)有限公司 | Image processing method, image processing model training method and device |
CN111783603A (en) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Training method for generating confrontation network, image face changing method and video face changing method and device |
CN113507608A (en) * | 2021-06-09 | 2021-10-15 | 北京三快在线科技有限公司 | Image coding method and device and electronic equipment |
CN113723480B (en) * | 2021-08-18 | 2024-03-05 | 北京达佳互联信息技术有限公司 | Image processing method, device, electronic equipment and storage medium |
CN114565941A (en) * | 2021-08-24 | 2022-05-31 | 商汤国际私人有限公司 | Texture generation method, device, equipment and computer readable storage medium |
CN114866345B (en) * | 2022-07-05 | 2022-12-09 | 支付宝(杭州)信息技术有限公司 | Processing method, device and equipment for biological recognition |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200294294A1 (en) * | 2019-03-15 | 2020-09-17 | NeoCortext Inc. | Face-swapping apparatus and method |
US10825219B2 (en) * | 2018-03-22 | 2020-11-03 | Northeastern University | Segmentation guided image generation with adversarial networks |
US20210056348A1 (en) * | 2019-08-19 | 2021-02-25 | Neon Evolution Inc. | Methods and systems for image and voice processing |
US11074711B1 (en) * | 2018-06-15 | 2021-07-27 | Bertec Corporation | System for estimating a pose of one or more persons in a scene |
US20220222897A1 (en) * | 2019-06-28 | 2022-07-14 | Microsoft Technology Licensing, Llc | Portrait editing and synthesis |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9610510B2 (en) * | 2015-07-21 | 2017-04-04 | Disney Enterprises, Inc. | Sensing and managing vehicle behavior based on occupant awareness |
CN108921061B (en) * | 2018-06-20 | 2022-08-26 | 腾讯科技(深圳)有限公司 | Expression recognition method, device and equipment |
CN109190472B (en) * | 2018-07-28 | 2021-09-14 | 天津大学 | Pedestrian attribute identification method based on image and attribute combined guidance |
CN109325422A (en) * | 2018-08-28 | 2019-02-12 | 深圳壹账通智能科技有限公司 | Expression recognition method, device, terminal and computer readable storage medium |
CN109508689B (en) * | 2018-11-28 | 2023-01-03 | 中山大学 | Face recognition method for strengthening confrontation |
CN109934116B (en) * | 2019-02-19 | 2020-11-24 | 华南理工大学 | Standard face generation method based on confrontation generation mechanism and attention generation mechanism |
CN109934767A (en) * | 2019-03-06 | 2019-06-25 | 中南大学 | A kind of human face expression conversion method of identity-based and expressive features conversion |
CN110008846B (en) * | 2019-03-13 | 2022-08-30 | 南京邮电大学 | Image processing method |
CN110222588B (en) * | 2019-05-15 | 2020-03-27 | 合肥进毅智能技术有限公司 | Human face sketch image aging synthesis method, device and storage medium |
CN110796111B (en) * | 2019-11-05 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and storage medium |
- 2019-11-05: CN application CN201911072470.8A filed (granted as CN110796111B, active)
- 2020-09-24: PCT application PCT/CN2020/117455 filed (published as WO2021088556A1)
- 2021-10-08: US application US17/497,883 filed (published as US20220028031A1, pending)
Non-Patent Citations (2)
Title |
---|
Chen, Mingyi, et al. "Double encoder conditional GAN for facial expression synthesis." 2018 37th Chinese Control Conference (CCC). IEEE, 2018. (Year: 2018) * |
Zhang, Gang, et al. "Generative adversarial network with spatial attention for face attribute editing." Proceedings of the European conference on computer vision (ECCV). 2018. (Year: 2018) * |
Also Published As
Publication number | Publication date |
---|---|
WO2021088556A1 (en) | 2021-05-14 |
CN110796111A (en) | 2020-02-14 |
CN110796111B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220028031A1 (en) | Image processing method and apparatus, device, and storage medium | |
CN112215927B (en) | Face video synthesis method, device, equipment and medium | |
US20210224601A1 (en) | Video sequence selection method, computer device, and storage medium | |
US20230049533A1 (en) | Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product | |
CN111401216B (en) | Image processing method, model training method, image processing device, model training device, computer equipment and storage medium | |
US20230082605A1 (en) | Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium | |
US20230072627A1 (en) | Gaze correction method and apparatus for face image, device, computer-readable storage medium, and computer program product face image | |
CN111553267B (en) | Image processing method, image processing model training method and device | |
US20210192701A1 (en) | Image processing method and apparatus, device, and storage medium | |
CN113761153B (en) | Picture-based question-answering processing method and device, readable medium and electronic equipment | |
CN115565238B (en) | Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product | |
CN111833360B (en) | Image processing method, device, equipment and computer readable storage medium | |
WO2024051480A1 (en) | Image processing method and apparatus, computer device, and storage medium | |
CN116958323A (en) | Image generation method, device, electronic equipment, storage medium and program product | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
US20240046471A1 (en) | Three-dimensional medical image recognition method and apparatus, device, storage medium, and product | |
CN116958324A (en) | Training method, device, equipment and storage medium of image generation model | |
CN110047118B (en) | Video generation method, device, computer equipment and storage medium | |
CN113538254A (en) | Image restoration method and device, electronic equipment and computer readable storage medium | |
CN113011320A (en) | Video processing method and device, electronic equipment and storage medium | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN117540007B (en) | Multi-mode emotion analysis method, system and equipment based on similar mode completion | |
WO2024066549A1 (en) | Data processing method and related device | |
CN112463936A (en) | Visual question answering method and system based on three-dimensional information | |
CN115731101A (en) | Super-resolution image processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SUN, TIANYU;HUANG, HAOZHI;LIU, WEI;SIGNING DATES FROM 20210924 TO 20210926;REEL/FRAME:057745/0578 |
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |