US20220028031A1 - Image processing method and apparatus, device, and storage medium - Google Patents

Image processing method and apparatus, device, and storage medium Download PDF

Info

Publication number
US20220028031A1
Authority
US
United States
Prior art keywords
image
encoding
expression
input image
encoding result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/497,883
Inventor
Tianyu Sun
Haozhi Huang
Wei Liu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED reassignment TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HUANG, Haozhi, LIU, WEI, SUN, TIANYU
Publication of US20220028031A1 publication Critical patent/US20220028031A1/en
Pending legal-status Critical Current

Classifications

    • G06T 3/0012 Geometric image transformation in the plane of the image; context preserving transformation, e.g. by using an importance map
    • G06T 3/04
    • G06T 9/00 Image coding
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06N 3/02 Neural networks
    • G06N 3/045 Combinations of networks; G06N 3/0454
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06K 9/00268; G06K 9/00302; G06K 9/00979; G06K 9/6256
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/95 Hardware or software architectures for image or video understanding structured as a network, e.g. client-server architectures
    • G06V 40/168 Human faces: feature extraction; face representation
    • G06V 40/174 Facial expression recognition

Definitions

  • the present disclosure relates to the field of computer vision technologies in artificial intelligence technologies, and in particular, to an image processing method and apparatus, a device, and a storage medium.
  • Facial expression editing is to adjust an expression in a face image to obtain another image. For example, an expression in an original image is a smile, and after the facial expression editing, the expression obtained in a target image is crying.
  • However, in existing solutions for implementing facial expression transformation, the expression transformation capability may be limited.
  • Embodiments of the present disclosure provide an image processing method and apparatus, a device, and a storage medium, which can generate an output image with a large expression difference from an input image, thereby improving an expression transformation capability.
  • the technical solutions are as follows.
  • the present disclosure provides an image processing method, applied to a computer device, and the method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • the present disclosure provides an image processing apparatus, and the apparatus includes a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • the present disclosure provides a non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
  • Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • FIG. 1 is a schematic flowchart of an image processing method according to one or more embodiments of the present disclosure
  • FIG. 2 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure
  • FIG. 3 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure
  • FIG. 4 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure.
  • FIG. 5 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure.
  • FIG. 6 is a schematic structural block diagram of a computer device according to one or more embodiments of the present disclosure.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by the digital computer to simulate, extend, and expand human intelligence, perceive an environment, acquire knowledge, and use knowledge to obtain an optimal result.
  • AI is a comprehensive technology in computer science and attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
  • AI is to study the design principles and implementation methods of various intelligent machines, to enable the machines to have the functions of perception, reasoning, and decision-making.
  • the AI technology is a comprehensive discipline, and relates to a wide range of fields including both hardware-level technologies and software-level technologies.
  • the basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, the big data processing technology, the operating/interaction system, and electromechanical integration.
  • AI software technologies include several directions such as the computer vision (CV) technology, the speech processing technology, the natural language processing technology, and machine learning/deep learning.
  • Computer vision (CV) technology is a science that studies how to enable a machine to “see”, and to be specific, to implement machine vision such as recognition, tracking, measurement, and the like of a target by using a camera and a computer in replacement of human eyes, and to perform further graphic processing by using a computer to generate an image more suitable for human eyes to observe or more suitable for transmission to and detection by an instrument.
  • machine vision studies related theories and technologies and attempts to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • the computer vision technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, and map construction, and further include biometric feature recognition technologies such as common face recognition and fingerprint recognition.
  • Machine learning is a multi-field interdisciplinary subject, involving multiple disciplines such as the probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory.
  • the machine learning specializes in studying how a computer simulates or implements a human learning behavior to acquire new knowledge or skills, and reorganize an existing knowledge structure to continuously improve its own performance.
  • the machine learning is the core of AI, is a basic way to make the computer intelligent, and is applied to various fields of AI.
  • Machine learning and deep learning generally involve technologies such as the artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
  • the artificial intelligence technology is studied and applied in a plurality of fields such as the smart home, smart wearable device, virtual assistant, smart speaker, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicle, robot, smart medical care, and smart customer service. It is believed that with the development of technologies, the AI technology will be applied to more fields, and play an increasingly important role.
  • An expression may be encoded by using spatial transformation, that is, the spatial transformation is performed on an original image to obtain a target image. Because an expression feature relies on spatial transformation to be encoded into the target image, pixel units not appearing in the original image cannot be generated. For example, if there are no teeth in the original image, there will be no teeth in the target image, so that a target image with a large expression difference from the original image cannot be generated, and an expression transformation capability is limited.
  • Solutions provided by the embodiments of the present disclosure involve the computer vision technology of artificial intelligence, and provide an image processing method.
  • An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
  • Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving the expression transformation capability.
  • steps of the method may be performed by a computer device, which may be any electronic device with processing and storage capabilities, such as a mobile phone, a tablet computer, a game device, a multimedia playback device, an electronic photo frame, a wearable device, and a personal computer (PC), and may also be a server.
  • the term “computer device” is employed herein interchangeably with the term “computing device.”
  • the steps are performed by a computer device, which, however, does not constitute a limitation.
  • FIG. 1 is a flowchart of an image processing method provided by an embodiment of the present disclosure. The method may include the following steps ( 101 to 104 ).
  • Step 101 Encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
  • the input image is a human face image, that is, an image containing a human face.
  • Multichannel encoding is performed on the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image.
  • the encoding tensor set includes n encoding tensors
  • the attention map set includes n attention maps
  • n is an integer greater than 1.
  • the value of n may be preset according to actual requirements, for example, the value of n is set to 8.
  • the input image may be encoded by using a first encoder to obtain the encoding tensor set and the attention map set of the input image.
  • the first encoder is configured to extract an image feature of the input image and encode the image feature to obtain the two sets.
  • the first encoder is an imaginative encoder.
  • An imagination module is embedded in the first encoder, which enables the first encoder to generate a plurality of encoding tensors of the input image.
  • a plurality of possible expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units.
  • in addition, the attention mechanism is embedded, so that the expression encoded by each subunit can be intuitively understood through visual observation of the attention maps.
  • the attention mechanism is a pixel-level information enhancement mechanism for a specific target in a feature map, analogous in deep learning to the selective focus of human vision; that is, the attention mechanism can enhance target information in the feature map. After the feature map is processed based on the attention mechanism, the target information in the feature map is enhanced.
  • in other words, the attention-enhanced feature map carries enhanced pixel-level information about the target.
  • the input image is a 256×256×3 image, where 256×256 represents the resolution of the input image, and 3 represents the three RGB channels.
  • the obtained encoding tensor set includes eight 256×256×3 encoding tensors.
  • the obtained attention map set includes eight 256×256×1 attention maps, where the eight encoding tensors are in one-to-one correspondence to the eight attention maps.
  • the first encoder may use a U-Net structure.
  • the U-Net is an image segmentation model based on Convolutional Neural Network (CNN), including a convolution layer, a max pooling layer (downsampling), a deconvolution layer (upsampling), and a Rectified Linear Unit (ReLU) layer.
  • the first encoder may also use other network architectures. This is not limited in the embodiments of the present disclosure.
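  • As an illustration of step 101, the following is a minimal PyTorch sketch of a first encoder that outputs n encoding tensors and n attention maps for a 256×256×3 input image. The class name FirstEncoder, the plain convolutional stack used in place of a full U-Net, the layer widths, and the softmax over the attention maps are assumptions made for this sketch only and are not taken from the patent.

```python
# Minimal sketch (not the patented implementation): a convolutional encoder that,
# given a 256x256x3 face image, produces n encoding tensors (each 3 channels) and
# n attention maps (each 1 channel), mirroring step 101. Names, layer sizes, and
# the softmax normalisation are illustrative assumptions.
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    def __init__(self, n: int = 8):
        super().__init__()
        self.n = n
        # A small conv stack stands in for the U-Net mentioned in the text.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # One head predicts n*3 channels (the n encoding tensors), the other
        # predicts n channels (the n attention maps).
        self.to_tensors = nn.Conv2d(64, n * 3, kernel_size=1)
        self.to_attention = nn.Conv2d(64, n, kernel_size=1)

    def forward(self, x):                       # x: (B, 3, 256, 256)
        h = self.backbone(x)
        e = self.to_tensors(h)                  # (B, n*3, 256, 256)
        b, _, height, width = e.shape
        encoding_tensors = e.view(b, self.n, 3, height, width)
        # Softmax across the n maps encourages different maps to attend to
        # different regions (one plausible choice; the text only says that
        # attention maps are produced).
        attention_maps = torch.softmax(self.to_attention(h), dim=1)  # (B, n, H, W)
        return encoding_tensors, attention_maps

# Example: eight encoding tensors and eight attention maps for one image.
enc = FirstEncoder(n=8)
tensors, maps = enc(torch.randn(1, 3, 256, 256))
print(tensors.shape, maps.shape)  # (1, 8, 3, 256, 256) and (1, 8, 256, 256)
```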
  • Step 102 Obtain an encoding result of the input image according to the encoding tensor set and the attention map set, where the encoding result of the input image records an identity feature of a human face in the input image.
  • the identity feature refers to feature information used for distinguishing faces of different people.
  • in addition to the identity feature of the human face in the input image, the encoding result of the input image further records an appearance feature of the human face in the input image, so that a final generated output image has the same identity feature and appearance feature as the input image.
  • the appearance feature refers to feature information used for reflecting an external appearance attribute of a human face.
  • for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map are multiplied to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
  • the encoding result E_S(x) of the input image x is expressed as: E_S(x) = {e_i ⊙ a_i}, i = 1, . . . , n, where e_i represents an i-th encoding tensor in the encoding tensor set E, a_i represents an i-th attention map in the attention map set, and ⊙ denotes element-wise multiplication.
  • the encoding result of the input image obtained after the operation includes eight 256×256×3 processed encoding tensors.
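  • A minimal sketch of the multiplication in step 102, assuming the tensor shapes used in the previous snippet: each attention map is broadcast over the three colour channels of its encoding tensor to produce the n processed encoding tensors that make up E_S(x).

```python
# Sketch of step 102: multiply each encoding tensor by its corresponding attention
# map (broadcast over the 3 colour channels) to obtain the n processed encoding
# tensors that form the encoding result E_S(x).
import torch

def encode_input(encoding_tensors: torch.Tensor, attention_maps: torch.Tensor) -> torch.Tensor:
    # encoding_tensors: (B, n, 3, H, W); attention_maps: (B, n, H, W)
    a = attention_maps.unsqueeze(2)          # (B, n, 1, H, W), broadcast over channels
    return encoding_tensors * a              # (B, n, 3, H, W): n processed encoding tensors

processed = encode_input(torch.randn(1, 8, 3, 256, 256), torch.rand(1, 8, 256, 256))
print(processed.shape)                       # torch.Size([1, 8, 3, 256, 256])
```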
  • Step 103 Encode an expression image to obtain an encoding result of the expression image, where the encoding result of the expression image records an expression feature of a human face in the expression image.
  • the expression image is a face image used for providing an expression feature.
  • the expression feature of the human face in the expression image is extracted by encoding the expression image, so that the final generated output image has the same identity feature and appearance feature as the input image, and has the same expression feature as the expression image. That is, the final output image is obtained by transforming the expression of the human face in the expression image into the input image and maintaining the identity feature and appearance feature of the human face in the input image.
  • the encoding result of the expression image includes a displacement map set.
  • the displacement map set includes n displacement maps, where an i-th displacement map is used for performing spatial transformation processing on an i-th processed encoding tensor.
  • the expression image is y
  • the encoding result E_T(y) of the expression image y is expressed as: E_T(y) = O = {O_i}, i = 1, . . . , n, where O_i represents an i-th displacement map in the displacement map set O.
  • the encoding result of the expression image may include eight 256×256×2 displacement maps.
  • the i-th 256×256×2 displacement map is used as an example.
  • the displacement map includes two 256×256 channels. The element values of a pixel at position (x, y) in the two channels are recorded as x′ and y′, indicating that the pixel at position (x, y) in the i-th processed encoding tensor is moved to (x′, y′).
  • the expression image may be encoded by using a second encoder to obtain the encoding result of the expression image.
  • the second encoder is configured to extract an image feature of the expression image and encode the image feature to obtain a displacement map set as the encoding result of the expression image.
  • the network structures of the second encoder and the first encoder may be the same or different. This is not limited in the embodiments of the present disclosure.
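  • A minimal sketch of a second encoder for step 103. The text only requires that it maps the expression image to n displacement maps with two channels each; the SecondEncoder name and the tiny convolutional stack are illustrative assumptions.

```python
# Sketch of a second encoder for step 103 (illustrative only): maps an expression
# image to n displacement maps of shape 2 x H x W, i.e. one (x', y') value per pixel.
import torch
import torch.nn as nn

class SecondEncoder(nn.Module):
    def __init__(self, n: int = 8):
        super().__init__()
        self.n = n
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, n * 2, kernel_size=1),   # 2 channels (x', y') per displacement map
        )

    def forward(self, y):                          # y: (B, 3, 256, 256)
        o = self.net(y)                            # (B, n*2, 256, 256)
        b, _, h, w = o.shape
        # A real implementation would constrain these values to valid pixel coordinates.
        return o.view(b, self.n, 2, h, w)          # n displacement maps

displacements = SecondEncoder(n=8)(torch.randn(1, 3, 256, 256))
print(displacements.shape)                         # torch.Size([1, 8, 2, 256, 256])
```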
  • Step 104 Generate the output image according to the encoding result of the input image and the encoding result of the expression image, where the output image has the identity feature of the input image and the expression feature of the expression image.
  • the identity feature carried in the encoding result of the input image is mixed with the expression feature carried in the encoding result of the expression image, and then the output image is reconstructed, so that the output image has the identity feature of the input image and the expression feature of the expression image.
  • the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain n transformed encoding tensors; and the n transformed encoding tensors are decoded to generate the output image.
  • Directly splicing the expression image and the processed encoding tensors may cause the identity feature of the expression image to leak into the final output image.
  • a final training objective of the second encoder is set to learn a suitable displacement map.
  • the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain the transformed encoding tensor, and the transformed encoding tensor is decoded to generate the output image, so that the output image records only the identity feature of the input image, but not the identity feature of the expression image.
  • the transformed encoding tensor set F is expressed as: F = {F_i}, F_i = T(e_i ⊙ a_i, O_i), i = 1, . . . , n, where T(·, ·) denotes the spatial transformation that warps the i-th processed encoding tensor according to the i-th displacement map O_i.
  • the transformed encoding tensor set F includes n transformed encoding tensors, and finally a decoder decodes the n transformed encoding tensors to generate the output image R: R = D_R(F_1, F_2, . . . , F_n), where F_i represents an i-th transformed encoding tensor and D_R represents decoding processing.
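  • A sketch of the spatial transformation and decoding in step 104. Backward warping with torch.nn.functional.grid_sample is one common way to realise a spatial transformation driven by per-pixel target coordinates; the text does not fix the warping direction or the decoder architecture, so both are assumptions here.

```python
# Sketch of step 104: warp each processed encoding tensor with its displacement map
# and decode the warped tensors into the output image R. Backward warping via
# grid_sample is one common realisation of spatial transformation; the decoder
# below is a deliberately tiny stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(tensor: torch.Tensor, displacement: torch.Tensor) -> torch.Tensor:
    # tensor: (B, 3, H, W); displacement: (B, 2, H, W) holding target coordinates (x', y')
    b, _, h, w = tensor.shape
    # Normalise absolute pixel coordinates to the [-1, 1] range expected by grid_sample.
    xs = displacement[:, 0] / (w - 1) * 2 - 1
    ys = displacement[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack((xs, ys), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(tensor, grid, align_corners=True)

class Decoder(nn.Module):
    def __init__(self, n: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n * 3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, transformed):                  # (B, n, 3, H, W)
        b, n, c, h, w = transformed.shape
        return self.net(transformed.view(b, n * c, h, w))   # output image R: (B, 3, H, W)

processed = torch.randn(1, 8, 3, 256, 256)             # from step 102
displacements = torch.rand(1, 8, 2, 256, 256) * 255    # from step 103 (absolute coordinates)
transformed = torch.stack(
    [warp(processed[:, i], displacements[:, i]) for i in range(8)], dim=1)
output_image = Decoder(n=8)(transformed)
print(output_image.shape)                              # torch.Size([1, 3, 256, 256])
```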
  • the step 101 to step 104 may be implemented by a trained or pre-trained facial expression editing model.
  • the facial expression editing model is invoked to generate the output image based on the input image and the expression image.
  • FIG. 2 is an exemplary schematic diagram of a facial expression editing model.
  • the facial expression editing model includes: a first encoder 21 , a second encoder 22 , and a decoder 23 .
  • the first encoder 21 is configured to encode an input image x based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image x, and to obtain an encoding result of the input image x according to the encoding tensor set and the attention map set.
  • the second encoder 22 is configured to encode an expression image y to obtain an encoding result of the expression image.
  • the decoder 23 is configured to generate an output image R according to the encoding result of the input image x and the encoding result of the expression image y.
  • the input image x is a girl with a smiling expression
  • the expression image y is a boy with a sad expression.
  • the final generated output image R has the identity feature of the input image x and the expression feature of the expression image y, that is, the output image R is a girl with a sad expression.
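  • To illustrate how the components of FIG. 2 fit together, the following snippet composes the FirstEncoder, encode_input, SecondEncoder, warp, and Decoder sketches given above into a single forward pass; all of those names are assumptions introduced in the illustrative snippets in this document, not identifiers from the patent.

```python
# End-to-end sketch assembling the earlier sketches into a facial-expression-editing
# forward pass: identity/appearance comes from input image x, expression from image y.
# Assumes FirstEncoder, SecondEncoder, Decoder, encode_input, and warp (defined in the
# snippets above) are in scope.
import torch

first_encoder = FirstEncoder(n=8)     # from the step-101 sketch
second_encoder = SecondEncoder(n=8)   # from the step-103 sketch
decoder = Decoder(n=8)                # from the step-104 sketch

def edit_expression(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    tensors, maps = first_encoder(x)                       # step 101
    processed = encode_input(tensors, maps)                # step 102
    displacements = second_encoder(y)                      # step 103
    transformed = torch.stack(
        [warp(processed[:, i], displacements[:, i]) for i in range(processed.shape[1])],
        dim=1)
    return decoder(transformed)                            # step 104: output image R

x = torch.randn(1, 3, 256, 256)    # input image (identity/appearance source)
y = torch.randn(1, 3, 256, 256)    # expression image (expression source)
R = edit_expression(x, y)
print(R.shape)                     # torch.Size([1, 3, 256, 256])
```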
  • a video or a dynamic picture including the output image may be generated.
  • the expression transformation processing described above may be performed on a plurality of input images to generate a plurality of output images accordingly, and then the plurality of output images are combined into a video or a dynamic picture.
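  • As a simple illustration of combining several output images into a dynamic picture, the snippet below writes frames to a GIF with the imageio library; the library choice and file name are arbitrary, and any video or image tool could be used instead.

```python
# Illustrative only: combine several output images (as uint8 HWC numpy arrays)
# into a GIF. Frame timing options differ between imageio versions, so they are
# omitted here.
import numpy as np
import imageio

frames = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(24)]
imageio.mimsave("edited_expression.gif", frames)
```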
  • an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
  • the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image
  • an output image is generated according to the encoding result of the input image and the encoding result of the expression image
  • the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • multichannel encoding is performed on the input image to generate a plurality of encoding tensors of the input image.
  • a plurality of possible expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units.
  • in addition, the attention mechanism is embedded, so that the expression encoded by each subunit can be intuitively understood through visual observation of the attention maps.
  • the facial expression editing model described above is a model constructed based on a generative adversarial network.
  • a facial expression editing model 30 includes a generator 31 and a discriminator 32 .
  • the generator 31 includes a first encoder 21 , a second encoder 22 , and a decoder 23 .
  • a training process of the facial expression editing model 30 is as follows.
  • at least one training sample is obtained, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions.
  • the generator is configured to generate an output image corresponding to the original image according to the original image and the target image
  • the discriminator is configured to determine, between the output image corresponding to the original image and the target image, which one is the image generated by the generator.
  • the facial expression editing model is constructed based on the generative adversarial network, and the image pair included in the training sample is two images of the same human face with different expressions.
  • the generator is configured to perform the face expression transformation processing described above to generate the output image corresponding to the original image, and the discriminator uses adversarial learning to adjust and optimize parameters of the generator, so that the output image corresponding to the original image generated by the generator is as similar as possible to the target image.
  • a Least Squares Generative Adversarial Network (LSGAN) may be selected as the generative adversarial network.
  • a loss function L total of the facial expression editing model is:
  • L_total = L_L1 + λ_LSGAN · L_LSGAN + λ_P · L_P + λ_O · L_O, where
  • L_L1 represents a first-order distance loss, that is, a Manhattan distance in the pixel dimension.
  • L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss, and λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses.
  • the weights corresponding to the three losses may be preset according to an actual situation, which is not limited in the embodiments of the present disclosure.
  • A represents the attention map set
  • σ(a) represents a sigmoid function of a.
  • the overlapping penalty loss L_O is introduced to encourage the use of different encoding tensors to encode different parts of an image.
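  • The following is a hedged sketch of how the total loss could be assembled. The L1 and least-squares GAN terms follow their standard definitions; the perceptual loss is left as a value supplied by the caller, and the exact form of the overlapping penalty is not given in the text, so the sigmoid-based version below is an assumption for illustration only.

```python
# Sketch of L_total = L_L1 + lambda_LSGAN * L_LSGAN + lambda_P * L_P + lambda_O * L_O.
# The overlap penalty form (penalising regions claimed by more than one
# sigmoid-activated attention map) is an assumption, not the patented formula.
import torch
import torch.nn.functional as F

def l1_loss(output_image, target_image):
    # Manhattan (L1) distance in the pixel dimension.
    return torch.mean(torch.abs(output_image - target_image))

def lsgan_generator_loss(disc_scores_on_fake):
    # Least-squares GAN: the generator pushes the discriminator's scores toward 1.
    return torch.mean((disc_scores_on_fake - 1.0) ** 2)

def overlap_penalty(attention_logits):
    # Assumed form: penalise pixels claimed by more than one attention map,
    # encouraging different encoding tensors to encode different image parts.
    a = torch.sigmoid(attention_logits)                # sigma(a), shape (B, n, H, W)
    return torch.mean(F.relu(a.sum(dim=1) - 1.0))

def total_loss(output_image, target_image, disc_scores_on_fake, attention_logits,
               perceptual_loss_value, lam_lsgan=1.0, lam_p=1.0, lam_o=1.0):
    return (l1_loss(output_image, target_image)
            + lam_lsgan * lsgan_generator_loss(disc_scores_on_fake)
            + lam_p * perceptual_loss_value
            + lam_o * overlap_penalty(attention_logits))
```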
  • a generative adversarial network in a self-supervised mode is used for training the generator, and additional labeling may not be needed.
  • Rules for the facial expression transformation are learned from unlabeled data, which helps to reduce the training cost of the model and improve the training efficiency of the model.
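  • A minimal sketch of one self-supervised adversarial training step follows, assuming a generator module that maps an (original, target) pair to an output image and a discriminator module that scores images; optimizer handling and the extra generator loss terms from L_total are indicated only schematically.

```python
# Sketch of one LSGAN-style training step on an (original, target) pair of the same
# face with different expressions. `generator` and `discriminator` are assumed to be
# torch.nn.Module instances; g_opt and d_opt are their optimizers.
import torch

def train_step(generator, discriminator, g_opt, d_opt, original, target):
    # Discriminator step: push scores of the real target toward 1 and of the
    # generated output toward 0.
    with torch.no_grad():
        fake = generator(original, target)
    d_loss = (torch.mean((discriminator(target) - 1.0) ** 2)
              + torch.mean(discriminator(fake) ** 2))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: fool the discriminator (the L1, perceptual, and overlap
    # terms from L_total would be added here as well).
    fake = generator(original, target)
    g_loss = torch.mean((discriminator(fake) - 1.0) ** 2)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```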
  • FIG. 4 is a block diagram of an image processing apparatus provided by an embodiment of the present disclosure.
  • the apparatus has a function of implementing the method example. The function may be implemented by using hardware or may be implemented by hardware executing corresponding software.
  • the apparatus may be a computer device or may be disposed in a computer device.
  • the apparatus 400 may include: a first encoding module 410 , a second encoding module 420 , and an image generating module 430 .
  • the first encoding module 410 is configured to encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
  • the encoding tensor set includes n encoding tensors
  • the attention map set includes n attention maps
  • n is an integer greater than 1.
  • the first encoding module 410 is further configured to obtain an encoding result of the input image according to the encoding tensor set and the attention map set.
  • the encoding result of the input image records an identity feature of a human face in the input image.
  • the second encoding module 420 is configured to encode an expression image to obtain an encoding result of the expression image.
  • the encoding result of the expression image records an expression feature of a human face in the expression image.
  • the image generating module 430 is configured to generate an output image according to the encoding result of the input image and the encoding result of the expression image.
  • the output image has the identity feature of the input image and the expression feature of the expression image.
  • the first encoding module 410 is configured to multiply, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
  • the encoding result of the expression image includes n displacement maps.
  • the image generating module 430 is configured to: perform spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and decode the n transformed encoding tensors to generate the output image.
  • the apparatus 400 further includes: a model invoking module 440 , configured to invoke a facial expression editing model, the facial expression editing model including: a first encoder, a second encoder, and a decoder; where the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set; the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
  • the facial expression editing model is a model constructed based on a generative adversarial network.
  • the facial expression editing model includes a generator and a discriminator.
  • the generator includes the first encoder, the second encoder, and the decoder.
  • a training process of the facial expression editing model includes: obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine, between the output image corresponding to the original image and the target image, which one is the image generated by the generator; and using the training sample to train the facial expression editing model.
  • a loss function L total of the facial expression editing model is:
  • L_total = L_L1 + λ_LSGAN · L_LSGAN + λ_P · L_P + λ_O · L_O, where
  • L_L1 represents a first-order distance loss.
  • L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss.
  • λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses.
  • A represents the attention map set
  • σ(a) represents a sigmoid function of a.
  • the apparatus 400 further includes: an image processing module 450 , configured to generate a video or a dynamic picture including the output image.
  • an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image.
  • Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • When the apparatus provided in the embodiments implements its functions, the division into the functional modules described above is merely used as an example for description.
  • In a practical implementation, the functions may be distributed to different functional modules according to requirements, that is, the internal structure of the device is divided into different functional modules, to implement all or some of the functions described above.
  • The apparatus embodiments and the method embodiments provided in the embodiments belong to the same conception. For the specific implementation process, refer to the method embodiments, and details are not described herein again.
  • FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
  • a computer device 600 includes a Central Processing Unit (CPU) 601 , a system memory 604 including a Random Access Memory (RAM) 602 and a Read-Only Memory (ROM) 603 , and a system bus 605 connecting the system memory 604 and the central processing unit 601 .
  • the computer device 600 further includes a basic input/output (I/O) system 606 assisting in transmitting information between components in the computer, and a mass storage device 607 configured to store an operating system 613 , an application program 614 , and another program module 615 .
  • the basic input/output system 606 includes a display 608 configured to display information and an input device 609 such as a mouse and a keyboard for a user to input information.
  • the display 608 and the input device 609 are both connected to the CPU 601 by an input/output controller 610 connected to the system bus 605 .
  • the basic input/output system 606 may further include the input/output controller 610 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, and an electronic stylus. Similarly, the input/output controller 610 further provides an output to a display screen, a printer or another type of an output device.
  • the mass storage device 607 is connected to the CPU 601 through a mass storage controller (not shown) connected to the system bus 605 .
  • the mass storage device 607 and an associated computer readable medium provide non-volatile storage for the computer device 600 . That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • the computer readable medium may include a computer storage medium and a communication medium.
  • the computer storage medium includes volatile and non-volatile media, and removable and non-removable media implemented by using any method or technology used for storing information such as computer readable instructions, data structures, program modules, or other data.
  • the computer storage medium includes a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory or another solid-state memory technology, a Compact Disc Read-Only Memory (CD-ROM), or another optical memory, a tape cartridge, a magnetic cassette, a magnetic disk memory, or another magnetic storage device.
  • a person skilled in the art may learn that the computer storage medium is not limited to the several types.
  • the system memory 604 and the mass storage device 607 may be collectively referred to as a memory.
  • the computer device 600 may further be connected, through a network such as the Internet, to a remote computer on the network for running. That is, the computer device 600 may be connected to a network 612 by using a network interface unit 611 connected to the system bus 605 , or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 611 .
  • The term “unit” in this disclosure may refer to a software unit, a hardware unit, or a combination thereof. A software unit (e.g., a computer program) may be developed using a computer programming language. A hardware unit may be implemented using processing circuitry and/or memory. Each unit can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more units. Moreover, each unit can be part of an overall unit that includes the functionalities of the unit.
  • the memory further includes at least one instruction, at least one program, a code set, or an instruction set.
  • the at least one instruction, the at least one program, the code set, or the instruction set is stored in the memory and is configured to be executed by one or more processors to implement the image processing method.
  • a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set, when executed by the processor of a terminal, implementing the image processing method.
  • the computer readable storage medium may include: a ROM, a RAM, a Solid State Drive (SSD), an optical disc, or the like.
  • the RAM may include a Resistance Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
  • a computer program product is further provided, the computer program product, when executed by the processor of a terminal, implementing the image processing method.
  • “Plurality of” mentioned in the present disclosure means two or more.
  • “And/or” describes an association relationship for describing associated objects and represents that three relationships may exist.
  • For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists.
  • the character “/” in the present disclosure generally indicates an “or” relationship between the associated objects.
  • the step numbers described in the present disclosure merely exemplarily show a performing sequence of the steps. In some other embodiments, the steps may not be performed according to the number sequence. For example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed according to a sequence contrary to the sequence shown in the figure. This is not limited in the embodiments of the present disclosure.

Abstract

An image processing method is provided. The method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.

Description

    RELATED APPLICATION(S)
  • This application is a continuation application of PCT Patent Application No. PCT/CN2020/117455 filed on Sep. 24, 2020, which claims priority to Chinese Patent Application No. 201911072470.8, entitled “IMAGE PROCESSING METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM” and filed on Nov. 5, 2019, all of which are incorporated herein by reference in entirety.
  • FIELD OF THE TECHNOLOGY
  • The present disclosure relates to the field of computer vision technologies in artificial intelligence technologies, and in particular, to an image processing method and apparatus, a device, and a storage medium.
  • BACKGROUND
  • Facial expression editing (also referred to as facial expression transformation) is to adjust an expression in a face image to obtain another image. For example, an expression in an original image is smile, and after the facial expression editing, an obtained expression in a target image is crying. However, in solutions provided to implement the facial expression transformation, the expression transformation capability may be limited.
  • SUMMARY
  • Embodiments of the present disclosure provide an image processing method and apparatus, a device, and a storage medium, which can generate an output image with a large expression difference from an input image, thereby improving an expression transformation capability. The technical solutions are as follows.
  • In one aspect, the present disclosure provides an image processing method, applied to a computer device, and the method includes: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • In another aspect, the present disclosure provides an image processing apparatus, and the apparatus includes a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • In yet another aspect, the present disclosure provides a non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform: encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1; obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image; encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
  • The technical solutions provided in the embodiments of the present disclosure may bring the following beneficial effects.
  • An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, then an encoding result of the input image is obtained according to the encoding tensor set and the attention map set, and then an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, and an output image is generated according to the encoding result of the input image and the encoding result of the expression image, the output image has the identity feature of the input image and the expression feature of the expression image, and the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • Other aspects of the present disclosure can be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To facilitate a better understanding of technical solutions of certain embodiments of the present disclosure, accompanying drawings are described below. The accompanying drawings are illustrative of certain embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without having to exert creative efforts. When the following descriptions are made with reference to the accompanying drawings, unless otherwise indicated, same numbers in different accompanying drawings may represent same or similar elements. In addition, the accompanying drawings are not necessarily drawn to scale.
  • FIG. 1 is a schematic flowchart of an image processing method according to one or more embodiments of the present disclosure;
  • FIG. 2 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure;
  • FIG. 3 is a schematic diagram of a facial expression editing model according to one or more embodiments of the present disclosure;
  • FIG. 4 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure;
  • FIG. 5 is a schematic block diagram of an image processing apparatus according to one or more embodiments of the present disclosure; and
  • FIG. 6 is a schematic structural block diagram of a computer device according to one or more embodiments of the present disclosure.
  • DETAILED DESCRIPTION
  • To make objectives, technical solutions, and/or advantages of the present disclosure more comprehensible, certain embodiments of the present disclosure are further elaborated in detail with reference to the accompanying drawings. The embodiments as described are not to be construed as a limitation to the present disclosure. All other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of embodiments of the present disclosure.
  • Throughout the description, and when applicable, “some embodiments” or “certain embodiments” describe subsets of all possible embodiments, but it may be understood that the “some embodiments” or “certain embodiments” may be the same subset or different subsets of all the possible embodiments, and can be combined with each other without conflict.
  • In certain embodiments, the term “based on” is employed herein interchangeably with the term “according to.”
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by the digital computer to simulate, extend, and expand human intelligence, perceive an environment, acquire knowledge, and use knowledge to obtain an optimal result. In other words, AI is a comprehensive technology in computer science and attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. AI is to study the design principles and implementation methods of various intelligent machines, to enable the machines to have the functions of perception, reasoning, and decision-making.
  • The AI technology is a comprehensive discipline, and relates to a wide range of fields including both hardware-level technologies and software-level technologies. The basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, the big data processing technology, the operating/interaction system, and electromechanical integration. AI software technologies include several directions such as the computer vision (CV) technology, the speech processing technology, the natural language processing technology, and machine learning/deep learning.
  • Computer vision (CV) technology is a science that studies how to enable a machine to “see”; that is, it uses a camera and a computer in place of human eyes to implement machine vision such as recognition, tracking, and measurement of a target, and further performs graphic processing so that the computer produces an image more suitable for human eyes to observe or for transmission to and detection by an instrument. As a scientific discipline, computer vision studies related theories and technologies and attempts to establish an artificial intelligence system that can obtain information from images or multi-dimensional data. The computer vision technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and further include biometric feature recognition technologies such as common face recognition and fingerprint recognition.
  • Machine learning (ML) is a multi-field interdisciplinary subject involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. Machine learning specializes in studying how a computer simulates or implements human learning behaviors to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of AI, is a basic way to make computers intelligent, and is applied to various fields of AI. Machine learning and deep learning generally involve technologies such as the artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
  • With the research and progress of the artificial intelligence technology, the artificial intelligence technology is studied and applied in a plurality of fields such as the smart home, smart wearable device, virtual assistant, smart speaker, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicle, robot, smart medical care, and smart customer service. It is believed that with the development of technologies, the AI technology will be applied to more fields, and play an increasingly important role.
  • An expression may be encoded by using spatial transformation, that is, the spatial transformation is performed on an original image to obtain a target image. Because an expression feature relies on the spatial transformation to be encoded into the target image, pixel units that do not appear in the original image cannot be generated. For example, if there are no teeth in the original image, there will be no teeth in the target image, so that a target image with a large expression difference from the original image cannot be generated, and the expression transformation capability is limited.
  • Solutions provided by the embodiments of the present disclosure involve the computer vision technology of artificial intelligence and provide an image processing method. An input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, an encoding result of the input image is then obtained according to the encoding tensor set and the attention map set, and an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, an output image generated according to the two encoding results has the identity feature of the input image and the expression feature of the expression image; the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving the expression transformation capability.
  • According to the method provided by the embodiments of the present disclosure, steps of the method may be performed by a computer device, which may be any electronic device with processing and storage capabilities, such as a mobile phone, a tablet computer, a game device, a multimedia playback device, an electronic photo frame, a wearable device, and a personal computer (PC), and may also be a server. In certain embodiments, the term “computer device” is employed herein interchangeably with the term “computing device.” For ease of description, in the following method embodiments, the steps are performed by a computer device, which, however, does not constitute a limitation.
  • FIG. 1 is a flowchart of an image processing method provided by an embodiment of the present disclosure. The method may include the following steps (101 to 104).
  • Step 101. Encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image.
  • In an embodiment of the present disclosure, the input image is a human face image, that is, an image containing a human face. Multichannel encoding is performed on the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image. The encoding tensor set includes n encoding tensors, the attention map set includes n attention maps, and n is an integer greater than 1. The value of n may be preset according to actual requirements, for example, the value of n is set to 8.
  • The input image may be encoded by using a first encoder to obtain the encoding tensor set and the attention map set of the input image. The first encoder is configured to extract an image feature of the input image and encode the image feature to obtain the two sets. In an embodiment of the present disclosure, the first encoder is an imaginative encoder. An imagination module is embedded in the first encoder, which enables the first encoder to generate a plurality of encoding tensors of the input image. A plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units. In addition, the attention mechanism is embedded in the imagination module, so that the expression encoded by each subunit can be intuitively understood through visual observation.
  • The attention mechanism is a pixel-level information enhancement mechanism for a specific target in a feature map, modeled in deep learning on the way human vision focuses on salient regions; that is, the attention mechanism can enhance target information in the feature map. After the feature map is processed based on the attention mechanism, the target information in the feature map is enhanced at the pixel level.
  • Assume that the input image is a 256×256×3 image, where 256×256 represents a resolution of the input image, and 3 represents three channels RGB. When n is 8, after the first encoder encodes the input image, the obtained encoding tensor set includes eight 256×256×3 encoding tensors, and the obtained attention map set includes eight 256×256×1 attention maps, where the eight encoding tensors are in one-to-one correspondence to the eight attention maps.
  • In certain embodiments, the first encoder may use a U-Net structure. The U-Net is an image segmentation model based on a Convolutional Neural Network (CNN), including a convolution layer, a max pooling layer (downsampling), a deconvolution layer (upsampling), and a Rectified Linear Unit (ReLU) layer. In some other embodiments, the first encoder may also use other network architectures. This is not limited in the embodiments of the present disclosure.
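  • The following is a minimal PyTorch-style sketch of such a multichannel encoder with an attention branch. It only illustrates the input and output shapes described above: the module name, layer sizes, and the plain convolutional backbone are assumptions made for brevity and do not reproduce the U-Net structure of the first encoder.

import torch
import torch.nn as nn

class FirstEncoderSketch(nn.Module):
    """Illustrative stand-in for the first encoder: one shared backbone, two output heads."""
    def __init__(self, n=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.to_tensors = nn.Conv2d(64, n * 3, 3, padding=1)   # n encoding tensors, 3 channels each
        self.to_attention = nn.Conv2d(64, n, 3, padding=1)     # n single-channel attention maps

    def forward(self, x):                                      # x: (B, 3, 256, 256)
        h = self.backbone(x)
        tensors = self.to_tensors(h)                           # (B, n*3, 256, 256)
        attention = torch.sigmoid(self.to_attention(h))        # (B, n, 256, 256), values in (0, 1)
        E = list(torch.split(tensors, 3, dim=1))               # encoding tensor set: n x (B, 3, 256, 256)
        A = list(torch.split(attention, 1, dim=1))             # attention map set:  n x (B, 1, 256, 256)
        return E, A

E, A = FirstEncoderSketch(n=8)(torch.randn(1, 3, 256, 256))
print(len(E), E[0].shape, A[0].shape)   # 8, (1, 3, 256, 256), (1, 1, 256, 256)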
  • Step 102. Obtain an encoding result of the input image according to the encoding tensor set and the attention map set, where the encoding result of the input image records an identity feature of a human face in the input image.
  • The identity feature refers to feature information used for distinguishing faces of different people. In an embodiment of the present disclosure, in addition to the identity feature of the human face in the input image, the encoding result of the input image further records an appearance feature of the human face in the input image, so that a final generated output image has the same identity feature and appearance feature as the input image. The appearance feature refers to feature information used for reflecting an external appearance attribute of a human face.
  • In certain embodiments, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map are multiplied to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
  • Assume that the input image is x, the encoding tensor set of the input image is E, the attention map set is A, the numbers of elements in E and A are both n, and n is an integer greater than 1. The encoding result E_s(x) of the input image x is expressed as:

  • E_s(x) = {e_i ⊗ a_i, 1 ≤ i ≤ n}, where
  • e_i represents the ith encoding tensor in the encoding tensor set E, and a_i represents the ith attention map in the attention map set A.
  • Assuming that the encoding tensor set includes eight 256×256×3 encoding tensors, and the attention map set includes eight 256×256×1 attention maps, the encoding result of the input image obtained after the operation includes eight 256×256×3 processed encoding tensors.
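  • Under the shapes of this example, step 102 reduces to an element-wise multiplication in which each single-channel attention map broadcasts over the three channels of its encoding tensor. A short sketch with placeholder tensors (the random values merely stand in for encoder outputs):

import torch

n = 8
E = [torch.randn(1, 3, 256, 256) for _ in range(n)]   # placeholder encoding tensor set
A = [torch.rand(1, 1, 256, 256) for _ in range(n)]    # placeholder attention map set

# E_s(x) = {e_i ⊗ a_i, 1 ≤ i ≤ n}: the single-channel map broadcasts over the RGB channels
Es_x = [e_i * a_i for e_i, a_i in zip(E, A)]
print(len(Es_x), Es_x[0].shape)                        # 8 torch.Size([1, 3, 256, 256])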
  • Step 103. Encode an expression image to obtain an encoding result of the expression image, where the encoding result of the expression image records an expression feature of a human face in the expression image.
  • The expression image is a face image used for providing an expression feature. In an embodiment of the present disclosure, the expression feature of the human face in the expression image is extracted by encoding the expression image, so that the final generated output image has the same identity feature and appearance feature as the input image, and has the same expression feature as the expression image. That is, the final output image is obtained by transforming the expression of the human face in the expression image into the input image and maintaining the identity feature and appearance feature of the human face in the input image.
  • The encoding result of the expression image includes a displacement map set. The displacement map set includes n displacement maps, where the ith displacement map is used for performing spatial transformation processing on the ith processed encoding tensor. Assuming that the expression image is y, the encoding result E_T(y) of the expression image y is expressed as:

  • E_T(y) = {O_i, 1 ≤ i ≤ n}, where
  • O_i represents the ith displacement map in the displacement map set O.
  • Exemplarily, the encoding result of the expression image may include eight 256×256×2 displacement maps. The ith 256×256×2 displacement map is used as an example: it includes two 256×256 planes. The element values of a pixel at position (x, y) in the two planes are recorded as x′ and y′, indicating that the pixel at position (x, y) is moved to (x′, y′) in the ith processed encoding tensor.
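  • Purely as an illustration of this convention (the displacement values below are made up), a displacement map can be read as a per-pixel table of target coordinates:

import numpy as np

H = W = 256
# placeholder ith displacement map: two H x W planes holding target coordinates x' and y'
xx, yy = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32))
O_i = np.stack([xx, yy], axis=-1)    # (H, W, 2); identity mapping: every pixel maps to itself
O_i[..., 0] += 5.0                   # made-up edit: move every pixel 5 columns to the right

x, y = 100, 120
x_new, y_new = O_i[y, x]             # element values at (x, y) give the destination (x', y')
print((x, y), "->", (x_new, y_new))  # (100, 120) -> (105.0, 120.0)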
  • The expression image may be encoded by using a second encoder to obtain the encoding result of the expression image. The second encoder is configured to extract an image feature of the expression image and encode the image feature to obtain a displacement map set as the encoding result of the expression image.
  • In addition, the network structures of the second encoder and the first encoder may be the same or different. This is not limited in the embodiments of the present disclosure.
  • Step 104. Generate the output image according to the encoding result of the input image and the encoding result of the expression image, where the output image has the identity feature of the input image and the expression feature of the expression image.
  • The identity feature carried in the encoding result of the input image is mixed with the expression feature carried in the encoding result of the expression image, and then the output image is reconstructed, so that the output image has the identity feature of the input image and the expression feature of the expression image.
  • In certain embodiments, for each group of a corresponding processed encoding tensor and a corresponding displacement map, the displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain n transformed encoding tensors; and the n transformed encoding tensors are decoded to generate the output image. Directly concatenating the expression image with the processed encoding tensors may cause the identity feature of the expression image to leak into the final output image. Considering this, in an embodiment of the present disclosure, the final training objective of the second encoder is set to learn a suitable displacement map. The displacement map is used for performing spatial transformation processing on the processed encoding tensor to obtain the transformed encoding tensor, and the transformed encoding tensor is decoded to generate the output image, so that the output image records only the identity feature of the input image, but not the identity feature of the expression image.
  • In certain embodiments, the transformed encoding tensor set F is expressed as:

  • F = ST(E_s(x), O), where
  • ST represents the spatial transformation processing. After the spatial transformation processing, the transformed encoding tensor set F includes n transformed encoding tensors, and finally a decoder decodes the n transformed encoding tensors to generate the output image R:

  • R = D_R({F_i, 1 ≤ i ≤ n}), where
  • F_i represents the ith transformed encoding tensor, and D_R represents the decoding processing.
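  • A sketch of these two equations in PyTorch is given below. It assumes the displacement maps hold absolute target coordinates and interprets ST as a sampling-based warp with torch.nn.functional.grid_sample; the warp direction, the coordinate normalization, and the small convolutional decoder are implementation assumptions rather than details taken from the present disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_transform(tensor, disp):
    """tensor: (B, 3, H, W); disp: (B, 2, H, W) holding absolute pixel coordinates (x', y')."""
    B, _, H, W = tensor.shape
    gx = disp[:, 0] / (W - 1) * 2 - 1           # normalize x' to [-1, 1]
    gy = disp[:, 1] / (H - 1) * 2 - 1           # normalize y' to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)        # (B, H, W, 2), the layout grid_sample expects
    return F.grid_sample(tensor, grid, align_corners=True)

n, B, H, W = 8, 1, 256, 256
Es_x = [torch.randn(B, 3, H, W) for _ in range(n)]          # processed encoding tensors E_s(x)
O = [torch.rand(B, 2, H, W) * (W - 1) for _ in range(n)]    # placeholder displacement maps

F_set = [spatial_transform(e, o) for e, o in zip(Es_x, O)]  # F = ST(E_s(x), O)

decoder = nn.Sequential(                                    # placeholder decoder D_R
    nn.Conv2d(n * 3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
)
R = decoder(torch.cat(F_set, dim=1))                        # R = D_R({F_i, 1 ≤ i ≤ n})
print(R.shape)                                              # torch.Size([1, 3, 256, 256])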
  • In an exemplary embodiment, the step 101 to step 104 may be implemented by a trained or pre-trained facial expression editing model. The facial expression editing model is invoked to generate the output image based on the input image and the expression image. FIG. 2 is an exemplary schematic diagram of a facial expression editing model. The facial expression editing model includes: a first encoder 21, a second encoder 22, and a decoder 23. The first encoder 21 is configured to encode an input image x based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image x, and to obtain an encoding result of the input image x according to the encoding tensor set and the attention map set. The second encoder 22 is configured to encode an expression image y to obtain an encoding result of the expression image. The decoder 23 is configured to generate an output image R according to the encoding result of the input image x and the encoding result of the expression image y. For example, as shown in FIG. 2, the input image x is a girl with a smiling expression, and the expression image y is a boy with a sad expression. After the expression transformation processing, the final generated output image R has the identity feature of the input image x and the expression feature of the expression image y, that is, the output image R is a girl with a sad expression.
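  • The data flow of FIG. 2 can be summarized in a few lines. The sketch below treats the three sub-networks and the spatial transformation as interchangeable callables; the function name, the argument order, and the dummy stand-ins used in the call are assumptions for illustration only.

import torch

def edit_expression(x, y, first_encoder, second_encoder, decoder, spatial_transform):
    """Compose the first encoder, second encoder, and decoder of FIG. 2."""
    E, A = first_encoder(x)                         # encoding tensor set and attention map set of x
    Es_x = [e * a for e, a in zip(E, A)]            # encoding result of the input image
    O = second_encoder(y)                           # displacement maps (encoding result of y)
    F_set = [spatial_transform(e, o) for e, o in zip(Es_x, O)]
    return decoder(torch.cat(F_set, dim=1))         # output image R

# dummy stand-ins, only to show the calling convention
n = 4
first = lambda x: ([torch.randn(1, 3, 64, 64)] * n, [torch.rand(1, 1, 64, 64)] * n)
second = lambda y: [torch.zeros(1, 2, 64, 64)] * n
decode = lambda f: f[:, :3]
identity_warp = lambda e, o: e
R = edit_expression(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64),
                    first, second, decode, identity_warp)
print(R.shape)   # torch.Size([1, 3, 64, 64])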
  • In certain embodiments, after the expression transformation is performed on the input image to generate the output image, a video or a dynamic picture including the output image may be generated. For example, the expression transformation processing described above may be performed on a plurality of input images to generate a plurality of output images accordingly, and then the plurality of output images are combined into a video or a dynamic picture.
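  • As a small, non-limiting example of this last step, several output images could be combined into a dynamic picture with a general-purpose library such as imageio; the library choice, the frame data, and the file name are assumptions, not requirements of the present disclosure.

import numpy as np
import imageio

# stand-ins for a sequence of generated output images (H x W x 3, uint8)
frames = [(np.random.rand(256, 256, 3) * 255).astype(np.uint8) for _ in range(10)]
imageio.mimsave("edited_expression.gif", frames, duration=0.1)   # write a dynamic picture (GIF)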
  • In summary, in the technical solutions provided in the embodiments of the present disclosure, an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, an encoding result of the input image is then obtained according to the encoding tensor set and the attention map set, and an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, an output image generated according to the two encoding results has the identity feature of the input image and the expression feature of the expression image; the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • In addition, multichannel encoding is performed on the input image to generate a plurality of encoding tensors of the input image. A plurality of expressions of the input image are encoded in the plurality of encoding tensors, thereby obtaining more diversified pixel units. In addition, the attention mechanism is embedded so that the expression encoded by each subunit can be intuitively understood through visual observation.
  • In an exemplary embodiment, the facial expression editing model described above is a model constructed based on a generative adversarial network. As shown in FIG. 3, a facial expression editing model 30 includes a generator 31 and a discriminator 32. The generator 31 includes a first encoder 21, a second encoder 22, and a decoder 23.
  • A training process of the facial expression editing model 30 is as follows.
  • 1. Obtain at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator.
  • 2. Use the training sample to train the facial expression editing model.
  • In an embodiment of the present disclosure, the facial expression editing model is constructed based on the generative adversarial network, and the image pair included in the training sample is two images of the same human face with different expressions. The generator is configured to perform the face expression transformation processing described above to generate the output image corresponding to the original image, and the discriminator uses adversarial learning to adjust and optimize parameters of the generator, so that the output image corresponding to the original image generated by the generator is as similar as possible to the target image. In certain embodiments, a Least Squares Generative Adversarial Network (LSGAN) may be selected as the generative adversarial network.
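  • For reference, the generic least-squares adversarial objectives take the following form; the real and fake targets of 1 and 0 are the common LSGAN choice and are shown here as an assumption, not as a detail specified in the present disclosure.

import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    # discriminator: push scores on real (target) images toward 1 and on generated images toward 0
    return 0.5 * (F.mse_loss(d_real, torch.ones_like(d_real)) +
                  F.mse_loss(d_fake, torch.zeros_like(d_fake)))

def lsgan_g_loss(d_fake):
    # generator: push discriminator scores on generated images toward 1
    return F.mse_loss(d_fake, torch.ones_like(d_fake))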
  • In certain embodiments, a loss function L_total of the facial expression editing model is:

  • L_total = L_L1 + λ_LSGAN L_LSGAN + λ_P L_P + λ_O L_O, where
  • L_L1 represents a first-order distance loss, that is, a Manhattan distance in the pixel dimension; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; and λ_LSGAN, λ_P, and λ_O respectively represent the weights corresponding to the three losses. The weights corresponding to the three losses may be preset according to an actual situation, which is not limited in the embodiments of the present disclosure. In the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i. In an embodiment of the present disclosure, to make full use of the channel width, the overlapping penalty loss L_O is introduced to encourage the use of different encoding tensors to encode different parts of an image.
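  • A sketch of how the total loss could be assembled is given below. The lambda values are placeholders, the perceptual loss is passed in as an opaque callable, and the overlapping penalty uses one plausible reading of Σ_{i=1}^{n} σ(a_i) − 1 (mean absolute deviation from 1 per pixel); all of these are assumptions for illustration.

import torch

def total_loss(output, target, d_fake, attention_maps, perceptual_loss,
               lam_lsgan=1.0, lam_p=1.0, lam_o=0.1):
    l_l1 = (output - target).abs().mean()              # first-order (Manhattan) distance in the pixel dimension
    l_lsgan = ((d_fake - 1.0) ** 2).mean()             # least-squares adversarial term for the generator
    l_p = perceptual_loss(output, target)              # perceptual loss (backbone not specified here)
    stacked = torch.stack([torch.sigmoid(a) for a in attention_maps], dim=0)
    l_o = (stacked.sum(dim=0) - 1.0).abs().mean()      # overlapping penalty on the n attention maps
    return l_l1 + lam_lsgan * l_lsgan + lam_p * l_p + lam_o * l_o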
  • In summary, in the technical solutions provided in the embodiments of the present disclosure, a generative adversarial network in a self-supervised mode is used for training the generator, and additional labeling may not be needed. Rules for the facial expression transformation are learned from unlabeled data, which helps to reduce the training cost of the model and improve the training efficiency of the model.
  • The following describes apparatus embodiments of the present disclosure, which can be used for performing the method embodiments of the present disclosure. For details not disclosed in the apparatus embodiments of the present disclosure, refer to the method embodiments of the present disclosure.
  • FIG. 4 is a block diagram of an image processing apparatus provided by an embodiment of the present disclosure. The apparatus has a function of implementing the method example. The function may be implemented by using hardware or may be implemented by hardware executing corresponding software. The apparatus may be a computer device or may be disposed in a computer device. The apparatus 400 may include: a first encoding module 410, a second encoding module 420, and an image generating module 430.
  • The first encoding module 410 is configured to encode an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image. The encoding tensor set includes n encoding tensors, the attention map set includes n attention maps, and n is an integer greater than 1.
  • The first encoding module 410 is further configured to obtain an encoding result of the input image according to the encoding tensor set and the attention map set. The encoding result of the input image records an identity feature of a human face in the input image.
  • The second encoding module 420 is configured to encode an expression image to obtain an encoding result of the expression image. The encoding result of the expression image records an expression feature of a human face in the expression image.
  • The image generating module 430 is configured to generate an output image according to the encoding result of the input image and the encoding result of the expression image. The output image has the identity feature of the input image and the expression feature of the expression image.
  • In an exemplary embodiment, the first encoding module 410 is configured to multiply, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, where the encoding result of the input image includes the n processed encoding tensors.
  • In an exemplary embodiment, the encoding result of the expression image includes n displacement maps.
  • The image generating module 430 is configured to: perform spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and decode the n transformed encoding tensors to generate the output image.
  • In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes: a model invoking module 440, configured to invoke a facial expression editing model, the facial expression editing model including: a first encoder, a second encoder, and a decoder; where the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set; the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
  • In an exemplary embodiment, the facial expression editing model is a model constructed based on a generative adversarial network. The facial expression editing model includes a generator and a discriminator. The generator includes the first encoder, the second encoder, and the decoder.
  • A training process of the facial expression editing model includes: obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, where the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and using the training sample to train the facial expression editing model.
  • In an exemplary embodiment, a loss function L_total of the facial expression editing model is:

  • L_total = L_L1 + λ_LSGAN L_LSGAN + λ_P L_P + λ_O L_O, where
  • L_L1 represents a first-order distance loss; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses; and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i.
  • In an exemplary embodiment, as shown in FIG. 5, the apparatus 400 further includes: an image processing module 450, configured to generate a video or a dynamic picture including the output image.
  • In summary, in the technical solutions provided in the embodiments of the present disclosure, an input image is encoded based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, an encoding result of the input image is then obtained according to the encoding tensor set and the attention map set, and an expression image is encoded to obtain an encoding result of the expression image. Because the encoding result of the input image records an identity feature of a human face in the input image and the encoding result of the expression image records an expression feature of a human face in the expression image, an output image generated according to the two encoding results has the identity feature of the input image and the expression feature of the expression image; the expression feature of the output image is determined by the expression image instead of the input image. In this way, the generated output image can have a large expression difference from the input image, thereby improving an expression transformation capability.
  • When the apparatus provided in the foregoing embodiments implements its functions, the division into the functional modules described above is merely used as an example. In a practical implementation, the functions may be allocated to different functional modules as required; that is, the internal structure of the device is divided into different functional modules to implement all or some of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the foregoing embodiments belong to the same conception. For the specific implementation process, refer to the method embodiments; details are not described herein again.
  • FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
  • A computer device 600 includes a Central Processing Unit (CPU) 601, a system memory 604 including a Random Access Memory (RAM) 602 and a Read-Only Memory (ROM) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601. The computer device 600 further includes a basic input/output (I/O) system 606 assisting in transmitting information between components in the computer, and a mass storage device 607 configured to store an operating system 613, an application program 614, and another program module 615.
  • The basic input/output system 606 includes a display 608 configured to display information and an input device 609 such as a mouse and a keyboard for a user to input information. The display 608 and the input device 609 are both connected to the CPU 601 by an input/output controller 610 connected to the system bus 605. The basic input/output system 606 may further include the input/output controller 610 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, and an electronic stylus. Similarly, the input/output controller 610 further provides an output to a display screen, a printer or another type of an output device.
  • The mass storage device 607 is connected to the CPU 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and an associated computer readable medium provide non-volatile storage for the computer device 600. That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • Without loss of generality, the computer readable medium may include a computer storage medium and a communication medium. The computer storage medium includes volatile and non-volatile media, and removable and non-removable media implemented by using any method or technology used for storing information such as computer readable instructions, data structures, program modules, or other data. The computer storage medium includes a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory or another solid-state memory technology, a Compact Disc Read-Only Memory (CD-ROM), or another optical memory, a tape cartridge, a magnetic cassette, a magnetic disk memory, or another magnetic storage device. A person skilled in the art may learn that the computer storage medium is not limited to the several types. The system memory 604 and the mass storage device 607 may be collectively referred to as a memory.
  • According to the various embodiments of the present disclosure, the computer device 600 may further be connected, through a network such as the Internet, to a remote computer on the network for running. That is, the computer device 600 may be connected to a network 612 by using a network interface unit 611 connected to the system bus 605, or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 611.
  • The term unit (and other similar terms such as subunit, module, submodule, etc.) in this disclosure may refer to a software unit, a hardware unit, or a combination thereof. A software unit (e.g., computer program) may be developed using a computer programming language. A hardware unit may be implemented using processing circuitry and/or memory. Each unit can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more units. Moreover, each unit can be part of an overall unit that includes the functionalities of the unit.
  • The memory further includes at least one instruction, at least one program, a code set, or an instruction set. The at least one instruction, the at least one program, the code set, or the instruction set is stored in the memory and is configured to be executed by one or more processors to implement the image processing method.
  • In an exemplary embodiment, a computer readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set, when executed by the processor of a terminal, implementing the image processing method.
  • In certain embodiments, the computer readable storage medium may include: a ROM, a RAM, a Solid State Drive (SSD), an optical disc, or the like. The RAM may include a Resistance Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
  • In an exemplary embodiment, a computer program product is further provided, the computer program product, when executed by the processor of a terminal, implementing the image processing method.
  • “Plurality of” mentioned in the present disclosure means two or more. “And/or” describes an association relationship between associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists. The character “/” in the present disclosure generally indicates an “or” relationship between the associated objects. In addition, the step numbers described in the present disclosure merely exemplarily show a performing sequence of the steps. In some other embodiments, the steps may not be performed according to the number sequence. For example, two steps with different numbers may be performed simultaneously, or two steps with different numbers may be performed in a sequence contrary to the sequence shown in the figure. This is not limited in the embodiments of the present disclosure.
  • The descriptions are merely exemplary embodiments of the present disclosure, but are not intended to limit the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure all fall within the protection scope of the present disclosure.

Claims (20)

What is claimed is:
1. An image processing method, applied to a computer device, the method comprising:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
2. The method according to claim 1, wherein obtaining the encoding result of the input image comprises:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
3. The method according to claim 2, wherein the encoding result of the expression image includes n displacement maps, and generating the output image comprises:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
4. The method according to claim 1, further comprising:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
5. The method according to claim 4, wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model comprises:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and
using the training sample to train the facial expression editing model.
6. The method according to claim 4, wherein a loss function L_total of the facial expression editing model is:

L_total = L_L1 + λ_LSGAN L_LSGAN + λ_P L_P + λ_O L_O, where
L_L1 represents a first-order distance loss; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses; and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i.
7. The method according to claim 1, further comprising:
generating a video or a dynamic picture including the output image.
8. An image processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image; the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
9. The apparatus according to claim 8, wherein obtaining the encoding result of the input image includes:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
10. The apparatus according to claim 9, wherein the encoding result of the expression image includes n displacement maps; and generating the output image includes:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
11. The apparatus according to claim 8, wherein the processor is further configured to execute the computer program instructions and perform:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
12. The apparatus according to claim 11, wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model includes:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and
using the training sample to train the facial expression editing model.
13. The apparatus according to claim 8, wherein the processor is further configured to execute the computer program instructions and perform:
generating a video or a dynamic picture including the output image.
14. The apparatus according to claim 8, wherein a loss function L_total of the facial expression editing model is:

L_total = L_L1 + λ_LSGAN L_LSGAN + λ_P L_P + λ_O L_O, where
L_L1 represents a first-order distance loss; L_LSGAN, L_P, and L_O respectively represent a least squares generative adversarial network loss, a perceptual loss, and an overlapping penalty loss; λ_LSGAN, λ_P, and λ_O respectively represent weights corresponding to the three losses; and in the overlapping penalty loss L_O = Σ_{i=1}^{n} σ(a_i) − 1, a_i ∈ A, A represents the attention map set, and σ(a_i) represents a sigmoid function of a_i.
15. A non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform:
encoding an input image based on an attention mechanism to obtain an encoding tensor set and an attention map set of the input image, the encoding tensor set including n encoding tensors, the attention map set including n attention maps, and n being an integer greater than 1;
obtaining an encoding result of the input image according to the encoding tensor set and the attention map set, the encoding result of the input image recording an identity feature of a human face in the input image;
encoding an expression image to obtain an encoding result of the expression image, the encoding result of the expression image recording an expression feature of a human face in the expression image; and
generating an output image according to the encoding result of the input image and the encoding result of the expression image, the output image having the identity feature of the input image and the expression feature of the expression image.
16. The non-transitory computer-readable storage medium according to claim 15, wherein obtaining the encoding result of the input image includes:
multiplying, for each group of a corresponding encoding tensor and a corresponding attention map in the encoding tensor set and the attention map set, the encoding tensor and the attention map to obtain n processed encoding tensors, wherein the encoding result of the input image includes the n processed encoding tensors.
17. The non-transitory computer-readable storage medium according to claim 16, wherein the encoding result of the expression image includes n displacement maps, and generating the output image includes:
performing spatial transformation processing, for each group of a corresponding processed encoding tensor and a corresponding displacement map, on the processed encoding tensor by using the displacement map to obtain n transformed encoding tensors; and
decoding the n transformed encoding tensors to generate the output image.
18. The non-transitory computer-readable storage medium according to claim 15, wherein the computer program instructions are executable by the at least one processor to further perform:
invoking a facial expression editing model, the facial expression editing model including a first encoder, a second encoder, and a decoder, wherein:
the first encoder is configured to encode the input image based on the attention mechanism to obtain the encoding tensor set and the attention map set of the input image, and to obtain the encoding result of the input image according to the encoding tensor set and the attention map set;
the second encoder is configured to encode the expression image to obtain the encoding result of the expression image; and
the decoder is configured to generate the output image according to the encoding result of the input image and the encoding result of the expression image.
19. The non-transitory computer-readable storage medium according to claim 15, wherein the facial expression editing model is a model constructed based on a generative adversarial network, the facial expression editing model includes a generator and a discriminator, and the generator includes the first encoder, the second encoder, and the decoder, and wherein a training process of the facial expression editing model includes:
obtaining at least one training sample, each training sample being an image pair including an original image and a target image, the original image and the target image being two images of a same human face, and the original image and the target image having different expressions, wherein the generator is configured to generate an output image corresponding to the original image according to the original image and the target image, and the discriminator is configured to determine whether the output image corresponding to the original image and the target image are images generated by the generator; and
using the training sample to train the facial expression editing model.
20. The non-transitory computer-readable storage medium according to claim 15, wherein the computer program instructions are executable by the at least one processor to further perform:
generating a video or a dynamic picture including the output image.
US17/497,883 2019-11-05 2021-10-08 Image processing method and apparatus, device, and storage medium Pending US20220028031A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201911072470.8 2019-11-05
CN201911072470.8A CN110796111B (en) 2019-11-05 2019-11-05 Image processing method, device, equipment and storage medium
PCT/CN2020/117455 WO2021088556A1 (en) 2019-11-05 2020-09-24 Image processing method and apparatus, device, and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117455 Continuation WO2021088556A1 (en) 2019-11-05 2020-09-24 Image processing method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
US20220028031A1 true US20220028031A1 (en) 2022-01-27

Family

ID=69442779

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/497,883 Pending US20220028031A1 (en) 2019-11-05 2021-10-08 Image processing method and apparatus, device, and storage medium

Country Status (3)

Country Link
US (1) US20220028031A1 (en)
CN (1) CN110796111B (en)
WO (1) WO2021088556A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706388A (en) * 2021-09-24 2021-11-26 上海壁仞智能科技有限公司 Image super-resolution reconstruction method and device
US11526971B2 (en) * 2020-06-01 2022-12-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for translating image and method for training image translation model

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796111B (en) * 2019-11-05 2020-11-10 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN111401216B (en) * 2020-03-12 2023-04-18 腾讯科技(深圳)有限公司 Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN113538604B (en) * 2020-04-21 2024-03-19 中移(成都)信息通信科技有限公司 Image generation method, device, equipment and medium
CN111553267B (en) * 2020-04-27 2023-12-01 腾讯科技(深圳)有限公司 Image processing method, image processing model training method and device
CN111783603A (en) * 2020-06-24 2020-10-16 有半岛(北京)信息科技有限公司 Training method for generating confrontation network, image face changing method and video face changing method and device
CN113507608A (en) * 2021-06-09 2021-10-15 北京三快在线科技有限公司 Image coding method and device and electronic equipment
CN113723480B (en) * 2021-08-18 2024-03-05 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN114565941A (en) * 2021-08-24 2022-05-31 商汤国际私人有限公司 Texture generation method, device, equipment and computer readable storage medium
CN114866345B (en) * 2022-07-05 2022-12-09 支付宝(杭州)信息技术有限公司 Processing method, device and equipment for biological recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200294294A1 (en) * 2019-03-15 2020-09-17 NeoCortext Inc. Face-swapping apparatus and method
US10825219B2 (en) * 2018-03-22 2020-11-03 Northeastern University Segmentation guided image generation with adversarial networks
US20210056348A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene
US20220222897A1 (en) * 2019-06-28 2022-07-14 Microsoft Technology Licensing, Llc Portrait editing and synthesis

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9610510B2 (en) * 2015-07-21 2017-04-04 Disney Enterprises, Inc. Sensing and managing vehicle behavior based on occupant awareness
CN108921061B (en) * 2018-06-20 2022-08-26 腾讯科技(深圳)有限公司 Expression recognition method, device and equipment
CN109190472B (en) * 2018-07-28 2021-09-14 天津大学 Pedestrian attribute identification method based on image and attribute combined guidance
CN109325422A (en) * 2018-08-28 2019-02-12 深圳壹账通智能科技有限公司 Expression recognition method, device, terminal and computer readable storage medium
CN109508689B (en) * 2018-11-28 2023-01-03 中山大学 Face recognition method for strengthening confrontation
CN109934116B (en) * 2019-02-19 2020-11-24 华南理工大学 Standard face generation method based on confrontation generation mechanism and attention generation mechanism
CN109934767A (en) * 2019-03-06 2019-06-25 中南大学 A kind of human face expression conversion method of identity-based and expressive features conversion
CN110008846B (en) * 2019-03-13 2022-08-30 南京邮电大学 Image processing method
CN110222588B (en) * 2019-05-15 2020-03-27 合肥进毅智能技术有限公司 Human face sketch image aging synthesis method, device and storage medium
CN110796111B (en) * 2019-11-05 2020-11-10 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10825219B2 (en) * 2018-03-22 2020-11-03 Northeastern University Segmentation guided image generation with adversarial networks
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene
US20200294294A1 (en) * 2019-03-15 2020-09-17 NeoCortext Inc. Face-swapping apparatus and method
US20220222897A1 (en) * 2019-06-28 2022-07-14 Microsoft Technology Licensing, Llc Portrait editing and synthesis
US20210056348A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chen, Mingyi, et al. "Double encoder conditional GAN for facial expression synthesis." 2018 37th Chinese Control Conference (CCC). IEEE, 2018. (Year: 2018) *
Zhang, Gang, et al. "Generative adversarial network with spatial attention for face attribute editing." Proceedings of the European conference on computer vision (ECCV). 2018. (Year: 2018) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11526971B2 (en) * 2020-06-01 2022-12-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for translating image and method for training image translation model
CN113706388A (en) * 2021-09-24 2021-11-26 上海壁仞智能科技有限公司 Image super-resolution reconstruction method and device

Also Published As

Publication number Publication date
WO2021088556A1 (en) 2021-05-14
CN110796111A (en) 2020-02-14
CN110796111B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
US20220028031A1 (en) Image processing method and apparatus, device, and storage medium
CN112215927B (en) Face video synthesis method, device, equipment and medium
US20210224601A1 (en) Video sequence selection method, computer device, and storage medium
US20230049533A1 (en) Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
US20230082605A1 (en) Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium
US20230072627A1 (en) Gaze correction method and apparatus for face image, device, computer-readable storage medium, and computer program product face image
CN111553267B (en) Image processing method, image processing model training method and device
US20210192701A1 (en) Image processing method and apparatus, device, and storage medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
WO2024051480A1 (en) Image processing method and apparatus, computer device, and storage medium
CN116958323A (en) Image generation method, device, electronic equipment, storage medium and program product
CN113657272B (en) Micro video classification method and system based on missing data completion
US20240046471A1 (en) Three-dimensional medical image recognition method and apparatus, device, storage medium, and product
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN110047118B (en) Video generation method, device, computer equipment and storage medium
CN113538254A (en) Image restoration method and device, electronic equipment and computer readable storage medium
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
WO2024066549A1 (en) Data processing method and related device
CN112463936A (en) Visual question answering method and system based on three-dimensional information
CN115731101A (en) Super-resolution image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SUN, TIANYU;HUANG, HAOZHI;LIU, WEI;SIGNING DATES FROM 20210924 TO 20210926;REEL/FRAME:057745/0578

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS