WO2024104144A1 - Image synthesis method and device, storage medium, and electronic device - Google Patents

Image synthesis method and device, storage medium, and electronic device

Info

Publication number
WO2024104144A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
sample
target
loss value
initial
Prior art date
Application number
PCT/CN2023/128423
Other languages
English (en)
French (fr)
Inventor
贺珂珂
朱俊伟
邰颖
汪铖杰
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024104144A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/20: Drawing from basic elements, e.g. lines or circles
    • G06T 11/206: Drawing of charts or graphs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning

Definitions

  • the present application relates to the field of computers, and in particular to an image synthesis method and device, a storage medium, and an electronic device.
  • the conventional image synthesis method is to input a source image containing source image identity information (e.g., source image identity information such as eye information and nose information) and a template image containing template image background information (e.g., template image background information such as facial angle and facial expression), so as to output a synthesized image that can maintain the template image background information in the template image and is as similar as possible to the source image identity information contained in the source image.
  • the embodiments of the present application provide an image synthesis method and device, a storage medium and an electronic device to at least solve the technical problem that the synthesized image effect is poor and unnatural because the object in the template image to be synthesized has a large posture or the object is occluded.
  • an image synthesis method comprising: obtaining a source image to be processed and a template image, wherein the source image includes identity information of the source image to be synthesized, and the template image includes background information of the template image to be synthesized; performing a target synthesis operation on the source image and the template image to obtain an initial synthesized image, wherein the initial synthesized image includes the identity information of the source image and the background information of the template image and a partial area to be corrected; performing a target correction operation on the source image, the template image and the initial synthesized image to obtain a target residual image; synthesizing the initial synthesized image and the target residual image into a target synthesized image, wherein the target synthesized image includes the identity information of the source image and the background information of the template image and the partial area corrected according to the target residual image.
  • an image synthesis device including: an acquisition module, used to acquire a source image and a template image to be processed, wherein the source image includes the identity information of the source image to be synthesized, and the template image includes the background information of the template image to be synthesized; a first synthesis module, used to perform a target synthesis operation on the source image and the template image to obtain an initial synthesized image, wherein the initial synthesized image includes the source image identity information and the template image background information and a partial area to be corrected; a correction module, used to perform a target correction operation on the source image, the template image and the initial synthesized image to obtain a target residual image; a second synthesis module, used to synthesize the initial synthesized image and the target residual image into a target synthesized image, wherein the target synthesized image includes the source image identity information and the template image background information and the partial area corrected according to the target residual image.
  • a computer program product or a computer program wherein the computer program product or the computer program comprises computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the computer instructions are read from a computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above image synthesis method.
  • an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the image synthesis method through the computer program.
  • a robust image synthesis result can be achieved in the presence of large postures and occlusions, meeting the needs of face-changing in some difficult scenarios, thereby optimizing the effect of synthesizing images in the presence of large postures or occlusions, making the synthesized images more robust and more natural, and further solving the technical problem of poor and unnatural synthesized images due to large postures or occlusions of the objects in the template image to be synthesized.
  • FIG1 is a schematic diagram of an application environment of an image synthesis method according to an embodiment of the present application.
  • FIG2 is a flow chart of an image synthesis method according to an embodiment of the present application.
  • FIG3 is a schematic diagram of a posture of a target object in a three-dimensional space according to an embodiment of the present application
  • FIG4 is a schematic diagram of a target object with a target posture according to an embodiment of the present application.
  • FIG5 is a schematic diagram of a target object being blocked according to an embodiment of the present application.
  • FIG6 is a schematic diagram of an image synthesis method according to an embodiment of the present application.
  • FIG7 is a schematic diagram of another image synthesis method according to an embodiment of the present application.
  • FIG8 is a schematic diagram of another image synthesis method according to an embodiment of the present application.
  • FIG9 is a schematic diagram of another image synthesis method according to an embodiment of the present application.
  • FIG10 is a schematic diagram of the structure of an image synthesis device according to an embodiment of the present application.
  • FIG11 is a schematic diagram of the structure of an image synthesis product according to an embodiment of the present application.
  • FIG12 is a schematic diagram of the structure of an electronic device according to an embodiment of the present application.
  • Generative Adversarial Network (GAN): a method of unsupervised learning that learns by letting two neural networks compete with each other. It consists of a generator network and a discriminator network.
  • the generator network randomly samples from the latent space as input, and its output needs to imitate the real samples in the training set as much as possible.
  • the input of the discriminator network is the real sample or the output of the generator network. Its purpose is to distinguish the output of the generator network from the real sample as much as possible.
  • the generator network should deceive the discriminator network as much as possible.
  • the two networks compete with each other and continually adjust their parameters, so that the generator finally produces images realistic enough to pass for real ones.
  • Video face swapping: face swapping is defined as transferring the input source image (source) onto the template face (template), so that the output face (fake) keeps the expression, angle, background and other information of the template face.
  • an image synthesis method is provided.
  • the above-mentioned image synthesis method can be applied to a hardware environment composed of a server 101 and a terminal device 103 as shown in Figure 1.
  • the server 101 is connected to the terminal device 103 via a network, and can be used to provide services for the terminal device or an application installed on the terminal device.
  • the application can be a video application, an instant messaging application, a browser application, an educational application, a game application, etc.
  • a database 105 can be set on the server 101 or independently of the server 101 to provide data storage services for the server 101, for example, a game data storage server.
  • the above-mentioned network may include, but is not limited to: a wired network, a wireless network, wherein the wired network includes: a local area network, a metropolitan area network and a wide area network, and the wireless network includes: Bluetooth, WIFI and other networks that realize wireless communication.
  • the terminal device 103 may be a terminal equipped with an application program, and may include but is not limited to at least one of the following: a mobile phone (such as an Android phone, an iOS phone, etc.), a laptop, a tablet computer, a PDA, a MID (Mobile Internet Devices), a PAD, a desktop computer, a smart TV, an intelligent voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, and other computer devices.
  • the server 101 may be a single server, or a server cluster consisting of multiple servers, or a cloud server.
  • the above-mentioned image synthesis method can be implemented in the terminal device 103 through the following steps:
  • the target object in the template image has a target posture or is blocked.
  • the above-mentioned image synthesis method can also be implemented by a server, for example the server 101 shown in FIG1, or implemented jointly by the terminal device and the server.
  • the image synthesis method includes:
  • S202 obtaining a source image and a template image to be processed, wherein the source image includes identity information of a source image to be synthesized, the template image includes background information of a template image to be synthesized, and a target object in the template image has a target posture or is occluded.
  • the above-mentioned source image may include but is not limited to an image that requires the use of identity information
  • the above-mentioned identity information may include but is not limited to facial features (such as the eyes and nose) and other face characteristics in the image
  • the above-mentioned template image may include but is not limited to an image that requires the use of background information
  • the above-mentioned background information may include but is not limited to expression features, angle features, background features, etc. in the image.
  • the above-mentioned target object may include but is not limited to people, animals, game characters, film and television characters, virtual images, etc. included in the template image.
  • the above-mentioned target posture can be understood as the target object in the template image having a large posture, where a large posture is a posture whose yaw angle is greater than a preset angle value.
  • FIG3 is a schematic diagram of the posture of a target object in a three-dimensional space according to an embodiment of the present application.
  • the face (or head) movement of the target object in the three-dimensional space mainly has three angles, namely yaw (yaw angle), pitch (pitch angle), and roll (roll angle), which are generally referred to as face posture angles.
  • These three angles correspond to three situations, corresponding to left and right rotation of the face (rotation along the y-axis), up and down rotation (rotation along the x-axis), and sideways rotation (rotation along the z-axis).
  • Figure 4 is a schematic diagram of a target object with a target posture (a yaw angle greater than a preset angle value, e.g. 45°) according to an embodiment of the present application.
  • the above-mentioned target object being occluded can be understood as the face of the target object includes an object, and the object occludes a partial area of the face of the target object.
  • Figure 5 is a schematic diagram of a target object being blocked according to an embodiment of the present application.
  • a face object in a three-dimensional space wears a mask. Since the mask blocks other facial areas except the eyes, it can be understood that the target object is blocked by the mask, that is, the target object is blocked.
  • the target synthesis operation may include but is not limited to inputting the source image and the template image into a pre-trained image synthesis model for synthesis.
  • the initial synthesized image is a synthesized image whose effect needs to be improved.
  • the partial area to be corrected may include but is not limited to an area where the combination area of the source image and the template image appears blurred or ghosted, etc.
  • Fig. 6 is a schematic diagram of an image synthesis method according to an embodiment of the present application.
  • the source image and the template image are input into the synthesis network (image synthesis model) to obtain an initial synthesized image, wherein the nose portion of the initial synthesized image has a double image, and this area (nose portion) is the partial area to be corrected.
  • the target correction operation may include but is not limited to inputting the source image, the template image and the initial synthesized image into a pre-trained image correction model.
  • the above initial synthetic image is a synthetic image whose effect needs to be improved, wherein the above partial area may include but is not limited to the area where the combination of the above source image and the template image appears blurred or ghosted and needs to be corrected. Correcting the partial area (or correcting the partial area to be corrected) can be understood as correcting the above partial area to be corrected according to the target residual image.
  • the target residual image may include but is not limited to an image output by the image correction model, which is used to correct the partial area to be corrected.
  • Figure 7 is a schematic diagram of another image synthesis method according to an embodiment of the present application.
  • the source image, the template image and the initial synthesized image are input into the correction network to obtain a target residual image, wherein a ghost image appears in the nose part of the initial synthesized image, and this area is the above-mentioned partial area to be corrected.
  • a target residual image for correcting the partial area can be generated.
  • the above-mentioned synthesis of the initial synthesized image and the target residual image into the target synthesized image may include but is not limited to a superposition operation.
  • the superposition operation superimposes the pixel values of each pixel in the initial synthesized image and the target residual image to obtain the above-mentioned target synthesized image.
  • the target synthetic image includes both the source image identity information and the template image background information; meanwhile, the partial area to be corrected in the initial synthetic image is displayed normally in the target synthetic image as the corrected partial area. A sketch of the superposition step is given below.
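  • As an illustration of the superposition described above, the following is a minimal sketch (Python/NumPy assumed; the array names and the 8-bit value range are illustrative assumptions, not taken from the patent) of adding the target residual image to the initial synthesized image pixel by pixel:

```python
import numpy as np

def superimpose(initial_synth: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Add the residual image to the initial synthesized image pixel by pixel."""
    # Work in a signed dtype so negative residual values can darken pixels,
    # then clip back to the valid 8-bit range.
    corrected = initial_synth.astype(np.int16) + residual.astype(np.int16)
    return np.clip(corrected, 0, 255).astype(np.uint8)

# Example with a 512x512 RGB initial result and a residual map of the same shape.
initial_synth = np.zeros((512, 512, 3), dtype=np.uint8)
residual = np.zeros((512, 512, 3), dtype=np.int16)
target_synth = superimpose(initial_synth, residual)
```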
  • the above-mentioned image synthesis method may include but is not limited to being implemented based on artificial intelligence.
  • artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operating/interactive systems, mechatronics, and other technologies.
  • Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer vision is a science that studies how to make machines "see". More specifically, it refers to machine vision such as using cameras and computers to replace human eyes to identify and measure targets, and to further perform graphic processing so that the processed result becomes an image more suitable for human observation or for transmission to instruments for detection.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous positioning and map construction, and other technologies, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
  • GAN: Generative Adversarial Network.
  • the source image and the template image are input into a generative adversarial network to perform a target synthesis operation, and then the source image, the template image and the initial synthesized image are input into another generative adversarial network to perform a target correction operation to obtain a target residual image. Finally, the initial synthesized image and the target residual image are synthesized into the above-mentioned target synthesized image.
  • a target synthesis operation is performed on a source image and a template image to obtain an initial synthesized image, including: inputting the source image and the template image into a target synthesis network to obtain an initial synthesized image.
  • the initial synthesized image is obtained in the target synthesis network in the following manner: performing a splicing operation on the source image and the template image to obtain a first spliced image, wherein the number of channels of the first spliced image is the sum of the number of channels of the source image and the template image; performing an encoding operation on the first spliced image to obtain first intermediate layer feature information with a gradually increasing number of channels; performing a decoding operation on the first intermediate layer feature information to obtain an initial synthesized image with a gradually decreasing number of channels, wherein the number of channels of the initial synthesized image is the same as the number of channels of the source image.
  • the stitching operation may include but is not limited to performing feature extraction operations on the source image and the template image respectively, and then superimposing the extracted feature maps to obtain the first stitched image.
  • for example, if the number of RGB channels of the source image and the template image are both 3 and the size is 512*512 pixels, a first spliced image of 512*512*6 dimensions is obtained after splicing.
  • the encoding operation may include but is not limited to performing a convolution operation on the first stitched image
  • the decoding operation may include but is not limited to performing a deconvolution operation on the first stitched image.
  • the encoding operation is used, for example, to represent the input as a feature vector (for feature extraction).
  • the decoding operation is used, for example, to represent the feature vector as an output (including classification).
  • the input first spliced image is of 512*512*6 dimensions, which is gradually encoded into 256*256*32 dimensions, 128*128*64 dimensions, 64*64*128 dimensions, 32*32*256 dimensions, and so on, to obtain the first intermediate layer feature information.
  • the first intermediate layer feature information is sent to the decoder, which mainly performs deconvolution operations to gradually double the resolution, and decodes the first intermediate layer feature information into 32*32*256 dimensions, 64*64*128 dimensions, 128*128*64 dimensions, 256*256*32 dimensions, 512*512*3 dimensions, and finally obtains the initial synthesized image. A sketch of this encoder-decoder shape progression is given below.
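  • A minimal sketch (PyTorch assumed) of the encoder-decoder shape progression described above, from a 512*512*6 spliced input down to 32*32*256 and back up to a 512*512*3 output; the kernel sizes, activations and output non-linearity are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    # Halve the spatial resolution and increase the channel count (convolution).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))

def up(in_ch, out_ch):
    # Double the spatial resolution and decrease the channel count (deconvolution).
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU())

class SynthesisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            down(6, 32),     # 512*512*6  -> 256*256*32
            down(32, 64),    # 256*256*32 -> 128*128*64
            down(64, 128),   # 128*128*64 -> 64*64*128
            down(128, 256),  # 64*64*128  -> 32*32*256
        )
        self.decoder = nn.Sequential(
            up(256, 128),    # 32*32*256  -> 64*64*128
            up(128, 64),     # 64*64*128  -> 128*128*64
            up(64, 32),      # 128*128*64 -> 256*256*32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),  # 256*256*32 -> 512*512*3
            nn.Tanh(),
        )

    def forward(self, source, template):
        # Splice the two 3-channel images into a 6-channel input along the channel axis.
        x = torch.cat([source, template], dim=1)
        return self.decoder(self.encoder(x))

fake1 = SynthesisNet()(torch.randn(1, 3, 512, 512), torch.randn(1, 3, 512, 512))
```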
  • the above method also includes: training the first synthesis network to obtain a target synthesis network.
  • the target synthesis network is obtained by training the first synthesis network in the following manner: obtaining a first sample source image, a first sample template image, and a label image, wherein the label image is a predetermined image obtained by synthesizing the first sample source image and the first sample template image; performing a splicing operation on the first sample source image and the first sample template image to obtain a first sample spliced image, wherein the number of channels of the first sample spliced image is the sum of the number of channels of the first sample source image and the first sample template image;
  • the first synthesis network performs an encoding operation on the first sample spliced image to obtain first sample intermediate layer feature information with a gradually increasing number of channels; the first synthesis network performs a decoding operation on the first sample intermediate layer feature information to obtain a first sample initial synthesis image with a gradually decreasing number of channels, wherein the number of channels of the first sample initial synthesis image is the same as the number of channels of the first sample source image.
  • the stitching operation may include but is not limited to performing feature extraction operations on the first sample source image and the first sample template image respectively, and then superimposing the extracted feature maps to obtain the first sample stitching image.
  • the number of RGB channels of the first sample source image and the first sample template image are both 3, and the size is 512*512 pixels.
  • a first sample spliced image with a dimension of 512*512*6 is obtained.
  • the encoding operation may include but is not limited to performing a convolution operation on the first sample spliced image
  • the decoding operation may include but is not limited to performing a deconvolution operation on the first sample spliced image.
  • the input first sample spliced image is 512*512*6 dimensions, which are gradually encoded into 256*256*32 dimensions, 128*128*64 dimensions, 64*64*128 dimensions, 32*32*256 dimensions, and so on, to obtain the above-mentioned first sample intermediate layer feature information, and send the first sample intermediate layer feature information to the decoder.
  • the decoder mainly performs deconvolution operations, gradually doubling the resolution, and decodes the first sample intermediate layer feature information into 32*32*256 dimensions, 64*64*128 dimensions, 128*128*64 dimensions, 256*256*32 dimensions, 512*512*3 dimensions, and finally obtains the first sample initial synthesis image.
  • the first target loss value may include but is not limited to the overall loss value of the first synthetic network.
  • the first loss condition may be a preset loss condition, for example, the first target loss value is less than the first preset value.
  • the first synthetic network is determined as the target synthetic network.
  • the parameters of the first synthetic network are adjusted until the first target loss value meets the first loss condition.
  • a first target loss value of a first synthesis network is calculated according to a first sample initial synthesis image, a first sample source image and a label image, including: using a pre-trained feature extraction module to perform a feature extraction operation on the label image to extract feature information of different levels of the label image and obtain a first group of sample feature maps, wherein each sample feature map in the first group of sample feature maps corresponds to feature information extracted from the label image at one level; using the feature extraction module to perform the feature extraction operation on the first sample initial synthesis image to extract feature information of different levels of the first sample initial synthesis image and obtain a second group of sample feature maps, wherein each sample feature map in the second group of sample feature maps corresponds to feature information extracted from the first sample initial synthesis image at one level; calculating a first loss value according to the first group of sample feature maps and the second group of sample feature maps, wherein the first loss value is calculated from the feature information extracted from the label image and the feature information extracted from the first sample initial synthesis image at corresponding levels; and jointly determining the first loss value and the reconstruction loss value of the first synthesis network as the first target loss value.
  • the above-mentioned label image is an image predetermined in the training process, which is the target of the current training process and can be pre-generated manually, for example, a true value image pre-generated by manual annotation.
  • the feature extraction module may include but is not limited to a pre-trained AlexNet network for extracting features of images at different layers and calculating the LPIPS (Learned Perceptual Image Patch Similarity) loss.
  • FIG8 is a schematic diagram of another image synthesis method according to an embodiment of the present application.
  • in a deep network model, low-level features can represent primitives such as lines and colors, and high-level features can represent semantic parts such as components; therefore, the similarity of different images can be measured by the features extracted by AlexNet.
  • the first loss value is calculated based on the first set of sample feature maps and the second set of sample feature maps as the LPIPS loss (LPIPS_loss), i.e., the feature-level difference accumulated over the corresponding levels; a simplified sketch of such a loss is given below.
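  • A simplified sketch of such an LPIPS-style loss (PyTorch/torchvision assumed): features of the label image and the first sample initial synthesis image are compared at several AlexNet layers. The official LPIPS metric additionally applies learned per-channel weights; this unweighted version and the chosen tap layers are illustrative assumptions, not the exact loss of the disclosure:

```python
import torch
import torch.nn.functional as F
from torchvision.models import alexnet

# Pretrained AlexNet feature extractor (frozen); assumes a recent torchvision version.
backbone = alexnet(weights="DEFAULT").features.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

TAP_LAYERS = (1, 4, 7, 9, 11)  # ReLU outputs, from low-level to high-level

def multi_level_features(img: torch.Tensor):
    feats, x = [], img
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in TAP_LAYERS:
            feats.append(x)
    return feats

def lpips_like_loss(fake1: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    loss = 0.0
    for f_fake, f_gt in zip(multi_level_features(fake1), multi_level_features(gt)):
        # Unit-normalize along the channel axis, then average the squared difference.
        loss = loss + (F.normalize(f_fake, dim=1) - F.normalize(f_gt, dim=1)).pow(2).mean()
    return loss
```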
  • the reconstruction loss value of the above-mentioned first synthesis network may include but is not limited to Reconstruction_loss + D_loss + G_loss, wherein Reconstruction_loss corresponds to the aforementioned reconstruction loss value, G_loss is the loss value of the generator, and D_loss is the loss value of the discriminator.
  • a first target loss value of a first synthesis network is calculated based on a first sample initial synthesis image, a first sample source image and a label image, including: performing a recognition operation on the first sample initial synthesis image to obtain a first sample feature vector, wherein the first sample feature vector is used to represent the source image identity information in the first sample initial synthesis image; performing a recognition operation on the first sample source image to obtain a second sample feature vector, wherein the second sample feature vector is used to represent the source image identity information in the first sample source image; calculating a second loss value based on the first sample feature vector and the second sample feature vector, wherein the second loss value represents the similarity between the first sample feature vector and the second sample feature vector; jointly determining the second loss value and the reconstruction loss value of the first synthesis network as the first target loss value, wherein the reconstruction loss value is a loss value for performing encoding operations and decoding operations.
  • the above recognition operation may include but is not limited to being implemented by a face recognition network, which is used to extract face features; the dimension of this feature is generally 1024. Since the identity information of the synthesized image should be as close as possible to the identity information of the source image, the face features are extracted as a constraint.
  • Ai and Bi represent the components of vector A and vector B respectively, where vector A is the first sample feature vector mentioned above and vector B is the second sample feature vector mentioned above; the cosine-similarity formula they refer to is reproduced below.
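  • The cosine-similarity expression referred to above did not survive extraction; its standard definition, written in the notation of the surrounding text and consistent with the ID_loss expression given later in this document, is:

```latex
\mathrm{cosine\_similarity}(A, B)
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}},
\qquad
\mathrm{ID\_loss} = 1 - \mathrm{cosine\_similarity}(A, B)
```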
  • a target correction operation is performed on a source image, a template image, and an initial synthesized image to obtain a target residual image, including: inputting the source image, the template image, and the initial synthesized image into a target correction network to obtain a target residual image.
  • the target residual image is obtained by: performing a splicing operation on the source image, the template image, and the initial synthesized image to obtain a second spliced image, wherein the number of channels of the second spliced image is the sum of the number of channels of the source image, the template image, and the initial synthesized image; performing an encoding operation on the second spliced image to obtain second intermediate layer feature information with a gradually increasing number of channels; performing a decoding operation on the second intermediate layer feature information to obtain a target residual image with a gradually decreasing number of channels, wherein the number of channels of the target residual image is the same as the number of channels of the initial synthesized image.
  • the stitching operation may include but is not limited to performing feature extraction operations on the source image, the template image, and the initial synthesized image respectively, and then superimposing the extracted feature maps to obtain the second stitched image.
  • the number of RGB channels of the source image, template image and initial synthesized image are all 3, and the size is 512*512 pixels. Then, after splicing the source image, template image and initial synthesized image, a second spliced image with a dimension of 512*512*9 is obtained.
  • the encoding operation may include but is not limited to performing a convolution operation on the second stitched image
  • the decoding operation may include but is not limited to performing a deconvolution operation on the second stitched image
  • the input second spliced image is of 512*512*9 dimension, which is gradually encoded into 256*256*18 dimension, 128*128*32 dimension, 64*64*64 dimension, 32*32*128 dimension, and so on, to obtain the above-mentioned second intermediate layer feature information, and send the second intermediate layer feature information to the decoder.
  • the decoder mainly performs deconvolution operation, gradually doubles the resolution, and decodes the second intermediate layer feature information into 32*32*128 dimension, 64*64*64 dimension, 128*128*32 dimension, 256*256*16 dimension, 512*512*9 dimension, and finally obtains the target residual image.
  • the above method also includes: training the initial correction network to obtain the target correction network.
  • the target correction network is obtained by training the initial correction network in the following manner: obtaining a second sample source image, a second sample template image, a label residual image, and a second sample initial synthetic image, wherein the second sample initial synthetic image is an image obtained after the second sample source image and the second sample template image perform a target synthesis operation, the label residual image is determined according to the label image and the second sample initial synthetic image, and the label image is a predetermined image obtained after the second sample source image and the second sample template image are synthesized; performing a splicing operation on the second sample source image, the second sample template image, and the second sample initial synthetic image to obtain a second sample spliced image, wherein the second sample spliced image
  • the number of channels is the sum of the number of channels of the second sample source image, the second sample template image and the second sample initial synthesized image; an encoding operation is performed on the second sample spliced image to obtain second sample intermediate layer feature information with a gradually increasing number of channels; and a decoding operation is performed on the second sample intermediate layer feature information to obtain a sample residual image with a gradually decreasing number of channels.
  • the above-mentioned stitching operation may include but is not limited to performing feature extraction operations on the second sample source image, the second sample template image and the second sample initial synthesized image respectively, and then superimposing the extracted feature maps to obtain the above-mentioned second sample stitching image.
  • the number of RGB channels of the second sample source image, the second sample template image and the second sample initial synthesized image are all 3, and the size is 512*512 pixels. Then, after splicing the second sample source image, the second sample template image and the second sample initial synthesized image, a second sample spliced image with a dimension of 512*512*9 is obtained.
  • the encoding operation may include but is not limited to performing a convolution operation on the second sample spliced image
  • the decoding operation may include but is not limited to performing a deconvolution operation on the second sample spliced image.
  • the input second sample spliced image is of 512*512*9 dimension, which is gradually encoded into 256*256*18 dimension, 128*128*32 dimension, 64*64*64 dimension, 32*32*128 dimension, and so on, to obtain the above-mentioned second sample intermediate layer feature information, and send the second sample intermediate layer feature information to the decoder.
  • the decoder mainly performs deconvolution operation, gradually doubles the resolution, and decodes the second sample intermediate layer feature information into 32*32*128 dimension, 64*64*64 dimension, 128*128*32 dimension, 256*256*16 dimension, 512*512*9 dimension, and finally obtains the sample residual image.
  • the second target loss value may include but is not limited to the overall loss value of the initial correction network, and the second loss condition may be a preset loss condition, for example, the second target loss value is less than the second preset value.
  • the initial correction network is determined as the target correction network, and when the second target loss value does not meet the second loss condition, the parameters of the initial correction network are adjusted until the second target loss value meets the second loss condition.
  • the above-mentioned label image is an image predetermined during the training process; it is the target of the current training process and can be pre-generated manually, for example, a ground-truth image produced by manual annotation.
  • the above-mentioned label residual image is a difference image between the label image and the second sample initial composite image.
  • a second target loss value of the initial correction network is calculated based on a second sample source image, a label residual image, a second sample initial synthesized image, and a sample residual image, including: calculating a third loss value based on the sample residual image and the label residual image; jointly determining the third loss value and the reconstruction loss value of the second synthesis network as the second target loss value, wherein the second synthesis network is used to generate a second sample initial synthesized image, and the reconstruction loss value is a loss value for performing encoding operations and decoding operations.
  • the reconstruction loss value of the above-mentioned initial correction network may include but is not limited to Reconstruction_loss + D_loss + G_loss, wherein Reconstruction_loss corresponds to the aforementioned reconstruction loss value, G_loss is the loss value of the generator, and D_loss is the loss value of the discriminator.
  • the third loss value and the reconstruction loss value of the first synthesis network are jointly determined as the second target loss value, including: synthesizing the second sample initial synthesis image and the sample residual image into a sample target synthesis image; using a pre-trained feature extraction module to perform a feature extraction operation on the sample target synthesis image to obtain a third group of sample feature maps, wherein the feature extraction module is used to extract feature information of different levels, and each sample feature map in the third group of sample feature maps corresponds to feature information extracted from the sample target synthesis image at one level; using the feature extraction module to perform the feature extraction operation on the label image to obtain a first group of sample feature maps, wherein each sample feature map in the first group of sample feature maps corresponds to feature information extracted from the label image at one level; calculating a fourth loss value according to the third group of sample feature maps and the first group of sample feature maps, wherein the fourth loss value is calculated from the feature information extracted from the sample target synthesis image and the feature information extracted from the label image at corresponding levels; and jointly determining the third loss value, the fourth loss value and the reconstruction loss value as the second target loss value.
  • the above-mentioned label image is an image predetermined in the training process, which is the target of the current training process and can be pre-generated manually, for example, a true value image pre-generated by manual annotation.
  • the above-mentioned feature extraction module may include but is not limited to a pre-trained Alexnet network, which is used to extract features of the image at different layers and calculate the LPIPS (Learned Perceptual Image Patch Similarity) loss.
  • in a deep network model, low-level features can represent primitives such as lines and colors, and high-level features can represent semantic parts; therefore, the similarity of different images can be measured by the features extracted by AlexNet.
  • the third loss value and the reconstruction loss value of the first synthesis network are jointly determined as the second target loss value, including: performing a recognition operation on the sample target synthetic image to obtain a third sample feature vector, wherein the third sample feature vector is used to represent the source image identity information in the sample target synthetic image; performing a recognition operation on the second sample source image to obtain a fourth sample feature vector, wherein the fourth sample feature vector is used to represent the source image identity information in the second sample source image; calculating the fifth loss value of the second synthesis network based on the third sample feature vector and the fourth sample feature vector, wherein the second target loss value includes the fifth loss value, and the fifth loss value represents the similarity between the third sample feature vector and the fourth sample feature vector; and jointly determining the fifth loss value with the third loss value, the fourth loss value, and the reconstruction loss value of the second synthesis network as the second target loss value.
  • the above recognition operation may include but is not limited to being implemented by a face recognition network, which is used to extract face features; the dimension of this feature is generally 1024. Since the identity information of the synthesized image should be as close as possible to the identity information of the source image, the face features are extracted as a constraint.
  • extract identification features from the sample target synthetic image to obtain fake2_id_features, and extract identification features from the second sample source image to obtain source_id_features; then calculate the ID loss, which corresponds to the fifth loss value mentioned above.
  • Ai and Bi represent the components of vector A and vector B respectively, where vector A is the third sample feature vector mentioned above and vector B is the fourth sample feature vector mentioned above.
  • the above method before performing a target synthesis operation on the source image and the template image to obtain an initial synthesized image, also includes: performing object detection on the source image and the template image respectively to obtain object regions in the source image and the template image; performing a registration operation on the object region to determine key point information in the object region, wherein the key point information is used to represent the object in the object region; and cropping the source image and the template image respectively according to the key point information to obtain the source image and the template image for performing the target synthesis operation.
  • the above object detection may include but is not limited to preprocessing the input image to obtain a cropped source image and template image.
  • the source image and the template image are face images, specifically including:
  • Face registration is an image preprocessing technology that can locate the coordinates of the key points of the facial features.
  • the number of key points of the facial features is a pre-set fixed value that can be defined according to different semantic situations (usually 5 points, 68 points, 90 points, etc.); a sketch of this crop-by-keypoints step is given below.
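  • A minimal sketch of this crop-by-keypoints step (Python/OpenCV assumed, 5-point layout): the detected landmarks are aligned to a canonical layout with a similarity transform. The canonical coordinates below are illustrative values, not taken from the patent, and the upstream face detector and landmark model are assumed to exist:

```python
import cv2
import numpy as np

# Illustrative target positions (in a 512x512 crop) for a 5-point layout:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
CANONICAL_5PT = np.float32([
    [187, 205], [325, 205], [256, 285], [200, 360], [312, 360],
])

def crop_by_keypoints(image: np.ndarray, keypoints: np.ndarray, size: int = 512) -> np.ndarray:
    """Align the detected face to the canonical layout with a similarity transform."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), CANONICAL_5PT)
    return cv2.warpAffine(image, matrix, (size, size))
```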
  • this application requires two additional pre-trained models to assist in the learning of the synthesis network, including but not limited to: a face recognition network and a pre-trained Alexnet network.
  • the face recognition network is used to extract facial features.
  • the dimension of this feature is generally 1024 dimensions. Since the identity of the face to be generated (corresponding to the aforementioned initial synthetic image) and the source face (corresponding to the aforementioned source image) should be as close as possible, facial features are extracted as a constraint.
  • the pre-trained Alexnet network is used to extract features of the image at different layers and calculate the LPIPS loss. In a deep network model, low-level features can represent primitives such as lines and colors, and high-level features can represent semantic parts such as components. Therefore, the overall similarity can be measured by comparing the features extracted from two images using Alexnet.
  • the training process of the first stage synthesis network is as follows:
  • S11 prepare face-changing data, including a triplet pair of source (corresponding to the aforementioned source image), template (corresponding to the aforementioned template image), and gt (corresponding to the aforementioned label image).
  • the synthesis network can be generally divided into two parts: encoder and decoder.
  • the encoder continuously halves the size of the input image through convolution calculations, and the number of channels gradually increases.
  • the synthesis network input is gradually encoded from 512*512*6 dimensions (two images are spliced together as input, and the number of RGB channels of each image is 3) to 256*256*32 dimensions, 128*128*64 dimensions, 64*64*128 dimensions, 32*32*256 dimensions, and so on.
  • the decoder mainly performs deconvolution operations, gradually doubles the image resolution, and decodes the result obtained in S12 into 32*32*256 dimensions, 64*64*128 dimensions, 128*128*64 dimensions, 256*256*32 dimensions, 512*512*3 dimensions, and finally the face-changing result fake1 is obtained.
  • before computing the identity loss below, identification features (for example, identity information, including but not limited to facial features and other face characteristics in the image) are extracted from the face-changing result fake1 to obtain fake1_id_features, and identification features are likewise extracted from the source image source to obtain source_id_features.
  • S16 calculate the feature loss of the face-swapped result fake1.
  • This loss function calculates the difference at the feature level between the two images (fake1, gt), and is called the LPIPS loss (LPIPS_loss).
  • ID_loss = 1 - cosine_similarity(fake1_id_features, source_id_features), where the cosine similarity is calculated as in the formula given earlier; Ai and Bi represent the components of vector A and vector B respectively, vector A is, for example, the fake1_id_features mentioned above, and vector B is, for example, the source_id_features mentioned above.
  • the generative adversarial network of the embodiment of the present application also includes a discriminator network, which is used to determine whether the generated synthetic image (face-changing result) is real.
  • the loss of the first stage is Reconstruction_loss + LPIPS_loss + ID_loss + D_loss + G_loss, which is used to optimize the parameters of the synthesis network; a schematic training step is sketched below.
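  • A schematic generator-side training step that sums the loss terms named above (PyTorch assumed). The loss weights, the exact G_loss form, and the helper objects synth_net, discriminator, id_net, lpips_fn and opt_g are assumptions for illustration; lpips_fn can be the lpips_like_loss sketch given earlier, and the discriminator itself is updated separately with D_loss:

```python
import torch.nn.functional as F

def synthesis_training_step(source, template, gt, synth_net, discriminator, id_net, lpips_fn, opt_g):
    """One generator-side update; discriminator updates (D_loss) happen in a separate step."""
    fake1 = synth_net(source, template)

    reconstruction_loss = F.l1_loss(fake1, gt)         # pixel-level Reconstruction_loss (L1 is an assumption)
    lpips_loss = lpips_fn(fake1, gt)                    # feature-level LPIPS_loss, e.g. the sketch above
    id_loss = 1 - F.cosine_similarity(id_net(fake1), id_net(source), dim=1).mean()  # ID_loss
    g_loss = -discriminator(fake1).mean()               # one common choice for the generator term G_loss

    total = reconstruction_loss + lpips_loss + id_loss + g_loss
    opt_g.zero_grad()
    total.backward()
    opt_g.step()
    return fake1
```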
  • the second stage is the training of the correction network:
  • the second stage of correction network training begins.
  • the structure of the correction network is similar to that of the synthetic network.
  • S22, take source, template and the first-stage face-changing result fake1 as input and send them to the correction network.
  • the correction network can also include an encoder and a decoder structure, but its output is the residual map diff_map.
  • the final face-changing result is fake2 = fake1 + diff_map.
  • the fake2 image is the result of correcting the fake1 image by correcting the residual image diff_map.
  • the reconstruction loss newly added for the residual map of the correction network (Diff_reconstruction_loss) is the difference between the label residual image and the sample residual image; the smaller the difference between the two, the better. A sketch of this loss is given below.
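  • A sketch of the residual-map reconstruction loss (PyTorch assumed): the label residual image is the difference between the ground-truth image gt and the first-stage result fake1, and the predicted residual diff_map should be close to it. The use of an L1 distance is an assumption; the text above only requires the difference between the two to be as small as possible:

```python
import torch.nn.functional as F

def diff_reconstruction_loss(diff_map, gt, fake1):
    label_residual = gt - fake1            # label residual image
    return F.l1_loss(diff_map, label_residual)

# The corrected result of the second stage is then fake2 = fake1 + diff_map.
```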
  • S26, calculate the other loss functions of the correction network; they can be similar to the loss functions of the synthesis network in the first stage, with the input fake1 image replaced by the fake2 image.
  • the specific application process includes the following steps:
  • (5) is the module that needs to be trained when the method of this technology is actually used, so as to cooperate and interact with the other modules: first, it receives image input from the video acquisition module; then it performs face detection and crops the face area; next, face swapping is performed with the method of this technology; finally, the results are displayed.
  • double-layer faces can be eliminated in large postures, a good effect of the synthesized image can be maintained, and a stable face-changing effect of the video can be maintained even under occlusion.
  • an image synthesis device for implementing the above-mentioned image synthesis method is also provided. As shown in FIG10 , the device includes the following modules.
  • the acquisition module 1002 is used to acquire a source image to be processed and a template image, wherein the source image includes identity information of the source image to be synthesized, and the template image includes background information of the template image to be synthesized.
  • the target object in the template image may have a target posture or be blocked.
  • the first synthesis module 1004 is used to perform a target synthesis operation on the source image and the template image to obtain an initial synthesized image.
  • the initial composite image includes the source image identity information, the template image background information and the partial area to be corrected.
  • the correction module 1006 is used to perform a target correction operation on the source image, the template image and the initial synthesized image to obtain a target residual image.
  • the target residual image is used to correct the partial area.
  • the second synthesis module 1008 is used to synthesize the initial synthesis image and the target residual image into a target synthesis image.
  • the target synthesis image includes the source image identity information and the template image background information and the partial area corrected according to the target residual image.
  • the device is used to perform a target synthesis operation on the source image and the template image to obtain an initial synthesized image in the following manner: input the source image and the template image into a target synthesis network together to obtain the initial synthesized image; wherein, the initial synthesized image is obtained in the target synthesis network in the following manner: perform a splicing operation on the source image and the template image to obtain a first spliced image, wherein the number of channels of the first spliced image is the sum of the number of channels of the source image and the template image; perform an encoding operation on the first spliced image to obtain first intermediate layer feature information with a gradually increasing number of channels; perform a decoding operation on the first intermediate layer feature information to obtain the initial synthesized image with a gradually decreasing number of channels, wherein the number of channels of the initial synthesized image is the same as the number of channels of the source image.
  • the device is also used to: train a first synthesis network to obtain the target synthesis network, wherein the target synthesis network is obtained by training the first synthesis network in the following manner: obtaining a first sample source image, a first sample template image and a label image, wherein the label image is a predetermined image that is expected to be obtained by synthesizing the first sample source image and the first sample template image; performing a splicing operation on the first sample source image and the first sample template image to obtain a first sample spliced image, wherein the number of channels of the first sample spliced image is the sum of the number of channels of the first sample source image and the first sample template image; performing an encoding operation on the first sample spliced image through the first synthesis network to obtain first sample intermediate layer feature information with a gradually increasing number of channels; performing a decoding operation on the first sample intermediate layer feature information through the first synthesis network to obtain the first sample initial synthesis image with a gradually decreasing number of channels, wherein the number of channels of the first sample initial synthesis image is the same as the number of channels of the first sample source image.
  • the device is used to calculate the first target loss value of the first synthesis network according to the first sample initial synthesis image, the first sample source image and the label image in the following manner, including: using a pre-trained feature extraction module to perform a feature extraction operation on the label image, extracting feature information of different levels of the label image, and obtaining a first group of sample feature maps, wherein each sample feature map in the first group of sample feature maps corresponds to a level of feature information extracted from the label image; using the feature extraction module to perform the feature extraction operation on the first sample initial synthesis image to extract feature information of different levels of the first sample initial synthesis image, and obtain a second group of sample feature maps, wherein each sample feature map in the second group of sample feature maps corresponds to a level of feature information extracted from the first sample initial synthesis image; calculating a first loss value according to the first group of sample feature maps and the second group of sample feature maps, wherein the first loss value is calculated from the feature information extracted from the label image and the feature information extracted from the first sample initial synthesis image at corresponding levels; and jointly determining the first loss value and the reconstruction loss value of the first synthesis network as the first target loss value.
  • the device is used to calculate the first target loss value of the first synthesis network according to the first sample initial synthesis image, the first sample source image and the label image in the following manner: perform a recognition operation on the first sample initial synthesis image to obtain a first sample feature vector, wherein the first sample feature vector is used to represent the source image identity information in the first sample initial synthesis image; perform the recognition operation on the first sample source image to obtain a second sample feature vector, wherein the second sample feature vector is used to represent the source image identity information in the first sample source image; calculate a second loss value according to the first sample feature vector and the second sample feature vector, wherein the second loss value represents the similarity between the first sample feature vector and the second sample feature vector; jointly determine the second loss value and the reconstruction loss value of the first synthesis network as the first target loss value, wherein the reconstruction loss value is the loss value for performing the encoding operation and the decoding operation.
  • the device is used to perform a target correction operation on the source image, the template image and the initial synthetic image in the following manner to obtain a target residual image: the source image, the template image and the initial synthetic image are input into a target correction network together to obtain the target residual image; wherein, the target residual image is obtained in the target correction network in the following manner: a splicing operation is performed on the source image, the template image and the initial synthetic image to obtain a second spliced image, wherein the number of channels of the second spliced image is the sum of the number of channels of the source image, the template image and the initial synthetic image; an encoding operation is performed on the second spliced image to obtain second intermediate layer feature information with a gradually increasing number of channels; a decoding operation is performed on the second intermediate layer feature information to obtain the target residual image with a gradually decreasing number of channels, wherein the number of channels of the target residual image is the same as the number of channels of the initial synthetic image.
  • the device is also used to: train an initial correction network to obtain the target correction network, wherein the initial correction network is trained to obtain the target correction network in the following manner: obtain a second sample source image, a second sample template image, a label residual image and a second sample initial synthetic image, wherein the second sample initial synthetic image is an image obtained after the second sample source image and the second sample template image perform the target synthesis operation, the label residual image is determined according to the label image and the second sample initial synthetic image, and the label image is a predetermined image expected to be obtained after the second sample source image and the second sample template image are synthesized; perform a splicing operation on the second sample source image, the second sample template image and the second sample initial synthetic image to obtain a second sample spliced image, wherein the number of channels of the second sample spliced image is the sum of the number of channels of the second sample source image, the second sample template image and the second sample initial synthesized image; an encoding operation is performed on the second sample spliced image to obtain second sample intermediate layer feature information with a gradually increasing number of channels; and a decoding operation is performed on the second sample intermediate layer feature information to obtain a sample residual image with a gradually decreasing number of channels.
  • the device is used to calculate the second target loss value of the initial correction network according to the second sample source image, the label residual image, the second sample initial synthesized image and the sample residual image in the following manner: calculate a third loss value according to the sample residual image and the label residual image; and jointly determine the third loss value and the reconstruction loss value of a second synthesis network as the second target loss value, wherein the second synthesis network is used to generate the second sample initial synthesized image, and the reconstruction loss value is the loss value of performing the encoding operation and the decoding operation.
  • the device is used to determine the third loss value and the reconstruction loss value of the first synthesis network as the second target loss value in the following manner: synthesize the second sample initial synthesis image and the sample residual image into a sample target synthesis image; use a pre-trained feature extraction module to perform a feature extraction operation on the sample target synthesis image to obtain a third group of sample feature maps, wherein the feature extraction module is used to extract feature information of different levels, and each sample feature map in the third group of sample feature maps corresponds to a level of feature information extracted from the sample target synthesis image; use the feature extraction module to perform the feature extraction operation on the label image to obtain a first group of sample feature maps, wherein each sample feature map in the first group of sample feature maps corresponds to a level of feature information extracted from the label image; calculate a fourth loss value based on the third group of sample feature maps and the first group of sample feature maps, wherein the fourth loss value is calculated from the feature information extracted from the sample target synthesis image and the feature information extracted from the label image at corresponding levels; and jointly determine the fourth loss value, the third loss value and the reconstruction loss value of the second synthesis network as the second target loss value.
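The multi-level feature comparison described here is in the spirit of an LPIPS-style perceptual loss. A compact sketch follows; the use of torchvision's pre-trained AlexNet as the feature extraction module, the particular tap points and the plain L1 aggregation are illustrative assumptions rather than details fixed by this disclosure:

```python
import torch
import torchvision

class MultiLevelFeatureLoss(torch.nn.Module):
    """Compares two images with features collected at several depths of a pre-trained AlexNet."""

    def __init__(self):
        super().__init__()
        weights = torchvision.models.AlexNet_Weights.DEFAULT
        features = torchvision.models.alexnet(weights=weights).features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        self.features = features
        self.taps = {1, 4, 7, 11}   # layer indices after which a feature map is kept (assumed; 4 levels)

    def extract(self, x):
        # Input normalization to ImageNet statistics is omitted here for brevity (assumption).
        maps = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.taps:
                maps.append(x)
        return maps

    def forward(self, synthesized, label):
        loss = synthesized.new_zeros(())
        for f_syn, f_gt in zip(self.extract(synthesized), self.extract(label)):
            loss = loss + (f_syn - f_gt).abs().mean()   # sum of per-level L1 differences
        return loss
```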
  • the device is used to determine the third loss value and the reconstruction loss value of the first synthesis network as the second target loss value in the following manner: perform a recognition operation on the sample target synthetic image to obtain a third sample feature vector, wherein the third sample feature vector is used to represent the source image identity information in the sample target synthetic image; perform the recognition operation on the second sample source image to obtain a fourth sample feature vector, wherein the fourth sample feature vector is used to represent the source image identity information in the second sample source image; calculate the fifth loss value of the second synthesis network based on the third sample feature vector and the fourth sample feature vector, wherein the second target loss value includes the fifth loss value, and the fifth loss value represents the similarity between the third sample feature vector and the fourth sample feature vector; determine the fifth loss value together with the third loss value, the fourth loss value and the reconstruction loss value of the second synthesis network as the second target loss value.
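Assembling the terms named above into the second target loss can be sketched as below; equal weights are an assumption, and in practice each term would usually carry its own coefficient:

```python
def second_target_loss(diff_reconstruction_loss,   # third loss value: residual vs. label residual
                       perceptual_loss,            # fourth loss value: multi-level feature comparison
                       id_loss_value,              # fifth loss value: identity similarity
                       reconstruction_loss):       # reconstruction loss of the synthesis network
    # Joint determination of the second target loss value (equal weighting assumed).
    return diff_reconstruction_loss + perceptual_loss + id_loss_value + reconstruction_loss
```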
  • the device is also used to: before the target synthesis operation is performed on the source image and the template image to obtain the initial synthesized image, perform object detection on the source image and the template image respectively to obtain object regions in the source image and the template image; perform a registration operation on the object regions to determine key point information in the object regions, wherein the key point information is used to represent the objects in the object regions; and crop the source image and the template image respectively according to the key point information to obtain the source image and the template image used to perform the target synthesis operation.
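A hedged sketch of this preprocessing step is given below; the face detector, the landmark (registration) model and the 5-point alignment template are placeholders, not components specified by this disclosure:

```python
import numpy as np
import cv2

def crop_face(image, detector, landmark_model, output_size=512):
    """Detect the object (face) region, register key points, then crop/align the face.

    detector       : assumed callable returning a face bounding box (x, y, w, h).
    landmark_model : assumed callable returning key points (eyes, nose, mouth corners)
                     in face-region coordinates, shape (5, 2).
    """
    x, y, w, h = detector(image)                 # object region in the full image
    face_region = image[y:y + h, x:x + w]
    points = landmark_model(face_region)         # key point information (5 points assumed)

    # Canonical 5-point template for a square crop; the values are illustrative.
    template = np.float32([[0.35, 0.40], [0.65, 0.40],   # eye centers
                           [0.50, 0.55],                  # nose tip
                           [0.38, 0.75], [0.62, 0.75]]) * output_size

    # Estimate a similarity transform from detected key points (full-image coords) to the template.
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(points) + [x, y], template)
    return cv2.warpAffine(image, matrix, (output_size, output_size))
```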
  • FIG. 11 schematically shows a block diagram of a computer system structure of an electronic device for implementing an embodiment of the present application.
  • the computer system 1100 of the electronic device shown in FIG. 11 is merely an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
  • a computer system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage part 1108 into a random access memory (RAM) 1103. Various programs and data required for system operation are also stored in the random access memory 1103.
  • the central processing unit 1101, the read-only memory 1102, and the random access memory 1103 are connected to each other via a bus 1104.
  • an input/output interface 1105 (i.e., an I/O interface) is also connected to the bus 1104.
  • the following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, etc.; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1108 including a hard disk, etc.; and a communication section 1109 including a network interface card such as a LAN card, a modem, etc.
  • the communication section 1109 performs communication processing via a network such as the Internet.
  • a drive 1110 is also connected to the input/output interface 1105 as needed.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
  • the process described in each method flow chart can be implemented as a computer software program.
  • an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer readable medium, and the computer program contains a program code for executing the method shown in the flow chart.
  • the computer program can be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111.
  • when the computer program is executed by the central processing unit 1101, various functions defined in the system of the present application are executed.
  • an electronic device for implementing the above-mentioned image synthesis method is also provided, and the electronic device may be a terminal device or a server as shown in FIG. 1.
  • the embodiment of the present application is described by taking the electronic device as a terminal device as an example.
  • the electronic device includes a memory 1202 and a processor 1204, and a computer program is stored in the memory 1202, and the processor 1204 is configured to execute the steps in any of the above-mentioned method embodiments through the computer program.
  • the above-mentioned electronic device may be located in at least one network device among multiple network devices of a computer network.
  • the processor may be configured to perform the following steps through a computer program:
  • obtain a source image to be processed and a template image, wherein the source image includes source image identity information to be synthesized and the template image includes template image background information to be synthesized; perform a target synthesis operation on the source image and the template image to obtain an initial synthesized image, wherein the initial synthesized image includes the source image identity information, the template image background information and a partial area to be corrected; and perform a target correction operation on the source image, the template image and the initial synthesized image to obtain a target residual image;
  • the initial synthesized image and the target residual image are synthesized into a target synthesized image, wherein the target synthesized image includes source image identity information, template image background information, and a partial area corrected according to the target residual image.
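The final composition itself reduces to a per-pixel addition of the initial synthesized image and the residual; a minimal sketch follows, where the clamping range assumes images normalized to [0, 1]:

```python
import torch

def compose_target_image(fake1, diff_map):
    """Target synthesized image = initial synthesized image + target residual image."""
    fake2 = fake1 + diff_map
    return fake2.clamp(0.0, 1.0)   # keep pixel values in a valid range (assumed [0, 1] normalization)
```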
  • the structure shown in FIG. 12 is for illustration only, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a PDA, a mobile Internet device (MID) or a PAD.
  • FIG. 12 does not limit the structure of the above-mentioned electronic device.
  • for example, the electronic device may also include more or fewer components (such as a network interface) than those shown in FIG. 12, or have a configuration different from that shown in FIG. 12.
  • the memory 1202 can be used to store software programs and modules, such as the program instructions/modules corresponding to the image synthesis method and device in the embodiments of the present application.
  • the processor 1204 executes various functional applications and data processing by running the software programs and modules stored in the memory 1202, that is, to implement the above-mentioned image synthesis method.
  • the memory 1202 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 1202 may further include a memory remotely located relative to the processor 1204, and these remote memories may be connected to the terminal via a network.
  • the memory 1202 may be used to store information such as a synthesized image, but is not limited thereto.
  • the memory 1202 may include, but is not limited to, the acquisition module 1002, the first synthesis module 1004, the correction module 1006, and the second synthesis module 1008 in the image synthesis device.
  • other module units in the image synthesis device may also be included but are not limited thereto, which will not be described in detail in this example.
  • the transmission device 1206 is used to receive or send data via a network.
  • the network may include a wired network and a wireless network.
  • the transmission device 1206 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and routers via a network cable so as to communicate with the Internet or a local area network.
  • the transmission device 1206 is a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • the electronic device further includes: a display 1208 for displaying the target synthesized image; and a connection bus 1210 for connecting various module components in the electronic device.
  • the terminal device or server may be a node in a distributed system, wherein the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the multiple nodes through network communication.
  • the nodes may form a peer-to-peer (P2P) network, and any form of computing device, such as a server, terminal or other electronic device, may become a node in the blockchain system by joining the peer-to-peer network.
  • a computer-readable storage medium is provided, and a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the image synthesis method provided in various example implementations of the above-mentioned image synthesis aspects.
  • all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the hardware related to the terminal device.
  • the program can be stored in a computer-readable storage medium, and the storage medium may include: a flash drive, a read-only memory (ROM), a random access memory (RAM), a disk or an optical disk, etc.
  • if the integrated units in the above embodiments are implemented in the form of software functional units and are sold or used as independent products, they can be stored in the above computer-readable storage medium.
  • the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling one or more computer devices (which may be personal computers, servers or network devices, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the disclosed client can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • in addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, or through indirect coupling or communication connections between units or modules, which may be electrical or in other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

本申请公开了一种图像合成方法和装置、存储介质及电子设备。其中，该方法包括：首先，获取包括待合成的源图像身份信息的源图像和包括待合成的模板图像背景信息的模板图像，对源图像和模板图像执行目标合成操作，得到初始合成图像。其中，初始合成图像包括源图像身份信息和模板图像背景信息以及待修正的部分区域。对源图像、模板图像以及初始合成图像执行目标修正操作，得到目标残差图像，将初始合成图像与目标残差图像合成为目标合成图像。本申请可以应用于包括但不限于基于人工智能的图像处理领域等，本申请解决了由于待合成的模板图像中对象具有大姿态或对象被遮挡，导致合成的图像效果较差，不够自然的技术问题。

Description

图像合成方法和装置、存储介质及电子设备
本申请要求2022年11月14日提交的申请号为202211422368.8、发明名称为“图像合成方法和装置、存储介质及电子设备”的中国专利申请的优先权。
技术领域
本申请涉及计算机领域,具体而言,涉及一种图像合成方法和装置、存储介质及电子设备。
背景技术
目前,图像合成具有非常多的应用场景,但是常规的图像合成方式是输入包含源图像身份信息(例如,眼睛信息、鼻子信息等源图像身份信息)的源图像和包含模板图像背景信息(例如,面部角度、面部表情等模板图像背景信息)的模板图像,以输出能保持模板图像中的模板图像背景信息,同时与源图像包含的源图像身份信息尽可能的相似的合成图像。目前的大多数图像合成算法,例如,当模板图像中的对象具有较大的姿态或模板图像的对象被遮挡时,合成图像的效果不佳。
发明内容
本申请实施例提供了一种图像合成方法和装置、存储介质及电子设备,以至少解决由于待合成的模板图像中对象具有大姿态或对象被遮挡,导致合成的图像效果较差,不够自然的技术问题。
根据本申请实施例的一个方面,提供了一种图像合成方法,包括:获取待处理的源图像和模板图像,其中,所述源图像包括待合成的源图像身份信息,所述模板图像包括待合成的模板图像背景信息;对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像,其中,所述初始合成图像包括所述源图像身份信息和所述模板图像背景信息以及待修正的部分区域;对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像;将所述初始合成图像与所述目标残差图像合成为目标合成图像,其中,所述目标合成图像包括所述源图像身份信息和所述模板图像背景信息以及根据所述目标残差图像修正后的所述部分区域。
根据本申请实施例的另一方面,还提供了一种图像合成装置,包括:获取模块,用于获取待处理的源图像和模板图像,其中,所述源图像包括待合成的源图像身份信息,所述模板图像包括待合成的模板图像背景信息;第一合成模块,用于对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像,其中,所述初始合成图像包括所述源图像身份信息和所述模板图像背景信息以及待修正的部分区域;修正模块,用于对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像;第二合成模块,用于将所述初始合成图像与所述目标残差图像合成为目标合成图像,其中,所述目标合成图像包括所述源图像身份信息和所述模板图像背景信息以及根据所述目标残差图像修正后的所述部分区域。
根据本申请实施例的又一方面,提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器 从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行如以上图像合成方法。
根据本申请实施例的又一方面,还提供了一种电子设备,包括存储器和处理器,上述存储器中存储有计算机程序,上述处理器被设置为通过所述计算机程序执行上述的图像合成方法。
在本申请实施例中,达到了在有大姿态和遮挡下依旧能够取得鲁棒的图像合成结果,满足了一些难度较大的场景下的换脸需求,从而实现了优化了大姿态或对象被遮挡的情况下,合成图像的效果,使得合成的图像更加鲁棒,也更加自然的技术效果,进而解决了由于待合成的模板图像中对象具有大姿态或对象被遮挡,导致合成的图像效果较差,不够自然的技术问题。
附图说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1是根据本申请实施例的图像合成方法的应用环境的示意图;
图2是根据本申请实施例的图像合成方法的流程图;
图3是根据本申请实施例的三维空间中目标对象的姿态示意图;
图4是根据本申请实施例的具有目标姿态的目标对象示意图;
图5是根据本申请实施例的目标对象被遮挡的示意图;
图6是根据本申请实施例的一种图像合成方法示意图;
图7是根据本申请实施例的又一种图像合成方法示意图;
图8是根据本申请实施例的再一种图像合成方法示意图;
图9是根据本申请实施例的另一种图像合成方法的示意图;
图10是根据本申请实施例的一种图像合成装置的结构示意图;
图11是根据本申请实施例的一种图像合成产品的结构示意图;
图12是根据本申请实施例的一种电子设备的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地 列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
首先,在对本申请实施例进行描述的过程中出现的部分名词或者术语适用于如下解释:
生成对抗网络(Generative Adversarial Network,简称GAN):非监督式学习的一种方法,通过让两个神经网络相互博弈的方式进行学习,由一个生成网络与一个判别网络组成。生成网络从潜在空间(latent space)中随机取样作为输入,其输出结果需要尽量模仿训练集中的真实样本。判别网络的输入则为真实样本或生成网络的输出,其目的是将生成网络的输出从真实样本中尽可能分辨出来。而生成网络则要尽可能地欺骗判别网络。两个网络相互对抗、不断调整参数,最终生成以假乱真的图片。
视频换脸:换脸的定义是将输入的源图source换到模板人脸中template上,并使输出人脸fake保持模板人脸的表情、角度、背景等信息。
GT:Ground truth,真值。
下面结合实施例对本申请进行说明:
根据本申请实施例的一个方面,提供了一种图像合成方法。在本申请实施例中,上述图像合成方法可以应用于如图1所示的由服务器101和终端设备103所构成的硬件环境中。如图1所示,服务器101通过网络与终端设备103进行连接,可用于为终端设备或终端设备上安装的应用程序提供服务。应用程序可以是视频应用程序、即时通信应用程序、浏览器应用程序、教育应用程序、游戏应用程序等。可在服务器101上或独立于服务器101设置数据库105,用于为服务器101提供数据存储服务,例如,游戏数据存储服务器。上述网络可以包括但不限于:有线网络,无线网络,其中,有线网络包括:局域网、城域网和广域网,无线网络包括:蓝牙、WIFI及其他实现无线通信的网络。终端设备103可以是配置有应用程序的终端,可以包括但不限于以下至少之一:手机(如Android手机、iOS手机等)、笔记本电脑、平板电脑、掌上电脑、MID(Mobile Internet Devices,移动互联网设备)、PAD、台式电脑、智能电视、智能语音交互设备、智能家电、车载终端、飞行器等计算机设备。上述服务器101可以是单一服务器,也可以是由多个服务器组成的服务器集群,或者是云服务器。
结合图1所示,上述图像合成方法可以在终端设备103通过如下步骤实现:
S1,在终端设备103上获取待处理的源图像和模板图像,其中,源图像包括待合成的源图像身份信息,模板图像包括待合成的模板图像背景信息。模板图像中的目标对象具有目标姿态或者目标对象被遮挡。
S2,在终端设备103上对源图像和模板图像执行目标合成操作,得到初始合成图像,其中,初始合成图像包括源图像身份信息和模板图像背景信息以及待修正的部分区域;
S3,在终端设备103上对源图像、模板图像以及初始合成图像执行目标修正操作,得到目标残差图像,其中,目标残差图像用于修正部分区域;
S4,在终端设备103上将初始合成图像与目标残差图像合成为目标合成图像,其中,目标合成图像包括源图像身份信息和模板图像背景信息以及根据目标残差图像修正后的部分区域。
在本申请实施例中,上述图像合成方法还可以通过服务器实现,例如,图1所示的服务器101 中实现;或由终端设备和服务器共同实现。
上述仅是一种示例,本申请实施例不做具体的限定。
作为一种示例实施方式,如图2所示,上述图像合成方法包括:
S202,获取待处理的源图像和模板图像,其中,源图像包括待合成的源图像身份信息,模板图像包括待合成的模板图像背景信息,模板图像中的目标对象具有目标姿态或者目标对象被遮挡。
在本申请实施例中,上述源图像可以包括但不限于需要使用身份信息的图像,上述身份信息可以包括但不限于图像中的面部特征、五官特征等,上述模板图像可以包括但不限于需要使用背景信息的图像,上述背景信息可以包括但不限于图像中的表情特征、角度特征、背景特征等。
在本申请实施例中,上述目标对象可以包括但不限于模板图像中包括的人物、动物、游戏角色、影视角色、虚拟形象等,上述目标姿态可以理解为上述目标对象在模板图像中的姿态属于大姿态,上述大姿态可以理解为偏航角大于预设角度值的姿态。
示例性地,图3是根据本申请实施例的三维空间中目标对象的姿态示意图。如图3所示,三维空间中目标对象的人脸(或头部)运动(例如沿着世界坐标系的x轴、y轴、z轴)主要有三种角度,分别为yaw(偏航角)、pitch(俯仰角)、roll(翻滚角),一般称之为人脸姿态角度。这三种角度对应三种情况,对应的是人脸左右转动(沿着y轴转动)、上下转动(沿着x轴转动)、侧面转动(沿着z轴转动)。
在一个示例性的实施例中,图4是根据本申请实施例的具有目标姿态的目标对象示意图。如图4所示,当模板图像中目标对象的人脸偏航角(例如,90°)大于预设角度值(例如,45°)时,则可理解为该模板图像中目标对象具有目标姿态。
在本申请实施例中,上述目标对象被遮挡可以理解为目标对象的面部包括某个物体,该物体遮挡了目标对象面部的部分区域。
示例性地,图5是根据本申请实施例的目标对象被遮挡的示意图。如图5所示,三维空间中人脸对象佩戴了口罩,由于口罩遮挡了除眼睛之外的其他面部区域,此时,可以理解为上述目标对象被口罩遮挡,也即,上述目标对象被遮挡了。
S204,对源图像和模板图像执行目标合成操作,得到初始合成图像,其中,初始合成图像包括源图像身份信息和模板图像背景信息以及待修正的部分区域。
在本申请实施例中,上述目标合成操作可以包括但不限于将源图像和模板图像输入预训练的图像合成模型中进行合成。上述初始合成图像是效果有待改善的合成图像。其中,上述待修正部分区域可以包括但不限于上述源图像与模板图像的结合区域出现了虚化或重影等显示异常的区域。
示例性地,图6是根据本申请实施例的一种图像合成方法示意图。如图6所示,将源图像和模板图像输入合成网络(图像合成模型)中,得到初始合成图像,其中,初始合成图像的鼻子部分出现了重影,该区域(鼻子部分)即为上述待修正的部分区域。
S206,对源图像、模板图像以及初始合成图像执行目标修正操作,得到目标残差图像。目标残差图像用于修正待修正的部分区域。
在本申请实施例中,上述目标修正操作可以包括但不限于将源图像、模板图像以及初始合成图 像输入预训练的图像修正模型中进行合成。上述初始合成图像是效果有待改善的合成图像,其中,上述部分区域可以包括但不限于上述源图像与模板图像的结合区域出现了虚化或重影等待修正的区域。修正部分区域(或修正待修正的部分区域)可以理解为将上述待修正的部分区域根据目标残差图像进行修正。
在本申请实施例中,上述目标残差图像可以包括但不限于由上述图像修正模型输出的图像,用于修正上述待修正的部分区域。
示例性地,图7是根据本申请实施例的又一种图像合成方法的示意图,如图7所示,将源图像、模板图像以及初始合成图像输入修正网络中,得到目标残差图像,其中,初始合成图像的鼻子部分出现了重影,该区域即为上述待修正的部分区域,通过上述修正网络,可以生成修正该部分区域的目标残差图像。
S208,将初始合成图像与目标残差图像合成为目标合成图像,其中,目标合成图像包括源图像身份信息和模板图像背景信息以及根据目标残差图像修正后的部分区域。
在本申请实施例中,上述将初始合成图像与目标残差图像合成为目标合成图像可以包括但不限于叠加操作。该叠加操作将初始合成图像与目标残差图像中各个像素的像素值叠加,得到上述目标合成图像。
上述目标合成图像既包括了源图像身份信息,还包括了模板图像背景信息。同时,对于初始合成图像中待修正的部分区域,在目标合成图像中显示正常,显示为修正后的部分区域。
在一个示例性的实施例中,上述图像合成方法可以包括但不限于基于人工智能的方式实现。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
示例性地,可以包括但不限于使用生成对抗网络执行上述目标合成操作和上述目标修正操作。 生成对抗网络(GAN,Generative Adversarial Networks)是一种深度学习模型,是近年来复杂分布上无监督学习最具前景的方法之一。生成对抗网络通过其框架中(至少)两个模块:生成模型(Generative Model)和判别模型(Discriminative Model)的互相博弈学习产生相当好的输出。
再比如,通过将源图像和模板图像输入一个生成对抗网络执行目标合成操作,得到初始合成图像,再将源图像、模板图像以及初始合成图像输入另一个生成对抗网络执行目标修正操作,得到目标残差图像,最终,将初始合成图像和目标残差图像合成为上述目标合成图像。
现有的方法无法优化大姿态和遮挡下的合成图像的效果,易出现双层脸或存在遮挡时,合成图像身份信息发生抖动的问题,进而导致合成的图像效果较差。而通过本申请实施例,达到了在有大姿态和遮挡下依旧能够取得鲁棒的图像合成结果,满足了一些难度较大的场景下的换脸需求,从而实现了优化了大姿态或对象被遮挡的情况下,合成图像的效果,使得合成的图像更加鲁棒,也更加自然的技术效果,进而解决了由于待合成的模板图像中对象具有大姿态或对象被遮挡,导致合成的图像效果较差,不够自然的技术问题。
作为一种示例的方案,对源图像和模板图像执行目标合成操作,得到初始合成图像,包括:将源图像和模板图像共同输入目标合成网络,得到初始合成图像。其中,在目标合成网络中通过如下方式得到初始合成图像:对源图像和模板图像执行拼接操作,得到第一拼接图像,其中,第一拼接图像的通道数是源图像和模板图像的通道数之和;对第一拼接图像执行编码操作,得到通道数逐渐增加的第一中间层特征信息;对第一中间层特征信息执行解码操作,得到通道数逐渐减少的初始合成图像,其中,初始合成图像的通道数与源图像的通道数相同。
在本申请实施例中,上述拼接操作可以包括但不限于对源图像和模板图像分别执行特征提取操作,再将提取到的特征图进行叠加,得到上述第一拼接图像。
例如,假设源图像和模板图像的RGB通道数均为3,尺寸均为512*512像素,则将源图像和模板图像拼接后,得到512*512*6维度的第一拼接图像。
在本申请实施例中,上述编码操作可以包括但不限于对第一拼接图像执行卷积操作,上述解码操作可以包括但不限于对第一拼接图像执行反卷积操作。编码操作例如用于将输入表示为特征向量(进行特征提取)。解码操作例如用于将特征向量表示为输出(包括进行分类)。
例如,输入的第一拼接图像为512*512*6维度,将其逐步编码为256*256*32维度,128*128*64维度,64*64*128维度,32*32*256维度,以此类推,得到上述第一中间层特征信息。将第一中间层特征信息送入解码器,解码器主要是进行反卷积运算,将分辨率逐渐增倍,将第一中间层特征信息解码为32*32*256维度,64*64*128维度,128*128*64维度,256*256*32维度,512*512*3维度,最终得到初始合成图像。
作为一种示例的方案,上述方法还包括:对第一合成网络进行训练得到目标合成网络。其中,通过如下方式对第一合成网络进行训练得到目标合成网络:获取第一样本源图像、第一样本模板图像以及标签图像,其中,标签图像是期望第一样本源图像与第一样本模板图像合成后得到的预先确定的图像;对第一样本源图像和第一样本模板图像执行拼接操作,得到第一样本拼接图像,其中,第一样本拼接图像的通道数是第一样本源图像和第一样本模板图像的通道数之和;通过所述 第一合成网络对第一样本拼接图像执行编码操作,得到通道数逐渐增加的第一样本中间层特征信息;通过所述第一合成网络对第一样本中间层特征信息执行解码操作,得到通道数逐渐减少的第一样本初始合成图像,其中,第一样本初始合成图像的通道数与第一样本源图像的通道数相同;根据第一样本初始合成图像、第一样本源图像以及标签图像计算第一合成网络的第一目标损失值;在第一目标损失值符合第一损失条件的情况下,将第一合成网络确定为目标合成网络。
在本申请实施例中,上述拼接操作可以包括但不限于对第一样本源图像和第一样本模板图像分别执行特征提取操作,再将提取到的特征图进行叠加,得到上述第一样本拼接图像。
例如,第一样本源图像和第一样本模板图像的RGB通道数均为3,尺寸均为512*512像素,则将第一样本源图像和第一样本模板图像拼接后,得到512*512*6维度的第一样本拼接图像。
在本申请实施例中,上述编码操作可以包括但不限于对第一样本拼接图像执行卷积操作,上述解码操作可以包括但不限于对第一样本拼接图像执行反卷积操作。
例如,输入的第一样本拼接图像为512*512*6维度,逐步编码为256*256*32维度,128*128*64维度,64*64*128维度,32*32*256维度,以此类推,得到上述第一样本中间层特征信息,并将第一样本中间层特征信息送入解码器。解码器主要是进行反卷积运算,将分辨率逐渐增倍,将第一中间层特征信息解码为32*32*256维度,64*64*128维度,128*128*64维度,256*256*32维度,512*512*3维度,最终得到样本初始合成结果。
在本申请实施例中,上述第一目标损失值可以包括但不限于第一合成网络的整体损失值。上述第一损失条件可以是预设的损失条件,例如,第一目标损失值小于第一预设值。此时,将第一合成网络确定为目标合成网络。当第一目标损失值未符合第一损失条件的情况下,调整第一合成网络的参数,直至第一目标损失值符合第一损失条件。
作为一种示例的方案,根据第一样本初始合成图像、第一样本源图像以及标签图像计算第一合成网络的第一目标损失值,包括:利用预训练的特征提取模块对所述标签图像执行特征提取操作,以提取所述标签图像不同层级的特征信息,得到第一组样本特征图,其中,所述第一组样本特征图中每个样本特征图对应一个层级的从所述标签图像提取到的特征信息;利用所述特征提取模块对所述第一样本初始合成图像执行所述特征提取操作,以提取所述第一样本初始合成图像不同层级的特征信息,得到第二组样本特征图,其中,所述第二组样本特征图中每个样本特征图对应一个层级的从所述第一样本初始合成图像提取到的特征信息;根据所述第一组样本特征图和所述第二组样本特征图计算第一损失值,其中,所述第一损失值由不同层级下各个层级的从所述标签图像提取到的特征信息和从所述第一样本初始合成图像提取到的特征信息计算得到;将所述第一损失值与所述第一合成网络的重建损失值共同确定为所述第一目标损失值,其中,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
在本申请实施例中,上述标签图像是训练过程中预先确定的图像,该图像是当前训练过程的目标,可以由人工方式预先生成,例如,通过人工标注预先生成的真值图像。
在本申请实施例中,上述特征提取模块可以包括但不限于预训练好的alexnet网络,用于提取图像在不同层的特征,计算LPIPS(Learned Perceptual Image Patch Similarity,学习感知图像块相 似度)损失。
示例性地,图8是根据本申请实施例的再一种图像合成方法示意图。如图8所示,在深度网络模型中,低层的特征能表示线条,颜色等低级特征,高层的特征能表示部件等高级特征。因此可以通过alexnet提取的特征来衡量不同图像的相似程度。
例如,对标签图像执行特征提取操作,得到的第一组样本特征图,具体而言,可以包括但不限于如下4个层级的样本特征图:gt_img_fea1,gt_img_fea2,gt_img_fea3,gt_img_fea4=alexnet_feature(gt)。例如,对第一样本初始合成图像执行特征提取操作,得到第二组样本特征图可以包括但不限于如下4个层级的样本特征图:result_fea1,result_fea2,result_fea3,result_fea4=alexnet_feature(fake1)。
通过如下方式根据第一组样本特征图和第二组样本特征图计算第一损失值:
LPIPS_loss=|result_fea1-gt_img_fea1|+|result_fea2-gt_img_fea2|+|result_fea3-gt_img_fea3|+|result_fea4-gt_img_fea4|。
在本申请实施例中,以上述第一合成网络是生成对抗网络为例,上述第一合成网络的重建损失值可以包括但不限于Reconstruction_loss+D_loss+G_loss构成,其中,Reconstruction_loss(对应于前述的重建损失值),G_loss为生成器的损失值,上述D_loss为判别器的损失值。
作为一种示例的方案,根据第一样本初始合成图像、第一样本源图像以及标签图像计算第一合成网络的第一目标损失值,包括:对第一样本初始合成图像执行识别操作,得到第一样本特征向量,其中,第一样本特征向量用于表示第一样本初始合成图像中的源图像身份信息;对第一样本源图像执行识别操作,得到第二样本特征向量,其中,第二样本特征向量用于表示第一样本源图像中的源图像身份信息;根据第一样本特征向量和第二样本特征向量计算第二损失值,其中,第二损失值表示第一样本特征向量和第二样本特征向量之间的相似度;将第二损失值与第一合成网络的重建损失值共同确定为第一目标损失值,其中,重建损失值为执行编码操作和解码操作的损失值。
在本申请实施例中,上述识别操作可以包括但不限于由人脸识别网络实现,人脸识别网络用于提取人脸特征,这个特征的维度一般是1024维。由于需要合成的图像的身份信息和源图像的身份信息越接近越好,所以提取人脸的特征来约束。
例如,对第一样本初始合成图像提取识别特征,获得fake1_id_features,对第一样本源图像提取识别特征,获得source_id_features,计算ID估计损失(对应于前述的第二损失值),采用cosine相似度(还可以包括但不限于欧式距离),由于期望生成的第一样本初始合成图像和第一样本源图像越像越好,则ID_loss=1-cosine_similarity(fake1_id_features,source_id_features)。Cosine相似度计算如下:
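cosine_similarity(A, B) = Σi(Ai×Bi) / ( √(Σi Ai²) × √(Σi Bi²) )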
其中,Ai和Bi分别表示向量A和向量B的各分量,向量A即为上述第一样本特征向量,向量B 即为上述第二样本特征向量。
作为一种示例的方案,对源图像、模板图像以及初始合成图像执行目标修正操作,得到目标残差图像,包括:将源图像、模板图像以及初始合成图像共同输入目标修正网络,得到目标残差图像。其中,在目标修正网络中通过如下方式得到目标残差图像:对源图像、模板图像以及初始合成图像执行拼接操作,得到第二拼接图像,其中,第二拼接图像的通道数是源图像、模板图像以及初始合成图像的通道数之和;对第二拼接图像执行编码操作,得到通道数逐渐增加的第二中间层特征信息;对第二中间层特征信息执行解码操作,得到通道数逐渐减少的目标残差图像,其中,目标残差图像的通道数与初始合成图像的通道数相同。
在本申请实施例中,上述拼接操作可以包括但不限于对源图像、模板图像以及初始合成图像分别执行特征提取操作,再将提取到的特征图进行叠加,得到上述第二拼接图像。
例如,源图像、模板图像以及初始合成图像的RGB通道数均为3,尺寸均为512*512像素,则将源图像、模板图像以及初始合成图像拼接后,得到512*512*9维度的第二拼接图像。
在本申请实施例中,上述编码操作可以包括但不限于对第二拼接图像执行卷积操作,上述解码操作可以包括但不限于对第二拼接图像执行反卷积操作。
例如,输入的第二拼接图像为512*512*9维度,逐步编码为256*256*18维度,128*128*32维度,64*64*64维度,32*32*128维度,以此类推,得到上述第二中间层特征信息,并将第二中间层特征信息送入解码器,解码器主要是进行反卷积运算,将分辨率逐渐增倍,将第二中间层特征信息解码为32*32*128维度,64*64*64维度,128*128*32维度,256*256*16维度,512*512*9维度,最终得到目标残差图像。
在本申请实施例中,上述第一目标损失值可以包括但不限于第一合成网络的整体损失值。上述第一损失条件可以是预设的损失条件,例如,第一目标损失值小于第一预设值。此时,将第一合成网络确定为目标合成网络。当第一目标损失值未符合第一损失条件的情况下,调整第一合成网络的参数,直至第一目标损失值符合第一损失条件。
作为一种示例的方案,上述方法还包括:对初始修正网络进行训练得到目标修正网络。其中,通过如下方式对初始修正网络进行训练得到目标修正网络:获取第二样本源图像、第二样本模板图像、标签残差图像以及第二样本初始合成图像,其中,第二样本初始合成图像是第二样本源图像与第二样本模板图像执行目标合成操作后得到的图像,标签残差图像根据标签图像和第二样本初始合成图像确定,标签图像是期望第二样本源图像与第二样本模板图像合成后得到的预先确定的图像;对第二样本源图像、第二样本模板图像以及第二样本初始合成图像执行拼接操作,得到第二样本拼接图像,其中,第二样本拼接图像的通道数是第二样本源图像、第二样本模板图像以及第二样本初始合成图像的通道数之和;对第二样本拼接图像执行编码操作,得到通道数逐渐增加的第二样本中间层特征信息;对第二样本中间层特征信息执行解码操作,得到通道数逐渐减少的样本残差图像,其中,样本残差图像的通道数与第二样本初始合成图像的通道数相同;根据第二样本源图像、标签残差图像、第二样本初始合成图像以及样本残差图像计算初始修正网络的第二目标损失值;在第二目标损失值符合第二损失条件的情况下,将初始修正网络确定为目标修正网络。
在本申请实施例中,上述拼接操作可以包括但不限于对第二样本源图像、第二样本模板图像以及第二样本初始合成图像分别执行特征提取操作,再将提取到的特征图进行叠加,得到上述第二样本拼接图像。
例如,第二样本源图像、第二样本模板图像以及第二样本初始合成图像的RGB通道数均为3,尺寸均为512*512像素,则将第二样本源图像、第二样本模板图像以及第二样本初始合成图像拼接后,得到512*512*9维度的第二样本拼接图像。
在本申请实施例中,上述编码操作可以包括但不限于对第二样本拼接图像执行卷积操作,上述解码操作可以包括但不限于对第二样本拼接图像执行反卷积操作。
例如,输入的第二样本拼接图像为512*512*9维度,逐步编码为256*256*18维度,128*128*32维度,64*64*64维度,32*32*128维度,以此类推,得到上述第二样本中间层特征信息,并将第二样本中间层特征信息送入解码器,解码器主要是进行反卷积运算,将分辨率逐渐增倍,将第二样本中间层特征信息解码为32*32*128维度,64*64*64维度,128*128*32维度,256*256*16维度,512*512*9维度,最终得到样本残差图像。
在本申请实施例中,上述第二目标损失值可以包括但不限于初始修正网络的整体损失值,上述第二损失条件可以是预设的损失条件,例如,第二目标损失值小于第二预设值。此时,将初始修正网络确定为目标修正网络,当第二目标损失值未符合第二损失条件的情况下,调整初始修正网络的参数,直至第二目标损失值符合第二损失条件。
在本申请实施例中,上述标签图像是训练过程中预先确定的图像。该图像是当前训练过程的目标,可以由人工方式预先生成,例如,通过人工标注预先生成的真值图像。上述标签残差图像是标签图像与第二样本初始合成图像的差值图像。例如,gt_diff_map(对应于前述的标签残差图像)=gt(对应于前述的标签图像)–fake1(对应于前述的第二样本初始合成图像)。
作为一种示例的方案,根据第二样本源图像、标签残差图像、第二样本初始合成图像以及样本残差图像计算初始修正网络的第二目标损失值,包括:根据样本残差图像和标签残差图像计算第三损失值;将第三损失值与第二合成网络的重建损失值共同确定为第二目标损失值,其中,第二合成网络用于生成第二样本初始合成图像,重建损失值为执行编码操作和解码操作的损失值。
在本申请实施例中,上述根据样本残差图像和标签残差图像计算第三损失值可以包括但不限于Diff_reconstruction_loss,Diff_reconstruction_loss=|gt_diff_map–diff_map|,以使得样本残差图像和标签残差图像之间的差异越小越好,其中,diff_map表示样本残差图像。
在本申请实施例中,以上述初始修正网络是生成对抗网络为例,上述初始修正网络的重建损失值可以包括但不限于Reconstruction_loss+D_loss+G_loss构成,其中,Reconstruction_loss(对应于前述的重建损失值),G_loss为生成器的损失值,上述D_loss为判别器的损失值。
作为一种示例的方案,将第三损失值与第一合成网络的重建损失值共同确定为第二目标损失值,包括:将第二样本初始合成图像与样本残差图像合成为样本目标合成图像;利用预训练的特征提取模块对样本目标合成图像执行特征提取操作,得到第三组样本特征图,其中,特征提取模块用于提取不同层级的特征信息,第三组样本特征图中每个样本特征图对应一个层级的从样本目标合 成图像提取到的特征信息;利用特征提取模块对标签图像执行特征提取操作,得到第一组样本特征图,其中,第一组样本特征图中每个样本特征图对应一个层级的从标签图像提取到的特征信息;根据第三组样本特征图和第一组样本特征图计算第四损失值,其中,第四损失值由不同层级下的对应层级的从样本目标合成图像提取到的特征信息和从标签图像提取到的特征信息计算得到;将第四损失值与第三损失值以及第二合成网络的重建损失值共同确定为第二目标损失值。
在本申请实施例中,上述标签图像是训练过程中预先确定的图像,该图像是当前训练过程的目标,可以由人工方式预先生成,例如,通过人工标注预先生成的真值图像。
在本申请实施例中,上述特征提取模块可以包括但不限于预训练好的alexnet网络,用于提取图像在不同层的特征,计算LPIPS(Learned Perceptual Image Patch Similarity,学习感知图像块相似度)损失。
如图9所示,在深度网络模型中,低层的特征能表示线条,颜色等低级特征,高层的特征能表示部件等高级特征。因此可以通过alexnet提取的特征来衡量不同图像的相似程度。
例如,对标签图像执行特征提取操作,得到的第一组样本特征图,具体而言,可以包括但不限于如下4个层级的样本特征图:gt_img_fea1,gt_img_fea2,gt_img_fea3,gt_img_fea4=alexnet_feature(gt)。例如,对样本目标合成图像执行特征提取操作,得到第三组样本特征图可以包括但不限于如下4个层级的样本特征图:result_fea1,result_fea2,result_fea3,result_fea4=alexnet_feature(fake2)。
通过如下根据第一组样本特征图和第三组样本特征图计算第四损失值:
LPIPS_loss=|result_fea1-gt_img_fea1|+|result_fea2-gt_img_fea2|+|result_fea3-gt_img_fea3|+|result_fea4-gt_img_fea4|。
作为一种示例的方案,将第三损失值与第一合成网络的重建损失值共同确定为第二目标损失值,包括:对样本目标合成图像执行识别操作,得到第三样本特征向量,其中,第三样本特征向量用于表示样本目标合成图像中的源图像身份信息;对第二样本源图像执行识别操作,得到第四样本特征向量,其中,第四样本特征向量用于表示第二样本源图像中的源图像身份信息;根据第三样本特征向量和第四样本特征向量计算第二合成网络的第五损失值,其中,第二目标损失值包括第五损失值,第五损失值表示第三样本特征向量和第四样本特征向量之间的相似度;将第五损失值与第三损失值、第四损失值以及第二合成网络的重建损失值共同确定为第二目标损失值。
在本申请实施例中,上述识别操作可以包括但不限于由人脸识别网络实现,人脸识别网络用于提取人脸特征,这个特征的维度一般是1024维。由于需要合成的图像的身份信息和源图像的身份信息越接近越好,所以提取人脸的特征来约束。
例如,对样本目标合成图像提取识别特征,获得fake2_id_features,对第二样本源图像提取识别特征,获得source_id_features,计算ID估计损失(对应于前述的第五损失值),采用cosine相似度(还可以包括但不限于欧式距离),由于期望生成的样本目标合成图像和第二样本源图像越像越好,则ID_loss=1-cosine_similarity(fake2_id_features,source_id_features)。Cosine相似度计算如下:
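cosine_similarity(A, B) = Σi(Ai×Bi) / ( √(Σi Ai²) × √(Σi Bi²) )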
其中,Ai和Bi分别表示向量A和向量B的各分量,向量A即为上述第三样本特征向量,向量B即为上述第四样本特征向量。
作为一种示例的方案,对源图像和模板图像执行目标合成操作,得到初始合成图像之前,上述方法还包括:对源图像和模板图像分别进行对象检测,得到源图像和模板图像中的对象区域;对对象区域进行配准操作,确定对象区域中的关键点信息,其中,关键点信息用于表示对象区域中的对象;根据关键点信息分别裁剪源图像和模板图像,得到用于执行目标合成操作的源图像和模板图像。
在本申请实施例中,上述对象检测可以包括但不限于对输入的图像进行预处理获得剪裁好的源图像和模板图像。以源图像和模板图像均是人脸图像为例,具体包括:
S1,由于输入的源图像和模板图像中,人脸往往占据一个较小位置,所以需要先进行人脸检测,获得目标人脸区域(对应于前述的对象区域)。
S2,在人脸区域内进行人脸配准,获得人脸的关键点,重点是人的眼睛和嘴角的关键点。人脸配准是一项图像预处理技术,能够定位出人脸五官关键点坐标。五官的关键点数量为预先设定的固定值,可根据不同的语义情形来定义(通常有5点、68点、90点等固定值)。
S3,根据人脸关键点,获得裁剪后的人脸图(对应于前述的用于执行目标合成操作的源图像和模板图像)。
同时,本申请需要两个额外的已经预训练的模型来辅助合成网络的学习,包括但不限于:人脸识别网络、预训练好的alexnet网络。人脸识别网络用于提取人脸特征。这个特征的维度一般是1024维。由于需要生成的人脸(对应于前述的初始合成图像)和source(对应于前述的源图像)的人脸的身份越接近越好,所以提取人脸的特征来约束。预训练好的alexnet网络用于提取图像在不同层的特征,计算LPIPS损失。在深度网络模型中,低层的特征能表示线条、颜色等低级特征,高层的特征能表示部件等高级特征。因此可以通过比较两个图像用alexnet提取的特征来衡量整体的相似程度。
第一阶段合成网络的训练过程如下:
S11,准备换脸数据,包括source(对应于前述的源图像)、template(对应于前述的模板图像)、gt(对应于前述的标签图像)三元组对。
S12,进行第一阶段的换脸。合成网络总体可以分为编码器和解码器两部分。编码器通过卷积计算将输入图像的尺寸不断减半,通道逐渐增加。具体地,比如合成网络输入从512*512*6维度(两张图拼接起来作为输入,每张图的RGB通道数为3),逐步编码为256*256*32维度、128*128*64维度、64*64*128维度、32*32*256维度以此类推。
S13,将S12中的得到的结果送入解码器。解码器主要是进行反卷积运算,将图像分辨率逐渐增倍,将S12中得到的结果解码为32*32*256维度、64*64*128维度、128*128*64维度、256*256* 32维度、512*512*3维度,最终得到换脸结果fake1图。
S14,对fake1图提取识别特征(例如,身份信息,包括但不限于图像中的面部特征、五官特征等),获得fake1_id_features。
S15,对source图提取识别特征(例如,身份信息,包括但不限于图像中的面部特征、五官特征等),获得source_id_features。
S16,计算换脸结果fake1图的特征损失。这个损失函数是在两张图(fake1,gt)的特征级别上计算差异,称为LPIPSLoss。首先,利用预训练好的alexnet网络提取fake1图和gt图不同层的网络的特征,然后比较两张图对应层间的差异。希望这两张图不同层的网络特征差异越小越好。
在上述过程中,特征的提取例如为:
result_fea1,result_fea2,result_fea3,result_fea4=alexnet_feature(fake1);
gt_img_fea1,gt_img_fea2,gt_img_fea3,gt_img_fea4=alexnet_feature(gt)。
LPIPS的计算例如为:
LPIPS_loss=|result_fea1-gt_img_fea1|+|result_fea2-gt_img_fea2|+|result_fea3-gt_img_fea3|+|result_fea4-gt_img_fea4|。
S17,采用cosine相似度计算身份信息(ID)估计损失。希望生成的fake1图和source越像(身份信息越接近)越好。ID_loss=1-cosine_similarity(fake1_id_features,source_id_features)。Cosine相似度计算如下:
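cosine_similarity(A, B) = Σi(Ai×Bi) / ( √(Σi Ai²) × √(Σi Bi²) )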
其中,Ai和Bi分别表示向量A和向量B的各分量,向量A例如为上述fake1_id_features,向量B例如为上述source_id_features。
S18,计算对抗损失。本申请实施例的生成对抗网络还包括判别器网络,用于判断生成的合成图像(换脸结果)是否真实。对抗损失值包括如下内容:
D_loss=-logD(gt)-log(1-D(fake1));
G_loss=log(1-D(fake1));
S19,第一阶段的loss=Reconstruction_loss+LPIPS_loss+ID_loss+D_loss+G_loss,进行合成网络参数的优化。
第二阶段修正网络的训练:
待第一阶段的合成网络训练完成后,开始第二阶段的修正网络的训练。修正网络的结构例如和合成网络结构类似。
S21,修正网络要学习的目标是一个残差图gt_diff_map(对应于前述的标签残差图像)=gt-fake1。
S22,将source、template和第一阶段的换脸结果fake1图作为输入,送入修正网络。
S23,同关于合成网络的S12以及S13的描述类似,修正网络也可以是包括编码器和解码器的 结构,但其输出为残差图diff_map。
S24,最终的换脸结果fake2=fake1+diff_map。fake2图是通过残差图diff_map修正对fake1图修正的结果。
S25,修正网络新增残差图的重构损失,为标签残差图像与样本残差图像的差异,两者差异越小越好。Diff_reconstruction_loss=|gt_diff_map-diff_map|。
S26,为计算修正网络的其他损失函数,可以与第一阶段的合成网络的loss函数类似,将输入fake1图替换为fake2图即可。
S27,最终第二阶段的损失函数loss=Diff_reconstruction_loss+Reconstruction_loss+LPIPS_loss+ID_loss+D_loss+G_loss,以便于进行模型参数的优化。
在一个示例性的实施例中,可以包括但不限于如下示例性步骤:
S31,将任意的source图和template图进行测试,送入合成网络,获得fake1;
S32,将Source图、template图和fake1图送入修正网络,获得diff_map;
S33,最终换脸结果fake2=fake1+diff_map。
具体应用过程包括如下步骤:
(1)视频采集->(2)图像输入->(3)人脸检测->(4)人脸区域的裁剪->(5)进行2阶段换脸->(6)结果展示。
其中,(5)为本技术的方法在实际使用过程中,需要训练的模块,以和其他模块进行合作交互。首先需要从视频采集模块中接收图像输入,然后进行人脸检测,并且裁剪出人脸区域。之后进入本技术的方法,进行换脸。最后进行结果的展示。
通过本申请实施例,可以在大姿态下能够消除双层脸,保持合成图像较好的效果,还可以在遮挡下依旧能够保持视频稳定的换脸效果。
在本申请的具体实施方式中,涉及到用户信息等相关的数据,当本申请以上实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。除此之外,本申请所有实施例的描述可以互为补充和结合。
根据本申请实施例的另一个方面,还提供了一种用于实施上述图像合成方法的图像合成装置。如图10所示,该装置包括以下模块。
获取模块1002用于获取待处理的源图像和模板图像,其中,所述源图像包括待合成的源图像身份信息,所述模板图像包括待合成的模板图像背景信息。所述模板图像中的目标对象可以具有目标姿态或者所述目标对象被遮挡。
第一合成模块1004用于对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像。 所述初始合成图像包括所述源图像身份信息和所述模板图像背景信息以及待修正的部分区域。
修正模块1006用于对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像。所述目标残差图像用于修正所述部分区域。
第二合成模块1008用于将所述初始合成图像与所述目标残差图像合成为目标合成图像。所述目标合成图像包括所述源图像身份信息和所述模板图像背景信息以及根据所述目标残差图像修正后的所述部分区域。
作为一种示例的方案,所述装置用于通过如下方式对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像:将所述源图像和所述模板图像共同输入目标合成网络,得到所述初始合成图像;其中,在所述目标合成网络中通过如下方式得到所述初始合成图像:对所述源图像和所述模板图像执行拼接操作,得到第一拼接图像,其中,所述第一拼接图像的通道数是所述源图像和所述模板图像的通道数之和;对所述第一拼接图像执行编码操作,得到所述通道数逐渐增加的第一中间层特征信息;对所述第一中间层特征信息执行解码操作,得到所述通道数逐渐减少的所述初始合成图像,其中,所述初始合成图像的所述通道数与所述源图像的所述通道数相同。
作为一种示例的方案,所述装置还用于:对第一合成网络进行训练得到所述目标合成网络,其中,通过如下方式对所述第一合成网络进行训练得到所述目标合成网络:获取第一样本源图像、第一样本模板图像以及标签图像,其中,所述标签图像是期望所述第一样本源图像与所述第一样本模板图像合成后得到的预先确定的图像;对所述第一样本源图像和所述第一样本模板图像执行拼接操作,得到第一样本拼接图像,其中,所述第一样本拼接图像的通道数是所述第一样本源图像和所述第一样本模板图像的通道数之和;通过所述第一合成网络对所述第一样本拼接图像执行编码操作,得到通道数逐渐增加的第一样本中间层特征信息;通过所述第一合成网络对所述第一样本中间层特征信息执行解码操作,得到通道数逐渐减少的所述第一样本初始合成图像,其中,所述第一样本初始合成图像的通道数与所述第一样本源图像的通道数相同;根据所述第一样本初始合成图像、所述第一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值;在所述第一目标损失值符合第一损失条件的情况下,将所述第一合成网络确定为所述目标合成网络。
作为一种示例的方案,所述装置用于通过如下方式根据所述第一样本初始合成图像、所述第一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值,包括:利用预训练的特征提取模块对所述标签图像执行特征提取操作,提取所述标签图像不同层级的特征信息,得到第一组样本特征图,其中,所述第一组样本特征图中每个样本特征图对应一个层级的从所述标签图像提取到的特征信息;利用所述特征提取模块对所述第一样本初始合成图像执行所述特征提取操作,以提取所述第一样本初始合成图像不同层级的特征信息,得到第二组样本特征图,其中,所述第二组样本特征图中每个样本特征图对应一个层级的从所述第一样本初始合成图像提取到的特征信息;根据所述第一组样本特征图和所述第二组样本特征图计算第一损失值,其中,所述第一损失值由不同层级下的各个层级的从所述标签图像提取到的特征信息和从所述第一样本初始合成图像提取到的特征信息计算得到;将所述第一损失值与所述第一合成网络的重建损失值共同确定为所述第一目标损失值,其中,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
作为一种示例的方案,所述装置用于通过如下方式根据所述第一样本初始合成图像、所述第一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值:对所述第一样本初始合成图像执行识别操作,得到第一样本特征向量,其中,所述第一样本特征向量用于表示所述第一样本初始合成图像中的源图像身份信息;对所述第一样本源图像执行所述识别操作,得到第二样本特征向量,其中,所述第二样本特征向量用于表示所述第一样本源图像中的源图像身份信息;根据所述第一样本特征向量和所述第二样本特征向量计算第二损失值,其中,所述第二损失值表示所述第一样本特征向量和所述第二样本特征向量之间的相似度;将所述第二损失值与所述第一合成网络的重建损失值共同确定为所述第一目标损失值,其中,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
作为一种示例的方案,所述装置用于通过如下方式对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像:将所述源图像、所述模板图像以及所述初始合成图像共同输入目标修正网络,得到所述目标残差图像;其中,在所述目标修正网络中通过如下方式得到所述目标残差图像:对所述源图像、所述模板图像以及所述初始合成图像执行拼接操作,得到第二拼接图像,其中,所述第二拼接图像的通道数是所述源图像、所述模板图像以及所述初始合成图像的通道数之和;对所述第二拼接图像执行编码操作,得到所述通道数逐渐增加的第二中间层特征信息;对所述第二中间层特征信息执行解码操作,得到所述通道数逐渐减少的所述目标残差图像,其中,所述目标残差图像的所述通道数与所述初始合成图像的所述通道数相同。
作为一种示例的方案,所述装置还用于:对初始修正网络进行训练得到所述目标修正网络,其中,通过如下方式对所述初始修正网络进行训练得到所述目标修正网络:获取第二样本源图像、第二样本模板图像、标签残差图像以及第二样本初始合成图像,其中,所述第二样本初始合成图像是所述第二样本源图像与所述第二样本模板图像执行所述目标合成操作后得到的图像,所述标签残差图像根据标签图像和所述第二样本初始合成图像确定,所述标签图像是期望所述第二样本源图像与所述第二样本模板图像合成后得到的预先确定的图像;对所述第二样本源图像、所述第二样本模板图像以及所述第二样本初始合成图像执行拼接操作,得到第二样本拼接图像,其中,所述第二样本拼接图像的通道数是所述第二样本源图像、所述第二样本模板图像以及所述第二样本初始合成图像的通道数之和;对所述第二样本拼接图像执行编码操作,得到通道数逐渐增加的第二样本中间层特征信息;对所述第二样本中间层特征信息执行解码操作,得到通道数逐渐减少的所述样本残差图像,其中,所述样本残差图像的通道数与所述第二样本初始合成图像的通道数相同;根据所述第二样本源图像、所述标签残差图像、所述第二样本初始合成图像以及所述样本残差图像计算所述初始修正网络的第二目标损失值;在所述第二目标损失值符合第二损失条件的情况下,将所述初始修正网络确定为所述目标修正网络。
作为一种示例的方案,所述装置用于通过如下方式根据所述第二样本源图像、所述标签残差图像、所述第二样本初始合成图像以及所述样本残差图像计算所述初始修正网络的第二目标损失值:根据所述样本残差图像和所述标签残差图像计算第三损失值;将所述第三损失值与第二合成网络的重建损失值共同确定为所述第二目标损失值,其中,所述第二合成网络用于生成所述第二样本 初始合成图像,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
作为一种示例的方案,所述装置用于通过如下方式将所述第三损失值与第一合成网络的重建损失值共同确定为所述第二目标损失值:将所述第二样本初始合成图像与所述样本残差图像合成为样本目标合成图像;利用预训练的特征提取模块对所述样本目标合成图像执行特征提取操作,得到第三组样本特征图,其中,所述特征提取模块用于提取不同层级的特征信息,所述第三组样本特征图中每个样本特征图对应一个层级的从所述样本目标合成图像提取到的特征信息;利用所述特征提取模块对所述标签图像执行所述特征提取操作,得到第一组样本特征图,其中,所述第一组样本特征图中每个样本特征图对应一个层级的从所述标签图像提取到的特征信息;根据所述第三组样本特征图和所述第一组样本特征图计算第四损失值,其中,所述第四损失值由不同层级下的对应层级的从所述样本目标合成图像提取到的特征信息和从所述标签图像提取到的特征信息计算得到;将所述第四损失值与所述第三损失值以及所述第二合成网络的重建损失值共同确定为所述第二目标损失值。
作为一种示例的方案,所述装置用于通过如下方式将所述第三损失值与第一合成网络的重建损失值共同确定为所述第二目标损失值:对所述样本目标合成图像执行识别操作,得到第三样本特征向量,其中,所述第三样本特征向量用于表示所述样本目标合成图像中的源图像身份信息;对所述第二样本源图像执行所述识别操作,得到第四样本特征向量,其中,所述第四样本特征向量用于表示所述第二样本源图像中的源图像身份信息;根据所述第三样本特征向量和所述第四样本特征向量计算所述第二合成网络的第五损失值,其中,所述第二目标损失值包括所述第五损失值,所述第五损失值表示所述第三样本特征向量和所述第四样本特征向量之间的相似度;将所述第五损失值与所述第三损失值、所述第四损失值以及所述第二合成网络的重建损失值共同确定为所述第二目标损失值。
作为一种示例的方案,所述装置还用于:所述对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像之前,对所述源图像和所述模板图像分别进行对象检测,得到所述源图像和所述模板图像中的对象区域;对所述对象区域进行配准操作,确定所述对象区域中的关键点信息,其中,所述关键点信息用于表示所述对象区域中的对象;根据所述关键点信息分别裁剪所述源图像和所述模板图像,得到用于执行所述目标合成操作的所述源图像和所述模板图像。
图11示意性地示出了用于实现本申请实施例的电子设备的计算机系统结构框图。
图11示出的电子设备的计算机系统1100仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图11所示,计算机系统1100包括中央处理器1101(Central Processing Unit,CPU),其可以根据存储在只读存储器1102(Read-Only Memory,ROM)中的程序或者从存储部分1108加载到随机访问存储器1103(Random Access Memory,RAM)中的程序而执行各种适当的动作和处理。在随机访问存储器1103中,还存储有系统操作所需的各种程序和数据。中央处理器1101、在只读存储器1102以及随机访问存储器1103通过总线1104彼此相连。输入/输出接口1105(Input/Output接口,即I/O接口)也连接至总线1104。
以下部件连接至输入/输出接口1105:包括键盘、鼠标等的输入部分1106;包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分1107;包括硬盘等的存储部分1108;以及包括诸如局域网卡、调制解调器等的网络接口卡的通信部分1109。通信部分1109经由诸如因特网的网络执行通信处理。驱动器1110也根据需要连接至输入/输出接口1105。可拆卸介质1111,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1110上,以便于从其上读出的计算机程序根据需要被安装入存储部分1108。
特别地,根据本申请的实施例,各个方法流程图中所描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1109从网络上被下载和安装,和/或从可拆卸介质1111被安装。在该计算机程序被中央处理器1101执行时,执行本申请的系统中限定的各种功能。
根据本申请实施例的又一个方面,还提供了一种用于实施上述图像合成方法的电子设备,该电子设备可以是图1所示的终端设备或服务器。本申请实施例以该电子设备为终端设备为例来说明。如图12所示,该电子设备包括存储器1202和处理器1204,该存储器1202中存储有计算机程序,该处理器1204被设置为通过计算机程序执行上述任一项方法实施例中的步骤。
在本申请实施例中,上述电子设备可以位于计算机网络的多个网络设备中的至少一个网络设备。
在本申请实施例中,上述处理器可以被设置为通过计算机程序执行以下步骤:
获取待处理的源图像和模板图像,其中,源图像包括待合成的源图像身份信息,模板图像包括待合成的模板图像背景信息;
对源图像和模板图像执行目标合成操作,得到初始合成图像,其中,初始合成图像包括源图像身份信息和模板图像背景信息以及待修正的部分区域;
对源图像、模板图像以及初始合成图像执行目标修正操作,得到目标残差图像;
将初始合成图像与目标残差图像合成为目标合成图像,其中,目标合成图像包括源图像身份信息和模板图像背景信息以及根据所述目标残差图像修正后的部分区域。
图12所示的结构仅为示意,电子装置电子设备也可以是智能手机(如Android手机、iOS手机等)、平板电脑、掌上电脑以及移动互联网设备(Mobile Internet Devices,MID)、PAD等终端设备。图12其并不对上述电子装置电子设备的结构造成限定。例如,电子装置电子设备还可包括比图12中所示更多或者更少的组件(如网络接口等),或者具有与图12所示不同的配置。
其中,存储器1202可用于存储软件程序以及模块,如本申请实施例中的图像合成方法和装置对应的程序指令/模块,处理器1204通过运行存储在存储器1202内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的图像合成方法。存储器1202可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器1202可进一步包括相对于处理器1204远程设置的存储器,这些远程存储器可以通过网络连接至终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移 动通信网及其组合。其中,存储器1202具体可以但不限于用于存储合成图像等信息。作为一种示例,如图12所示,上述存储器1202中可以但不限于包括上述图像合成装置中的获取模块1002、第一合成模块1004、修正模块1006以及第二合成模块1008。此外,还可以包括但不限于上述图像合成装置中的其他模块单元,本示例中不再赘述。
上述的传输装置1206用于经由一个网络接收或者发送数据。上述的网络具体实例可包括有线网络及无线网络。在一个实例中,传输装置1206包括一个网络适配器(Network Interface Controller,NIC),其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置1206为射频(Radio Frequency,RF)模块,其用于通过无线方式与互联网进行通讯。
此外,上述电子设备还包括:显示器1208,用于显示上述目标合成图像;和连接总线1210,用于连接上述电子设备中的各个模块部件。
在其他实施例中,上述终端设备或者服务器可以是一个分布式系统中的一个节点,其中,该分布式系统可以为区块链系统,该区块链系统可以是由该多个节点通过网络通信的形式连接形成的分布式系统。其中,节点之间可以组成点对点(P2P,Peer To Peer)网络,任意形式的计算设备,比如服务器、终端等电子设备都可以通过加入该点对点网络而成为该区块链系统中的一个节点。
根据本申请的一个方面,提供了一种计算机可读存储介质,计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述图像合成方面的各种示例实现方式中提供的图像合成方法。
在本申请实施例中,上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:闪存盘、只读存储器(Read-Only Memory,ROM)、随机存取器(Random Access Memory,RAM)、磁盘或光盘等。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。
在本申请的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的客户端,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上所述仅是本申请的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。

Claims (15)

  1. 一种图像合成方法,其特征在于,包括:
    获取待处理的源图像和模板图像,其中,所述源图像包括待合成的源图像身份信息,所述模板图像包括待合成的模板图像背景信息;
    对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像,其中,所述初始合成图像包括所述源图像身份信息和所述模板图像背景信息以及待修正的部分区域;
    对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像;
    将所述初始合成图像与所述目标残差图像合成为目标合成图像,其中,所述目标合成图像包括所述源图像身份信息和所述模板图像背景信息以及根据所述目标残差图像修正后的所述部分区域。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像,包括:
    将所述源图像和所述模板图像共同输入目标合成网络,得到所述初始合成图像;
    其中,在所述目标合成网络中通过如下方式得到所述初始合成图像:
    对所述源图像和所述模板图像执行拼接操作,得到第一拼接图像,其中,所述第一拼接图像的通道数是所述源图像和所述模板图像的通道数之和;
    对所述第一拼接图像执行编码操作,得到所述通道数逐渐增加的第一中间层特征信息;
    对所述第一中间层特征信息执行解码操作,得到所述通道数逐渐减少的所述初始合成图像,其中,所述初始合成图像的所述通道数与所述源图像的所述通道数相同。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    对第一合成网络进行训练得到所述目标合成网络,其中,通过如下方式对所述第一合成网络进行训练得到所述目标合成网络:
    获取第一样本源图像、第一样本模板图像以及标签图像,其中,所述标签图像是期望所述第一样本源图像与所述第一样本模板图像合成后得到的预先确定的图像;
    对所述第一样本源图像和所述第一样本模板图像执行拼接操作,得到第一样本拼接图像,其中,所述第一样本拼接图像的通道数是所述第一样本源图像和所述第一样本模板图像的通道数之和;
    通过所述第一合成网络对所述第一样本拼接图像执行编码操作,得到通道数逐渐增加的第一样本中间层特征信息;
    通过所述第一合成网络对所述第一样本中间层特征信息执行解码操作,得到通道数逐渐减少的所述第一样本初始合成图像,其中,所述第一样本初始合成图像的通道数与所述第一样本源图像的通道数相同;
    根据所述第一样本初始合成图像、所述第一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值;
    在所述第一目标损失值符合第一损失条件的情况下,将所述第一合成网络确定为所述目标合成网络。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述第一样本初始合成图像、所述第 一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值,包括:
    利用预训练的特征提取模块对所述标签图像执行特征提取操作,以提取所述标签图像不同层级的特征信息,得到第一组样本特征图,其中,所述第一组样本特征图中每个样本特征图对应一个层级的从所述标签图像提取到的特征信息;
    利用所述特征提取模块对所述第一样本初始合成图像执行所述特征提取操作,以提取所述第一样本初始合成图像不同层级的特征信息,得到第二组样本特征图,其中,所述第二组样本特征图中每个样本特征图对应一个层级的从所述第一样本初始合成图像提取到的特征信息;
    根据所述第一组样本特征图和所述第二组样本特征图计算第一损失值,其中,所述第一损失值由不同层级下各个层级的从所述标签图像提取到的特征信息和从所述第一样本初始合成图像提取到的特征信息计算得到;
    将所述第一损失值与所述第一合成网络的重建损失值共同确定为所述第一目标损失值,其中,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
  5. 根据权利要求3所述的方法,其特征在于,所述根据所述第一样本初始合成图像、所述第一样本源图像以及所述标签图像计算所述第一合成网络的第一目标损失值,包括:
    对所述第一样本初始合成图像执行识别操作,得到第一样本特征向量,其中,所述第一样本特征向量用于表示所述第一样本初始合成图像中的源图像身份信息;
    对所述第一样本源图像执行所述识别操作,得到第二样本特征向量,其中,所述第二样本特征向量用于表示所述第一样本源图像中的源图像身份信息;
    根据所述第一样本特征向量和所述第二样本特征向量计算第二损失值,其中,所述第二损失值表示所述第一样本特征向量和所述第二样本特征向量之间的相似度;
    将所述第二损失值与所述第一合成网络的重建损失值共同确定为所述第一目标损失值,其中,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
  6. 根据权利要求1所述的方法,其特征在于,所述对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像,包括:
    将所述源图像、所述模板图像以及所述初始合成图像共同输入目标修正网络,得到所述目标残差图像;
    其中,在所述目标修正网络中通过如下方式得到所述目标残差图像:
    对所述源图像、所述模板图像以及所述初始合成图像执行拼接操作,得到第二拼接图像,其中,所述第二拼接图像的通道数是所述源图像、所述模板图像以及所述初始合成图像的通道数之和;
    对所述第二拼接图像执行编码操作,得到所述通道数逐渐增加的第二中间层特征信息;
    对所述第二中间层特征信息执行解码操作,得到所述通道数逐渐减少的所述目标残差图像,其中,所述目标残差图像的所述通道数与所述初始合成图像的所述通道数相同。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    对初始修正网络进行训练得到所述目标修正网络,其中,通过如下方式对所述初始修正网络进行训练得到所述目标修正网络:
    获取第二样本源图像、第二样本模板图像、标签残差图像以及第二样本初始合成图像,其中,所述第二样本初始合成图像是所述第二样本源图像与所述第二样本模板图像执行所述目标合成操作后得到的图像,所述标签残差图像根据标签图像和所述第二样本初始合成图像确定,所述标签图像是期望所述第二样本源图像与所述第二样本模板图像合成后得到的预先确定的图像;
    对所述第二样本源图像、所述第二样本模板图像以及所述第二样本初始合成图像执行拼接操作,得到第二样本拼接图像,其中,所述第二样本拼接图像的通道数是所述第二样本源图像、所述第二样本模板图像以及所述第二样本初始合成图像的通道数之和;
    对所述第二样本拼接图像执行编码操作,得到通道数逐渐增加的第二样本中间层特征信息;
    对所述第二样本中间层特征信息执行解码操作,得到通道数逐渐减少的所述样本残差图像,其中,所述样本残差图像的通道数与所述第二样本初始合成图像的通道数相同;
    根据所述第二样本源图像、所述标签残差图像、所述第二样本初始合成图像以及所述样本残差图像计算所述初始修正网络的第二目标损失值;
    在所述第二目标损失值符合第二损失条件的情况下,将所述初始修正网络确定为所述目标修正网络。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述第二样本源图像、所述标签残差图像、所述第二样本初始合成图像以及所述样本残差图像计算所述初始修正网络的第二目标损失值,包括:
    根据所述样本残差图像和所述标签残差图像计算第三损失值;
    将所述第三损失值与第二合成网络的重建损失值共同确定为所述第二目标损失值,其中,所述第二合成网络用于生成所述第二样本初始合成图像,所述重建损失值为执行所述编码操作和所述解码操作的损失值。
  9. 根据权利要求8所述的方法,其特征在于,所述将所述第三损失值与第一合成网络的重建损失值共同确定为所述第二目标损失值,包括:
    将所述第二样本初始合成图像与所述样本残差图像合成为样本目标合成图像;
    利用预训练的特征提取模块对所述样本目标合成图像执行特征提取操作,得到第三组样本特征图,其中,所述特征提取模块用于提取不同层级的特征信息,所述第三组样本特征图中每个样本特征图对应一个层级的从所述样本目标合成图像提取到的特征信息;
    利用所述特征提取模块对所述标签图像执行所述特征提取操作,得到第一组样本特征图,其中,所述第一组样本特征图中每个样本特征图对应一个层级的从所述标签图像提取到的特征信息;
    根据所述第三组样本特征图和所述第一组样本特征图计算第四损失值,其中,所述第四损失值由不同层级下的对应层级的从所述样本目标合成图像提取到的特征信息和从所述标签图像提取到的特征信息计算得到;
    将所述第四损失值与所述第三损失值以及所述第二合成网络的重建损失值共同确定为所述第二目标损失值。
  10. 根据权利要求9所述的方法,其特征在于,所述将所述第三损失值与第一合成网络的重建 损失值共同确定为所述第二目标损失值,包括:
    对所述样本目标合成图像执行识别操作,得到第三样本特征向量,其中,所述第三样本特征向量用于表示所述样本目标合成图像中的源图像身份信息;
    对所述第二样本源图像执行所述识别操作,得到第四样本特征向量,其中,所述第四样本特征向量用于表示所述第二样本源图像中的源图像身份信息;
    根据所述第三样本特征向量和所述第四样本特征向量计算所述第二合成网络的第五损失值,其中,所述第二目标损失值包括所述第五损失值,所述第五损失值表示所述第三样本特征向量和所述第四样本特征向量之间的相似度;
    将所述第五损失值与所述第三损失值、所述第四损失值以及所述第二合成网络的重建损失值共同确定为所述第二目标损失值。
  11. 根据权利要求1至10中任一项所述的方法,其特征在于,所述对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像之前,所述方法还包括:
    对所述源图像和所述模板图像分别进行对象检测,得到所述源图像和所述模板图像中的对象区域;
    对所述对象区域进行配准操作,确定所述对象区域中的关键点信息,其中,所述关键点信息用于表示所述对象区域中的对象;
    根据所述关键点信息分别裁剪所述源图像和所述模板图像,得到用于执行所述目标合成操作的所述源图像和所述模板图像。
  12. 一种图像合成装置,其特征在于,包括:
    获取模块,用于获取待处理的源图像和模板图像,其中,所述源图像包括待合成的源图像身份信息,所述模板图像包括待合成的模板图像背景信息;
    第一合成模块,用于对所述源图像和所述模板图像执行目标合成操作,得到初始合成图像,其中,所述初始合成图像包括所述源图像身份信息和所述模板图像背景信息以及待修正的部分区域;
    修正模块,用于对所述源图像、所述模板图像以及所述初始合成图像执行目标修正操作,得到目标残差图像;
    第二合成模块,用于将所述初始合成图像与所述目标残差图像合成为目标合成图像,其中,所述目标合成图像包括所述源图像身份信息和所述模板图像背景信息以及根据所述目标残差图像修正后的所述部分区域。
  13. 一种计算机可读的存储介质,其特征在于,所述计算机可读的存储介质包括存储的程序,其中,所述程序可被终端设备或计算机运行时执行所述权利要求1至11任一项中所述的方法。
  14. 一种计算机程序产品,包括计算机程序/指令,其特征在于,该计算机程序/指令被处理器执行时实现权利要求1至11任一项中所述方法的步骤。
  15. 一种电子设备,包括存储器和处理器,其特征在于,所述存储器中存储有计算机程序,所述处理器被设置为通过所述计算机程序执行所述权利要求1至11任一项中所述的方法。
PCT/CN2023/128423 2022-11-14 2023-10-31 图像合成方法和装置、存储介质及电子设备 WO2024104144A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211422368.8 2022-11-14
CN202211422368.8A CN116958306A (zh) 2022-11-14 2022-11-14 图像合成方法和装置、存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2024104144A1 true WO2024104144A1 (zh) 2024-05-23

Family

ID=88451681

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/128423 WO2024104144A1 (zh) 2022-11-14 2023-10-31 图像合成方法和装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN116958306A (zh)
WO (1) WO2024104144A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958306A (zh) * 2022-11-14 2023-10-27 腾讯科技(深圳)有限公司 图像合成方法和装置、存储介质及电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810725B1 (en) * 2018-12-07 2020-10-20 Facebook, Inc. Automated detection of tampered images
CN113240575A (zh) * 2021-05-12 2021-08-10 中国科学技术大学 人脸伪造视频效果增强方法
CN113706421A (zh) * 2021-10-27 2021-11-26 深圳市慧鲤科技有限公司 一种图像处理方法及装置、电子设备和存储介质
CN116958306A (zh) * 2022-11-14 2023-10-27 腾讯科技(深圳)有限公司 图像合成方法和装置、存储介质及电子设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810725B1 (en) * 2018-12-07 2020-10-20 Facebook, Inc. Automated detection of tampered images
CN113240575A (zh) * 2021-05-12 2021-08-10 中国科学技术大学 人脸伪造视频效果增强方法
CN113706421A (zh) * 2021-10-27 2021-11-26 深圳市慧鲤科技有限公司 一种图像处理方法及装置、电子设备和存储介质
CN116958306A (zh) * 2022-11-14 2023-10-27 腾讯科技(深圳)有限公司 图像合成方法和装置、存储介质及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI LAING ZI: "Shanghai Jiao Tong University found a new way to "change faces": no fear of death lighting, straight male perspective | Open Source", FINANCE, 16 June 2021 (2021-06-16), XP093169982, Retrieved from the Internet <URL:https://finance.sina.com.cn/tech/csj/2021-06-16/doc-ikqciyzi9938396.shtml> *

Also Published As

Publication number Publication date
CN116958306A (zh) 2023-10-27

Similar Documents

Publication Publication Date Title
JP7476428B2 (ja) 画像の視線補正方法、装置、電子機器、コンピュータ可読記憶媒体及びコンピュータプログラム
WO2021052375A1 (zh) 目标图像生成方法、装置、服务器及存储介质
CN112037320B (zh) 一种图像处理方法、装置、设备以及计算机可读存储介质
CN106682632B (zh) 用于处理人脸图像的方法和装置
WO2022078041A1 (zh) 遮挡检测模型的训练方法及人脸图像的美化处理方法
EP3992919B1 (en) Three-dimensional facial model generation method and apparatus, device, and medium
CN111008927B (zh) 一种人脸替换方法、存储介质及终端设备
CN111652123B (zh) 图像处理和图像合成方法、装置和存储介质
CN111583399A (zh) 图像处理方法、装置、设备、介质和电子设备
WO2024104144A1 (zh) 图像合成方法和装置、存储介质及电子设备
JP2016085579A (ja) 対話装置のための画像処理装置及び方法、並びに対話装置
US20230100427A1 (en) Face image processing method, face image processing model training method, apparatus, device, storage medium, and program product
CN115272570A (zh) 虚拟表情生成方法、装置、电子设备和存储介质
CN114821675B (zh) 对象的处理方法、系统和处理器
WO2023184817A1 (zh) 图像处理方法、装置、计算机设备、计算机可读存储介质及计算机程序产品
CN115171199B (zh) 图像处理方法、装置及计算机设备、存储介质
CN112528902A (zh) 一种基于3d人脸模型的视频监控动态人脸识别方法及装置
CN111898571A (zh) 动作识别系统及方法
CN111028318A (zh) 一种虚拟人脸合成方法、系统、装置和存储介质
CN115147261A (zh) 图像处理方法、装置、存储介质、设备及产品
CN113886510A (zh) 一种终端交互方法、装置、设备及存储介质
CN111325252B (zh) 图像处理方法、装置、设备、介质
Liu et al. 3DFP-FCGAN: Face completion generative adversarial network with 3D facial prior
Jabbar et al. FD-stackGAN: face de-occlusion using stacked generative adversarial networks
CN115665361A (zh) 虚拟环境中的视频融合方法和在线视频会议通信方法