WO2024060669A1 - Action migration method and apparatus, and terminal device and storage medium - Google Patents

Action migration method and apparatus, and terminal device and storage medium

Info

Publication number
WO2024060669A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
map
segmentation
foreground image
foreground
Prior art date
Application number
PCT/CN2023/097712
Other languages
French (fr)
Chinese (zh)
Inventor
LIU Xinchen (刘鑫辰)
LIU Wu (刘武)
YANG Quanwei (杨权威)
MEI Tao (梅涛)
Original Assignee
Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)
Publication of WO2024060669A1 publication Critical patent/WO2024060669A1/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Definitions

• This application relates to the field of image processing technology, and for example to an action migration method and apparatus, a terminal device, and a storage medium.
• Action transfer refers to generating a new video based on a source image and a driving video.
• The new video contains the character from the source image, and that character performs the same actions as the character in the driving video.
• In the related art, an affine transformation (which may be called a warp operation) is usually performed on the source image, or on its encoded feature map, according to the driving video to generate the new video.
  • This application provides action migration methods, devices, terminal equipment and storage media, which can better adapt to scenes with drastically different postures, ensure the authenticity of the characters in the generated videos, and improve the user's visual experience.
• this application provides an action migration method, including:
• the key point connection map is used to characterize the driving posture of the first object;
• the second foreground image is fused with the first background image of the source image to obtain an action migration image.
  • this application provides an action migration device, including:
• the image acquisition module is configured to acquire the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image; the key point connection map is used to characterize the driving posture of the first object;
  • a first generation module configured to generate a second segmentation map of each preset area that conforms to the driving posture based on the key point connection map and the first segmentation map;
  • a second generation module configured to generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image;
  • a synthesis module configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  • this application provides a terminal device, including:
• one or more processors;
• a memory configured to store one or more programs;
• when the one or more programs are executed by the one or more processors, the one or more processors implement the above-mentioned action migration method.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
• When the program is executed by a processor, the above-mentioned action migration method is implemented.
  • Figure 1 is a flow chart of an action migration method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 3 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 4 is a schematic architectural diagram of a local generation network provided by an embodiment of the present application.
  • Figure 5 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 6 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of the architecture of an overall synthesis network provided by an embodiment of the present application.
  • Figure 8 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of an action migration device provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
  • Figure 1 is a flow chart of an action migration method provided by an embodiment of the present application.
  • the action migration method provided by an embodiment of the present application can be applied to the situation of migrating the posture of objects in images and/or videos, such as the situation of human body movement migration.
  • the method can be executed by an action migration device, which is implemented in software and/or hardware, for example, configured in a terminal device, such as a computer device.
  • the action migration method provided in the embodiment of this application may include the following steps:
  • action migration refers to the process of generating a new image based on the source image and the driving image.
• the new image contains the second object from the source image, and the second object performs the same action as the first object in the driving image.
  • the driving image refers to an image with a driving posture.
  • the driving images may be video frames in the driving video.
  • the first object in the driving image refers to a person or other area of interest, which is not limited here.
  • the key point connection diagram refers to a posture connection diagram in which multiple key points of the first object are connected in a predefined connection manner.
  • the source image refers to an image with a source pose of the second object
  • the second object in the source image refers to a person or other area of interest.
  • the first segmentation map refers to the segmentation map of each preset area of the second object, which may include but is not limited to the body part area segmentation map and the background area segmentation map.
• The body parts of the second object may include, but are not limited to, the head, tops, bottoms, shoes, and limbs, which are not limited here.
  • the number of channels of the first segmentation map can be determined according to the number of divided body parts, such as 18 channels, 6 channels, 5 channels, etc., which are not limited here.
• The terms first object and second object are only used to distinguish the object in the driving image from the object in the source image, and are not necessarily used to describe a specific order or sequence.
  • the first object may be a driving character
  • the second object may be a source character
  • the preset area may be a body part of the character or other areas of interest.
• the human body key point detection model OpenPose can be used to predict the driving image and obtain the two-dimensional key points of the driving character.
• OpenPose is an open-source two-dimensional human body key point detection model; the two-dimensional key points of the driving character are connected according to the predefined connection method to obtain the key point connection map of the driving character.
• the key point connection map may be a Red-Green-Blue (RGB) key point connection map, and H × W represents the resolution of the image.
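• To make this step concrete, the following minimal sketch (illustrative only, not taken from the application) rasterises OpenPose-style 2D keypoints into an H × W RGB connection map; the limb pairs, colours, and confidence threshold are assumptions, since the application only states that a predefined connection scheme is used:

```python
import cv2
import numpy as np

# Illustrative subset of OpenPose-style limb pairs (indices into the
# detected keypoint array); the application's actual "predefined
# connection method" is not enumerated, so this list is an assumption.
LIMB_PAIRS = [(0, 1), (1, 2), (2, 3), (3, 4),    # head and right arm
              (1, 5), (5, 6), (6, 7),            # left arm
              (1, 8), (8, 9), (9, 10),           # right leg
              (1, 11), (11, 12), (12, 13)]       # left leg

def draw_keypoint_connection_map(keypoints, h, w):
    """Rasterise (K, 3) keypoints of (x, y, confidence) into an
    H x W RGB key point connection map."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for idx, (a, b) in enumerate(LIMB_PAIRS):
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca < 0.1 or cb < 0.1:  # skip limbs with low-confidence joints
            continue
        # one distinct colour per limb, sampled from an HSV colormap
        color = cv2.applyColorMap(
            np.uint8([[idx * 255 // len(LIMB_PAIRS)]]),
            cv2.COLORMAP_HSV)[0, 0]
        cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)),
                 tuple(int(c) for c in color), thickness=3)
    return canvas
```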
• In an embodiment, this application uses the Self-Correction for Human Parsing (SCHP) model to obtain an 18-channel semantic segmentation map of the source image; taking into account the texture characteristics of different parts of the human body, the 18 channels of the semantic segmentation map are merged into 6 channels, namely head, tops, bottoms, shoes, limbs, and background, thereby obtaining the first segmentation map of the preset areas of the source character in the source image.
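• The 18-to-6 channel merge can be implemented as a simple channel-group summation. The sketch below assumes a one-hot (or soft) parsing tensor; the exact label-to-group assignment used by the application is not disclosed, so the grouping here is hypothetical:

```python
import torch

# Hypothetical grouping of the 18 SCHP semantic labels into the six
# coarse regions named in the text (head, tops, bottoms, shoes, limbs,
# background); the real label ids may differ.
GROUPS = {
    0: [1, 2, 4, 13],        # head: hat, hair, sunglasses, face
    1: [5, 7, 10, 11],       # tops: upper clothes, coat, jumpsuit, scarf
    2: [6, 8, 9, 12],        # bottoms: dress, socks, pants, skirt
    3: [14, 15],             # shoes: left / right shoe
    4: [3, 16, 17],          # limbs: arms and legs
    5: [0],                  # background
}

def merge_parsing_channels(parsing18: torch.Tensor) -> torch.Tensor:
    """(18, H, W) semantic segmentation map -> (6, H, W) merged map."""
    merged = torch.zeros(6, *parsing18.shape[1:], dtype=parsing18.dtype)
    for out_ch, in_chs in GROUPS.items():
        merged[out_ch] = parsing18[in_chs].sum(dim=0)
    return merged
```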
  • the second segmentation map refers to a preset area analysis map of the second object in the driving posture.
  • the second segmentation map includes a preset area of the second object that conforms to the driving posture.
  • the key point connection map and the first segmentation map can be input into a pre-trained first neural network model to obtain a second segmentation map, thereby realizing the transformation of the segmentation map corresponding to the second object from the source pose to the driving pose.
  • the network structure of the first neural network model is not limited here, for example, the network structure can be composed of at least one encoder and at least one decoder.
  • the second foreground image refers to a preset area foreground image of the second object in the driving posture, where the foreground image refers to the object area in the image excluding the background.
  • the second foreground image is composed of a plurality of preset area foreground images of the second object that conform to the driving posture.
• the second segmentation maps of multiple preset areas and the first foreground image of the source image can be input into the pre-trained second neural network model to obtain the foreground images of the preset areas of the second object under the driving posture, and the preset-area foreground images under the driving posture are combined to obtain the second foreground image, which realizes the transformation of the foreground image corresponding to the second object from the source posture to the driving posture.
  • the network structure of the second neural network model is not limited here.
  • the network structure may consist of at least one encoder and at least one decoder.
  • the motion transfer image refers to an image with the source image background as the background and including the second object in the driving posture. That is, the motion transfer image refers to an image in which the first object's posture is transferred to the second object.
  • the foreground in the second foreground image can be embedded into the corresponding position of the first background image of the source image to achieve image fusion.
  • the first foreground image and the second foreground image can also be texture aligned, and the foreground in the texture-aligned second foreground image can be embedded into the corresponding position of the first background image of the source image to achieve image fusion.
  • the image fusion method is not limited here.
• In the action migration method provided by this application, the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image are used to obtain a second segmentation map of each preset area that conforms to the driving posture, which realizes the transformation of the segmentation map corresponding to the second object from the source posture to the driving posture; according to the second segmentation maps of multiple preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture is generated, which realizes the transformation of the foreground image corresponding to the second object from the source posture to the driving posture and assigns the texture of the second object of the source image to the second segmentation map.
• The second foreground image is then fused with the first background image of the source image to obtain a realistic action migration image. This application abandons the conventional warp operation between the source pose and the driving pose, so it can better adapt to scenarios with drastically different postures, ensure the authenticity of the characters in the generated video, and improve the user's visual experience.
  • Figure 2 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, after generating the second segmentation map of each preset area that conforms to the driving posture, the method further includes: determining alignment parameters based on the first segmentation map and the second segmentation map. Before generating the second foreground image of the second object in the driving posture, the method further includes: transforming the first foreground image according to the alignment parameters, so that the first foreground image is aligned with the second segmentation map.
  • the method of this embodiment may include:
  • S220 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
• the alignment parameters refer to parameters used to align the first foreground image and the second segmentation map. Aligning the first foreground image with the second segmentation map through the alignment parameters compensates for large differences in size and spatial position between the source character and the driving character.
• In an embodiment, determining the alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following: determining a scaling parameter according to the sizes of the preset areas in the first segmentation map and the second segmentation map; and determining a displacement parameter according to the center coordinates of the preset areas in the first segmentation map and the second segmentation map.
  • the scaling parameter refers to the parameter that controls the zoom size of the image.
  • the first mask height is determined based on the first segmentation map, and the second mask height is determined based on the second segmentation map; the scaling parameter is determined based on the first mask height and the second mask height.
  • the displacement parameter is a parameter that characterizes the position offset of the image.
  • the first mask center coordinate is determined based on the first segmentation map, and the second mask center coordinate is determined based on the second segmentation map; the displacement parameter is determined based on the first mask center coordinate and the second mask center coordinate.
• In an embodiment, the scaling parameter can be written as R = H_d / H_s, where R represents the scaling parameter, H_s represents the height of the human-body mask in the first segmentation map, and H_d represents the height of the human-body mask in the second segmentation map. The displacement parameter can be written as c = (c_x, c_y), where c represents the displacement parameter, c_x represents the horizontal difference between the first mask center coordinate and the second mask center coordinate, and c_y represents the vertical difference between the first mask center coordinate and the second mask center coordinate.
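• A minimal sketch of computing these alignment parameters from two binary human-body masks follows; whether R is defined as H_d / H_s or its inverse is a convention the application does not spell out, so the direction below is an assumption:

```python
import torch

def alignment_params(mask_src: torch.Tensor, mask_drv: torch.Tensor):
    """Scaling and displacement parameters from two (H, W) boolean
    human-body masks taken from the first / second segmentation map."""
    ys_s, xs_s = torch.nonzero(mask_src, as_tuple=True)
    ys_d, xs_d = torch.nonzero(mask_drv, as_tuple=True)
    h_s = (ys_s.max() - ys_s.min()).float()   # first mask height H_s
    h_d = (ys_d.max() - ys_d.min()).float()   # second mask height H_d
    r = h_d / h_s                             # scaling parameter R (assumed ratio)
    c_x = xs_d.float().mean() - xs_s.float().mean()  # horizontal centre offset
    c_y = ys_d.float().mean() - ys_s.float().mean()  # vertical centre offset
    return r, (c_x, c_y)
```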
• The first foreground image is transformed through the alignment parameters so that it is aligned with the second segmentation map, which lays the foundation for subsequently generating the second foreground image of the second object in the driving posture from the second segmentation maps of multiple preset areas and the first foreground image, and improves the image quality of the second foreground image.
• On the basis of the above embodiments, the embodiment of the present application adds: "determine the alignment parameters according to the first segmentation map and the second segmentation map; transform the first foreground image according to the alignment parameters, so that the first foreground image and the second segmentation map are aligned".
  • the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept.
  • Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
  • FIG 3 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, the second segmentation map is generated through a first generative adversarial network, and the step of generating the second segmentation map through the first generative adversarial network includes: encoding the first segmentation map through a first encoder to obtain a first feature map; encoding the key point connection map through a second encoder to obtain a second feature map; and decoding the fusion map of the first feature map and the second feature map through a first decoder to obtain the second segmentation map.
  • the method in this embodiment may include:
  • S330 Encode the key point connection map through the second encoder to obtain the second feature map.
  • S340 Use the first decoder to decode the fusion map of the first feature map and the second feature map to obtain a second segmentation map.
  • the second segmentation map may be generated by a first generative adversarial network
  • the first generative adversarial network may include a first encoder, a second encoder, and a first decoder.
  • the first encoder is used to encode the first segmentation graph
  • the second encoder is used to encode the key point connection graph
• the first decoder is used to decode the fusion map of the encoding results of the first encoder and the second encoder, which results in a clearer and sharper second segmentation map.
  • the first encoder, the second encoder and the first decoder are used to distinguish the different functions of the encoder or decoder, and are not necessarily used to describe a specific order or sequence.
• In an embodiment, the method may further include: obtaining the historical second segmentation maps corresponding to a first preset number of video frames preceding the current video frame; and encoding at least one historical second segmentation map through a third encoder to obtain a third feature map.
• Correspondingly, decoding the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map includes: decoding the fusion map of the first feature map, the second feature map, and the third feature map through the first decoder to obtain the second segmentation map.
  • the historical second segmentation map refers to one or more second segmentation images before the current video frame.
  • the historical second segmentation map is input to the first generative adversarial network, so that the first generative adversarial network can effectively extract the relationship between different video frames, thereby improving the temporal consistency of the video.
• In an embodiment, the method further includes: decoding the fusion map of the second feature map and the third feature map through a second decoder to obtain optical flow parameters and weight parameters; after obtaining the second segmentation map, the method further includes: adjusting the second segmentation map according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
• The second decoder is used to decode the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and weight parameters, where the optical flow parameters refer to the instantaneous velocity of pixel movement of a moving object on the observation imaging plane.
  • Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application.
  • the local generation network is used to generate images of each area of the source character in the driving posture; the local generation network includes a first generative adversarial network.
  • the first generative adversarial network can be a Layout generative adversarial network (GAN) with a vid2vid framework.
• the first generative adversarial network includes three encoders and two decoders.
• The first encoder may be an encoder E_1^l that encodes the first segmentation map to obtain the first feature map, where l is the identifier of Layout GAN and indicates that the first encoder belongs to the Layout GAN network.
• The second encoder may be an encoder E_2^l used to encode multiple key point connection maps stitched along the channel dimension to obtain the second feature map, where t represents the current moment and t-1 and t-2 represent the two consecutive historical moments before the current moment. The third encoder may be an encoder E_3^l used to encode the two historical second segmentation maps generated at the previous moments to obtain the third feature map. The first decoder D_1^l decodes the added features to obtain the raw result; similarly, the second decoder D_2^l decodes the added features to obtain the optical flow parameter O and its weight parameter w. The final second segmentation map of Layout GAN can then be formulated as: final layout = w ⊙ Warp(previous layout, O) + (1 − w) ⊙ raw layout.
  • Warp(I,O) represents the affine transformation of image I based on the optical flow parameter O.
• the optical flow and warp operations used in this implementation are based on adjacent frames, and their purpose is to improve the temporal consistency of the generated video; they are not a transformation between the source image and the driving image.
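• The adjacent-frame fusion above can be sketched with a standard backward warp; this is a generic PyTorch implementation under the assumption that the optical flow is expressed in pixels, not the application's actual code:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (N, C, H, W) by pixel-space optical flow
    (N, 2, H, W) using bilinear sampling."""
    n, _, h, w = image.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((gx, gy), dim=0).float().to(image.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # displaced pixel coords
    # normalise to [-1, 1] as required by grid_sample
    gx_n = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy_n = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx_n, gy_n), dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def fuse_layout(raw_layout, prev_layout, flow, weight):
    # final layout = w * Warp(prev, O) + (1 - w) * raw, per the formula above
    return weight * warp(prev_layout, flow) + (1.0 - weight) * raw_layout
```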
  • the training step of the first generative adversarial network may include: obtaining a third segmentation map of each preset area of the first object in the sample driving image; determining a first loss of the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the first generative adversarial network according to the first loss.
  • the third segmentation map refers to a segmentation map driving a preset area of the first object in the image.
• the first generative adversarial network can be trained in advance through multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data, that is, the character in the sample driving image and the character in the sample source image are the same.
• For example, a forward-facing human video frame in a video is selected as the sample source image, and the video itself is used as the sample driving video.
• The forward-facing human body video frame is selected as the source image because it contains more appearance details of the source character.
• In an embodiment, the first loss may be a cross-entropy loss: the cross-entropy loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image is calculated, and the network parameters of the first generative adversarial network are adjusted based on this loss so that the cross-entropy loss gradually decreases and becomes stable, until network training is completed and the first generative adversarial network is obtained.
  • the embodiments of the present application refine the technical features of generating the second segmentation map.
  • the embodiments of the present application and the action migration method proposed in the above embodiments belong to the same concept.
• Technical details that are not described in detail in this embodiment can be referred to the above embodiments, and this embodiment has the same effect as the above embodiments.
  • FIG. 5 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, the second foreground image is generated through a second generative adversarial network, and the step of generating the second foreground image through the second generative adversarial network may include: encoding the second segmentation maps of the plurality of preset areas through a fourth encoder to obtain a fourth feature map; encoding the first foreground image through a fifth encoder to obtain a fifth feature map; and decoding the fusion map of the fourth feature map and the fifth feature map through a third decoder to obtain the second foreground image.
  • the method in this embodiment may include:
  • S420 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S430 Use the fourth encoder to encode the second segmentation maps of the plurality of preset areas to obtain a fourth feature map.
  • S440 Encode the first foreground image through the fifth encoder to obtain the fifth feature map.
  • S450 Decode the fusion image of the fourth feature map and the fifth feature map through a third decoder to obtain a second foreground image.
  • S460 Fusing the second foreground image with the first background image of the source image to obtain an action migration image.
  • the second foreground image is generated by a second generative adversarial network.
• In this embodiment, the second generative adversarial network may include a fourth encoder, a fifth encoder, and a third decoder, where the fourth encoder is used to encode the second segmentation maps of multiple preset areas, the fifth encoder is used to encode the first foreground image, and the third decoder is used to decode the fusion map of the fourth feature map and the fifth feature map to obtain the second foreground image.
  • the fourth encoder, the fifth encoder and the third decoder are used to distinguish between encoders or decoders with different functions, and are not necessarily used to describe a specific order or sequence.
• In an embodiment, encoding the first foreground image through the fifth encoder to obtain the fifth feature map may include: obtaining the historical second foreground images corresponding to a preset number of video frames preceding the current video frame; and encoding, through the fifth encoder, the fusion map of the first foreground image and at least one historical second foreground image to obtain the fifth feature map.
  • the historical second foreground image refers to one or more second foreground images before the current video frame.
• The historical second foreground images are input into the second generative adversarial network so that it can effectively extract the relationship between different video frames, thereby improving the temporal consistency of the video.
  • Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application; the local generation network also includes a second generation adversarial network.
• the second generative adversarial network in this embodiment can be a Region GAN with a vid2vid framework.
• The Region GAN is only used to generate the initial region images, so this embodiment uses one generator to generate the 5 regions of the human body (no background region is generated). Doing so not only saves computing resources, but also prevents the model from overfitting.
• In an embodiment, the second generative adversarial network may include a fourth encoder E_1^r, a fifth encoder E_2^r, and a third decoder D^r, where r is the identifier of Region GAN and indicates that they belong to the Region GAN network.
• The fourth encoder E_1^r encodes the mask of the i-th preset area to obtain the fourth feature map, and the fifth encoder E_2^r encodes the maps spliced along the channel dimension to obtain the fifth feature map, where the spliced input consists of the historical second foreground images of the i-th preset area before the current time t together with I_{s,i}, the first foreground image of the i-th preset area. The original Region GAN output for the i-th preset area can thus be expressed as the result of decoding, with D^r, the sum of the fourth feature map and the fifth feature map.
• this application proposes to use a global alignment module (GAM) to perform an affine transformation on the first foreground image FG_s so that it matches the corresponding second segmentation map. First, the human-body masks are calculated from the first segmentation map L_s and the second segmentation map; the first foreground image FG_s of the source image I_s can then be obtained as FG_s = I_s ⊙ M_s, where M_s is the human-body mask of the source image.
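• The GAM step reduces to masking out the source foreground and applying a scale-and-shift affine warp. The sketch below uses the alignment convention assumed earlier and is not the application's own implementation; the sign and scale conventions of `theta` are assumptions:

```python
import torch
import torch.nn.functional as F

def extract_foreground(image, mask):
    """FG_s = I_s * M_s: keep only the human-body pixels (N, C, H, W)."""
    return image * mask.unsqueeze(1)

def global_align(fg_src, r, c_x, c_y, height, width):
    """Scale-and-shift affine transform of the source foreground so it
    matches the driving layout."""
    # affine_grid works in normalised coordinates, so convert pixel offsets
    theta = torch.tensor([[1.0 / r, 0.0, -2.0 * c_x / width],
                          [0.0, 1.0 / r, -2.0 * c_y / height]],
                         dtype=fg_src.dtype, device=fg_src.device)
    theta = theta.unsqueeze(0).repeat(fg_src.size(0), 1, 1)  # (N, 2, 3)
    grid = F.affine_grid(theta, fg_src.shape, align_corners=True)
    return F.grid_sample(fg_src, grid, align_corners=True)
```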
• In an embodiment, the training step of the second generative adversarial network may include: obtaining a third segmentation map of each preset area of the first object in the sample driving image; determining a second loss between the second foreground image corresponding to the sample source image and the foreground ground-truth image corresponding to the sample source image; determining a third loss between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the second generative adversarial network according to the second loss and the third loss.
  • the second loss may include reconstruction loss and perceptual loss;
  • the third loss may be adversarial loss, that is, image distribution loss.
  • the second generative adversarial network can be pre-trained using a plurality of sample driving images and sample source images, wherein the sample driving images and sample source images can be paired training data.
• The reconstruction loss and perceptual loss are calculated between the second foreground image corresponding to the sample source image and the foreground ground-truth image corresponding to the sample source image, and the adversarial loss is calculated between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image.
  • the network parameters of the second generative adversarial network are adjusted so that the reconstruction loss, perceptual loss and adversarial loss gradually decrease and tend to be stable until the network training is completed, thereby obtaining the second generative adversarial network.
• this embodiment uses the L1 reconstruction loss. Compared with the L2 reconstruction loss, the L1 reconstruction loss pays more attention to the subtle differences between the generated image and the real image.
• The calculation formula is L_rec = ||F_gen − F_real||_1, where F_gen denotes the generated second foreground image and F_real the corresponding ground truth.
• the perceptual loss is used to constrain the generated image and the real image to be close in a multi-dimensional feature space.
• The perceptual loss includes a feature content loss and a feature style loss, which can be expressed as L_per = Σ_j ( ||φ_j(F_gen) − φ_j(F_real)||_1 + ||G(φ_j(F_gen)) − G(φ_j(F_real))||_1 ),
• where j represents the j-th layer of the pre-trained Visual Geometry Group (VGG)-19 model
• and G represents the Gram matrix calculated from the feature map.
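• A compact PyTorch sketch of these two losses follows; the VGG-19 tap layers and the use of L1 distance inside the perceptual terms are assumptions, as the application only names the loss families:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class ReconPerceptualLoss(nn.Module):
    """L1 reconstruction + VGG-19 content / Gram-style perceptual loss."""

    def __init__(self, taps=(3, 8, 17, 26)):   # assumed VGG-19 tap layers
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps = set(taps)
        self.l1 = nn.L1Loss()

    @staticmethod
    def gram(feat):
        n, c, h, w = feat.shape
        f = feat.reshape(n, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)  # Gram matrix G

    def forward(self, fake, real):
        loss = self.l1(fake, real)                  # L1 reconstruction loss
        x, y = fake, real
        for j, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if j in self.taps:
                loss = loss + self.l1(x, y)                         # content term
                loss = loss + self.l1(self.gram(x), self.gram(y))   # style term
        return loss
```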
• For the adversarial loss, this application uses the multi-scale conditional discriminator proposed in pix2pixHD, which takes the image and the corresponding area mask together as its input.
• The overall generator objective combines the three losses and can be written as L = L_adv + λ_rec · L_rec + λ_per · L_per, where λ_rec and λ_per are the weights of the reconstruction loss and the perceptual loss respectively.
• the training process of the first generative adversarial network and the second generative adversarial network also includes training the corresponding discriminators, whose loss function is the standard adversarial discriminator loss.
  • the embodiment of the present application adds detailed features for determining the second foreground image.
  • the action migration method proposed in the embodiments of the present application and the above-mentioned embodiments belong to the same concept.
• Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiments, and this embodiment has the same effect as the above-mentioned embodiments.
  • Figure 6 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, after generating the second foreground image of the second object in the driving posture, the method further includes: determining texture enhancement parameters according to the first foreground image and the second foreground image; and performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
  • the method in this embodiment may include:
  • S520 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S530 Generate a second foreground image of the second object in the driving posture according to the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
  • S550 Perform texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
  • S560 Fusion of the second foreground image and the first background image of the source image to obtain a motion migration image.
  • the texture enhancement parameters refer to adjustment parameters used to enhance the image texture.
• the texture enhancement parameters are used to align the features of the first foreground image with the features of the second foreground image to retain more details, such as clothing texture and body edges.
• In an embodiment, determining the texture enhancement parameters according to the first foreground image and the second foreground image may include: encoding the first foreground image through a sixth encoder to obtain a sixth feature map; encoding the second foreground image through a seventh encoder to obtain a seventh feature map; expanding the sixth feature map and the seventh feature map along the channel dimension to obtain an eighth feature map and a ninth feature map respectively; and using the correlation matrix of the eighth feature map and the ninth feature map as the texture enhancement parameter.
  • Figure 7 is a schematic architectural diagram of an overall synthesis network provided by an embodiment of the present application.
• The overall synthesis network is used to integrate the images of the different regions generated by the local generation network to produce the final action migration image; at the same time, it generates an appropriate background for the action migration image. The loss of the overall synthesis network is the loss between the generated action migration image and the driving image.
• this embodiment proposes a texture alignment module (TAM) to better fuse the feature maps.
• The correlation matrix between the eighth feature map H_1 and the ninth feature map H_2 is computed from their position-wise features, where H_{1,i} represents the feature of H_1 at position i and H_{2,j} represents the feature of H_2 at position j.
• In an embodiment, performing texture enhancement on the second foreground image according to the texture enhancement parameter and the first foreground image may include: determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter; integrating the fusion map of the texture enhancement map and the ninth feature map along the channel dimension to obtain a tenth feature map; and decoding the tenth feature map through a fourth decoder to obtain the texture-enhanced second foreground image.
• The tenth feature map is obtained by applying the texture enhancement parameter to the eighth feature map to produce the texture enhancement map, and then integrating that map with the ninth feature map along the channel dimension; the texture-enhanced second foreground image is then obtained by decoding the tenth feature map with the fourth decoder.
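• An illustrative sketch of this texture alignment computation follows, with flattened feature maps, a dot-product correlation matrix, and a 1x1-convolution channel integration; the softmax normalisation over source positions is an assumption:

```python
import torch
import torch.nn as nn

def texture_align(h1, h2, integrate):
    """h1: eighth feature map (source), h2: ninth feature map (generated),
    both (N, C, H, W); `integrate` is a 1x1 conv acting as the channel
    integration that yields the tenth feature map."""
    n, c, hh, ww = h1.shape
    f1 = h1.reshape(n, c, hh * ww)               # H1, flattened over positions
    f2 = h2.reshape(n, c, hh * ww)               # H2, flattened over positions
    corr = torch.einsum("nci,ncj->nij", f1, f2)  # A[i, j] = H1_i . H2_j
    attn = corr.softmax(dim=1)                   # weights over source positions i
    enhanced = torch.einsum("nci,nij->ncj", f1, attn)  # texture enhancement map
    enhanced = enhanced.reshape(n, c, hh, ww)
    return integrate(torch.cat([enhanced, h2], dim=1))  # tenth feature map

# usage sketch (the channel count 256 is arbitrary)
# integrate = nn.Conv2d(2 * 256, 256, kernel_size=1)
```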
  • the embodiments of the present application add the technical feature of texture enhancement to achieve better fusion of feature maps.
  • the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept.
  • Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
  • FIG 8 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, fusing the second foreground image with the first background image of the source image may include: determining a pose mask map based on the second segmentation map and the key point connection map; determining a second background image based on the pose mask map and the first background image; and fusing the second foreground image with the second background image.
  • the method in this embodiment may include:
  • S620 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S630 Generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
• The pose mask map refers to a soft mask image containing the posture of the first object; the corresponding position of the first background image is masked through the pose mask map to obtain a second background image in which the region of the first object is covered.
  • the fifth decoder decodes the fusion map of the seventh feature map and the eleventh feature map to obtain the pose mask map.
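• The final composition can then be sketched as straightforward soft-mask blending:

```python
def compose(fg, bg_src, pose_mask):
    """fg: generated second foreground image, bg_src: first background
    image, pose_mask: soft mask in [0, 1] with shape (N, 1, H, W).
    A minimal sketch of the fusion described above."""
    bg = bg_src * (1.0 - pose_mask)   # second background image
    return fg * pose_mask + bg        # action migration image
```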
  • the second object includes a virtual object.
  • the virtual objects can be digital people, virtual customer service, and virtual anchors.
  • the second object may also be a real person object.
  • the action migration method of this embodiment can be applied to the action driving and generation of digital people, virtual customer service, and virtual anchors, thereby improving the fidelity and richness of virtual character actions and improving the user experience.
• this application also designs a discriminator specifically for the face area to generate realistic faces.
• The face regions of the source image and the action migration image are input into this discriminator to train the model, making the foreground faces generated by the generator more realistic.
• Given a source image I_s and a driving video {I_d^t}, where I_d^t represents the video frame of the driving video at time t,
• the goal of this embodiment is to generate a new video in which
• the person in the source image performs the actions of the person in the driving video.
• The entire scheme can be formulated as {Î_t} = {F(I_s, I_d^t)}, t = 1, ..., N,
• where F(·,·) represents the generation model in this embodiment
• and N represents the number of driving video frames.
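• Framed as code, the scheme is a per-frame loop; `model` below stands in for the whole pipeline (Layout GAN, Region GAN, then the overall synthesis network) and is an assumed callable, not an API from the application:

```python
def transfer_video(model, source_image, driving_frames):
    """{I_hat_t} = {F(I_s, I_d^t)} for t = 1..N."""
    return [model(source_image, frame) for frame in driving_frames]
```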
• this embodiment selects a forward-facing human video frame in a video as the source image, and this video is used as the driving video.
• The forward-facing human video frame is chosen as the source image because it contains more appearance details of the source character.
• This embodiment adopts a step-by-step training strategy: first, the Layout GAN and the Region GAN are each trained for 10 epochs; then the output of the Region GAN is used to train the overall synthesis network for 10 epochs.
• the selection of the driving video is not restricted, as long as it is a clear motion video of a single person, and this embodiment can perform end-to-end inference.
  • the generation framework proposed in this embodiment has achieved the best or equivalent results on two public datasets (iPER and SoloDance datasets).
  • the experimental results on the two datasets are shown in Tables 1 and 2, respectively, where Structural Similarity (SSIM) and Peak Signal to Noise Ratio (PSNR) are evaluation indicators based on similarity, and the larger the value, the better the quality of the generated image.
• Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) are evaluation indicators based on feature distance, and the smaller the value, the better the quality of the generated image.
• The Temporally Consistent Mode (TCM) indicator evaluates the temporal consistency of the generated video.
  • the experimental results of the embodiment of the present application have achieved the best results in all evaluation indicators on the iPER dataset.
• In Table 2, although the SSIM and PSNR indicators are not optimal, the corresponding Mask-SSIM and Mask-PSNR indicators are optimal (the values in brackets are obtained by setting the background area of the image to 0 through the human body mask and then calculating the SSIM and PSNR indicators). This shows that the quality of the human body images generated by this method is better than that of C2F.
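• The bracketed Mask-SSIM and Mask-PSNR values can be reproduced with the procedure just described: zero the background with the human-body mask, then compute the ordinary indicators. A sketch using scikit-image (an assumed tooling choice, not the application's evaluation code):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def masked_psnr_ssim(generated, reference, body_mask):
    """generated / reference: (H, W, 3) arrays in the 0-255 range;
    body_mask: (H, W) array of 0/1 selecting the human body."""
    g = generated * body_mask[..., None]   # background set to 0
    r = reference * body_mask[..., None]
    psnr = peak_signal_noise_ratio(r, g, data_range=255)
    ssim = structural_similarity(r, g, channel_axis=-1, data_range=255)
    return psnr, ssim
```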
  • this application can better handle the situation of drastic changes in posture while retaining the appearance details of the source character.
  • the motion transfer images generated by this application generally have clearer facial details. This is due to the progressive generative model in this application, where the initial face region image provides an important template for the final clear face.
  • the embodiment of the present application adds the technical details of determining the second background image based on the above embodiment.
  • the action migration method proposed in the embodiment of the present application and the above embodiment belongs to the same concept, and the technical details not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment has the same effect as the above embodiment.
  • FIG. 9 is a schematic structural diagram of a motion migration device provided by an embodiment of the present application.
  • the embodiment of the present application may be applicable to motion migration of objects in images or videos, such as human body motion migration.
• The action migration device provided by this application can implement the action migration method provided in any of the above embodiments.
  • the action migration device in the embodiment of the present application may include:
• the image acquisition module 710 is configured to acquire the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image, the key point connection map being used to characterize the driving posture of the first object; the first generation module 720 is configured to generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map; the second generation module 730 is configured to generate a second foreground image of the second object under the driving posture according to the second segmentation maps of the multiple preset areas and the first foreground image of the source image; the synthesis module 740 is configured to fuse the second foreground image with the first background image of the source image to obtain the action migration image.
  • the action migration device further includes:
• the alignment parameter determination module is configured to determine the alignment parameters based on the first segmentation map and the second segmentation map; the image alignment module is configured to transform the first foreground image based on the alignment parameters to align the first foreground image with the second segmentation map.
  • the alignment parameters are determined according to the first segmentation map and the second segmentation map, including at least one of the following:
  • the scaling parameter is determined according to the size of the preset area in the first segmentation map and the second segmentation map; the displacement parameter is determined based on the center coordinates of the preset area in the first segmentation map and the second segmentation map.
  • the first generation module 720 includes:
• the first encoding unit is configured to encode the first segmentation map through the first encoder to obtain the first feature map; the second encoding unit is configured to encode the key point connection map through the second encoder to obtain the second feature map; the first decoding unit is configured to decode the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map.
  • the device further includes:
• the historical second segmentation map acquisition module is configured to acquire the historical second segmentation maps corresponding to a first preset number of video frames preceding the current video frame;
• the historical segmentation map encoding module is configured to encode at least one historical second segmentation map through a third encoder to obtain a third feature map;
  • the first decoding unit is also set to:
  • the fusion map of the first feature map, the second feature map and the third feature map is decoded by the first decoder to obtain a second segmentation map.
  • the device is further configured to:
• the second decoder decodes the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and weight parameters; and the second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
  • the training steps of the first generative adversarial network include:
  • the second generation module 730 includes:
• the fourth encoding unit is configured to encode the second segmentation maps of the plurality of preset areas through the fourth encoder to obtain the fourth feature map; the fifth encoding unit is configured to encode the first foreground image through the fifth encoder to obtain the fifth feature map; the third decoding unit is configured to decode the fusion map of the fourth feature map and the fifth feature map through the third decoder to obtain the second foreground image.
  • the fifth coding unit is also set to:
  • the training steps of the second generative adversarial network include:
  • the device further includes:
  • the texture enhancement parameter determination module is configured to determine the texture enhancement parameters based on the first foreground image and the second foreground image; the texture enhancement module is configured to perform texture enhancement on the second foreground image based on the texture enhancement parameters and the first foreground image.
  • the texture enhancement parameter determination module is set to:
• the first foreground image is encoded by the sixth encoder to obtain the sixth feature map; the second foreground image is encoded by the seventh encoder to obtain the seventh feature map; the sixth feature map and the seventh feature map are expanded along the channel dimension to obtain the eighth feature map and the ninth feature map respectively; and the correlation matrix of the eighth feature map and the ninth feature map is used as the texture enhancement parameter.
  • the texture enhancement module is configured to:
• the texture enhancement map is determined according to the eighth feature map and the texture enhancement parameters; the fusion map of the texture enhancement map and the ninth feature map is integrated along the channel dimension to obtain the tenth feature map; and the tenth feature map is decoded through the fourth decoder to obtain the texture-enhanced second foreground image.
  • synthesis module 740 is configured to:
• the pose mask map is determined based on the second segmentation map and the key point connection map; the second background image is determined based on the pose mask map and the first background image; and the second foreground image and the second background image are fused.
  • the second object includes a virtual object.
  • the action migration device provided by the embodiments of this application belongs to the same concept as the action migration method provided by the above embodiments.
• Technical details that are not described in detail in the embodiments of this application can be referred to the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
  • FIG10 is a schematic diagram of the hardware structure of a terminal device provided in an embodiment of the present application.
  • the terminal device 900 in the embodiment of the present application may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (Portable Media Players, PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital televisions (TVs), desktop computers, etc.
  • the terminal device 900 shown in FIG10 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
  • the terminal device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform a variety of appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 to a random access memory (RAM) 903.
  • the processing device 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904.
• the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; storage devices 908 including a magnetic tape, a hard disk, etc.; and a communication device 909.
  • the communication device 909 may allow the terminal device 900 to communicate wirelessly or wiredly with other devices to exchange data.
• Although FIG. 10 shows the terminal device 900 having various means, it is not required to implement or have all the illustrated means; more or fewer means may alternatively be implemented or provided.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present application include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 909, or from storage device 908, or from ROM 902.
• When the computer program is executed by the processing device 901, the above-mentioned functions provided by the embodiments of the present application or defined in the action migration method are performed.
• The terminal provided by the embodiment of the present application and the action migration method provided by the above embodiments belong to the same concept; technical details that are not described in detail in this embodiment can be referred to the above embodiments, and this embodiment has the same effect as the above embodiments.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored.
• When the program is executed by a processor, the action migration method provided by the above embodiments is implemented.
  • the computer-readable storage medium mentioned above in the embodiment of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • Examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM) or flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code contained on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communications network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the computer-readable storage medium may be included in the terminal device, or may exist independently without being installed in the terminal device.
  • the computer-readable medium carries one or more programs; when the one or more programs are executed by the terminal device, the terminal device is caused to: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object; generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; generate a second foreground image of the second object in the driving posture based on the second segmentation maps of the plurality of preset areas and the first foreground image of the source image; and fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or can be implemented using a combination of specialized hardware and computer instructions.
  • the units involved in the embodiments of this application can be implemented in software or hardware, where the name of a unit does not constitute a limitation on the unit itself.
  • exemplary hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

Provided in the present application are an action migration method and apparatus, and a terminal device and a storage medium. The method comprises: acquiring a key-point connection graph of a first object in a drive image and a first segmentation image of each preset area of a second object in a source image, wherein the key-point connection graph is used for representing a drive posture of the first object; according to the key-point connection graph and the first segmentation image, generating a second segmentation image, which conforms to the drive posture, of each preset area; generating a second foreground image of the second object in the drive posture according to the second segmentation images of a plurality of preset areas and a first foreground image of the source image; and fusing the second foreground image with a first background image of the source image to obtain an action migration image. By means of the technical solution, a realistic action migration image is obtained.

Description

Action migration method, apparatus, terminal device and storage medium
This application claims priority to Chinese patent application No. 202211154081.1, filed with the China Patent Office on September 21, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and for example to an action migration method and apparatus, a terminal device, and a storage medium.
Background
Action migration refers to generating a new video based on a source image and a driving video, where the new video contains the person from the source image performing the same actions as the person in the driving video.
In the related art, an affine transformation (which may be called a Warp operation) is usually performed on the source image or its encoded feature map according to the driving video to generate the new video.
The related art has at least the following technical problem:
when the postures of the person in the source image and the person in the driving video differ drastically, the affine transformation cannot be performed accurately, so the person in the generated video looks unrealistic, which seriously degrades the user's visual experience.
Summary
This application provides an action migration method and apparatus, a terminal device, and a storage medium, which can better adapt to scenes with drastic posture differences, ensure the realism of the person in the generated video, and improve the user's visual experience.
In a first aspect, this application provides an action migration method, including:
acquiring a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object;
generating, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture;
generating a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image; and
fusing the second foreground image with a first background image of the source image to obtain an action migration image.
In a second aspect, this application provides an action migration apparatus, including:
an image acquisition module configured to acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object;
a first generation module configured to generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture;
a second generation module configured to generate a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image; and
a synthesis module configured to fuse the second foreground image with a first background image of the source image to obtain an action migration image.
In a third aspect, this application provides a terminal device, including:
one or more processors; and
a memory configured to store one or more programs;
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the above action migration method.
In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above action migration method.
Brief Description of the Drawings
FIG. 1 is a flowchart of an action migration method provided by an embodiment of the present application;
FIG. 2 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 3 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 4 is a schematic architecture diagram of a local generation network provided by an embodiment of the present application;
FIG. 5 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 6 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 7 is a schematic architecture diagram of an overall synthesis network provided by an embodiment of the present application;
FIG. 8 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an action migration apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
Detailed Description
The technical solutions of this application are described below through implementations with reference to the drawings in the embodiments of this application; the described embodiments are some of the embodiments of this application. The acquisition, storage, use, and processing of data in the technical solutions of this application comply with the relevant provisions of national laws and regulations.
FIG. 1 is a flowchart of an action migration method provided by an embodiment of the present application. The action migration method provided by this embodiment is applicable to migrating the posture of an object in images and/or videos, for example, human body action migration. The method may be executed by an action migration apparatus, which is implemented in software and/or hardware and, for example, configured in a terminal device such as a computer device.
As shown in FIG. 1, the action migration method provided in this embodiment of the application may include the following steps:
S110: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
In the embodiments of this application, action migration refers to the process of generating a new image based on a source image and a driving image, where the new image contains the second object from the source image performing the same action as the first object in the driving image. The driving image refers to an image with a driving posture. There are multiple driving images, and the actions or postures of the object in the multiple driving images change in association according to a preset order; for example, the driving images may be video frames of a driving video. The first object in the driving image refers to a person or another region of interest, which is not limited here. The key point connection map is a posture graph in which multiple key points of the first object are connected in a predefined manner, and it can be used to characterize the driving posture of the first object, where the key points may correspond to body parts of the first object. The source image refers to an image with the source posture of the second object, and the second object in the source image refers to a person or another region of interest. The first segmentation map refers to the segmentation map of each preset area of the second object, which may include, but is not limited to, body part area segmentation maps and a background area segmentation map; for example, the body parts of the second object may include, but are not limited to, the head, top, bottom, shoes, and limbs, which is not limited here. The number of channels of the first segmentation map can be determined according to the number of divided body parts, for example 18 channels, 6 channels, or 5 channels, which is not limited here.
The terms "first object" and "second object" are used to distinguish the objects in the source image and the driving image, and are not necessarily used to describe a specific order or sequence.
Exemplarily, the first object may be a driving person, the second object may be a source person, and a preset area may be a body part of a person or another region of interest. To represent the posture of the human body, the human body key point detection model OpenPose, an open-source two-dimensional human key point detection model, can be used to predict the driving image and obtain the two-dimensional key points of the driving person; the two-dimensional key points of the driving person are then connected according to the predefined connection scheme to obtain the key point connection map of the driving person, where the key point connection map may be a Red-Green-Blue (RGB) key point connection map and H×W denotes the resolution of the image. To represent the human body layout, this application uses the Self Correction for Human Parsing (SCHP) model to obtain an 18-channel semantic segmentation map of the source image; considering the texture characteristics of different parts of the human body, the 18 channels of the semantic segmentation map are merged into 6 channels, namely head, top, bottom, shoes, limbs, and background, thereby obtaining the first segmentation map of the preset areas of the source person in the source image.
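As a non-authoritative illustration of the channel-merging step, the sketch below collapses an 18-channel one-hot parsing map into the 6 region channels named above; the concrete grouping of parsing labels into regions is an assumption and depends on the label set of the parsing model actually used.

```python
import numpy as np

def merge_parsing_channels(parsing_18: np.ndarray, groups: dict) -> np.ndarray:
    """Merge an 18-channel one-hot parsing map of shape (18, H, W) into
    one channel per region by taking the union of the grouped channels."""
    h, w = parsing_18.shape[1:]
    merged = np.zeros((len(groups), h, w), dtype=parsing_18.dtype)
    for k, channel_ids in enumerate(groups.values()):
        merged[k] = parsing_18[channel_ids].max(axis=0)
    return merged

# Illustrative grouping only; the real channel indices are model-specific.
groups = {
    "head": [1, 2, 3], "top": [4, 5, 6], "bottom": [7, 8],
    "shoes": [9, 10], "limbs": [11, 12, 13, 14], "background": [0],
}
# first_segmentation = merge_parsing_channels(parsing_18, groups)  # (6, H, W)
```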
S120: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
In this embodiment, the second segmentation map refers to the preset-area parsing map of the second object in the driving posture. In other words, the second segmentation map contains the preset areas of the second object that conform to the driving posture.
The key point connection map and the first segmentation map can be input into a pre-trained first neural network model to obtain the second segmentation map, thereby realizing the transformation of the segmentation map of the second object from the source posture to the driving posture. The network structure of the first neural network model is not limited here; for example, the network structure may consist of at least one encoder and at least one decoder.
S130: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image.
In this embodiment, the second foreground image refers to the preset-area foreground image of the second object in the driving posture, where a foreground image refers to the object area of an image excluding the background. In other words, the second foreground image is composed of multiple preset-area foreground images of the second object that conform to the driving posture.
The second segmentation maps of multiple preset areas and the first foreground image of the source image can be input into a pre-trained second neural network model to obtain the preset-area foreground images of the second object in the driving posture, and the preset-area foreground images in the driving posture are combined to obtain the second foreground image, thereby realizing the transformation of the foreground image of the second object from the source posture to the driving posture. The network structure of the second neural network model is not limited here; for example, the network structure may consist of at least one encoder and at least one decoder.
S140: fuse the second foreground image with a first background image of the source image to obtain an action migration image.
In this embodiment, the action migration image refers to an image that takes the background of the source image as its background and contains the second object in the driving posture; that is, the action migration image is an image in which the posture of the first object has been migrated to the second object.
In some embodiments, the foreground of the second foreground image can be embedded at the corresponding position of the first background image of the source image to achieve image fusion. In some embodiments, the first foreground image and the second foreground image can also be texture-aligned first, and the foreground of the texture-aligned second foreground image can then be embedded at the corresponding position of the first background image of the source image to achieve image fusion; the image fusion method is not limited here.
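A minimal compositing sketch of this fusion step, assuming a soft foreground mask in [0, 1] is available alongside the generated foreground:

```python
import numpy as np

def fuse_foreground_background(fg: np.ndarray, fg_mask: np.ndarray,
                               bg: np.ndarray) -> np.ndarray:
    """Embed the generated foreground into the source background.

    fg and bg are (H, W, 3) float images; fg_mask is (H, W) in [0, 1].
    Mask-weighted blending places the foreground at its corresponding
    position and keeps the source background elsewhere."""
    m = fg_mask[..., None]  # broadcast the mask over the color channels
    return m * fg + (1.0 - m) * bg
```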
The action migration method provided by the embodiments of this application uses the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image to obtain a second segmentation map of each preset area that conforms to the driving posture, realizing the transformation of the segmentation map of the second object from the source posture to the driving posture; it generates the second foreground image of the second object in the driving posture according to the second segmentation maps of multiple preset areas and the first foreground image of the source image, realizing the transformation of the foreground image of the second object from the source posture to the driving posture and assigning the texture of the second object in the source image to the second segmentation map; and it fuses the second foreground image with the first background image of the source image to obtain a realistic action migration image. Moreover, this application abandons the original Warp operation, can better adapt to scenes with posture differences, ensures the realism of the person in the generated video, and improves the user's visual experience.
Referring to FIG. 2, FIG. 2 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, after generating the second segmentation map of each preset area that conforms to the driving posture, the method further includes: determining alignment parameters according to the first segmentation map and the second segmentation map; and before generating the second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, the method further includes: transforming the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map.
As shown in FIG. 2, the method of this embodiment may include:
S210: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S220: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S230: determine alignment parameters according to the first segmentation map and the second segmentation map.
S240: transform the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map.
S250: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S260: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the alignment parameters refer to parameters used to align the first foreground image with the second segmentation map. Aligning the first foreground image with the second segmentation map through the alignment parameters can avoid situations in which the source person and the driving person differ greatly in size and spatial position.
In some implementations, determining the alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following: determining a scaling parameter according to the sizes of the preset areas in the first segmentation map and the second segmentation map; and determining a displacement parameter according to the center coordinates of the preset areas in the first segmentation map and the second segmentation map.
The scaling parameter is a parameter that controls how much the image is scaled. A first mask height is determined based on the first segmentation map, and a second mask height is determined based on the second segmentation map; the scaling parameter is determined based on the first mask height and the second mask height. The displacement parameter is a parameter that characterizes the positional offset of the image. A first mask center coordinate is determined based on the first segmentation map, and a second mask center coordinate is determined based on the second segmentation map; the displacement parameter is determined based on the first mask center coordinate and the second mask center coordinate.
Exemplarily, the scaling parameter can be calculated by the following formula:
R = H̃_d / H_s
where R denotes the scaling parameter, H̃_d denotes the height of the human body mask in the second segmentation map, and H_s denotes the height of the human body mask in the first segmentation map.
Similarly, the displacement parameter is:
c = [c_x, c_y]^T
where c denotes the displacement parameter, c_x denotes the horizontal difference between the center coordinate of the first mask and the center coordinate of the second mask, and c_y denotes the vertical difference between the center coordinate of the first mask and the center coordinate of the second mask.
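A sketch of how both parameters could be computed from the two person masks, assuming the mask height is taken from the bounding box and the mask center is the bounding-box center (both are assumptions for illustration):

```python
import numpy as np

def alignment_params(src_mask: np.ndarray, drv_mask: np.ndarray):
    """Return the scaling parameter R and displacement parameter c from
    two binary person masks of shape (H, W)."""
    def height_and_center(mask):
        ys, xs = np.nonzero(mask)
        height = ys.max() - ys.min() + 1
        center = np.array([(xs.min() + xs.max()) / 2.0,
                           (ys.min() + ys.max()) / 2.0])
        return height, center

    h_s, center_s = height_and_center(src_mask)
    h_d, center_d = height_and_center(drv_mask)
    R = h_d / h_s             # ratio of the two mask heights
    c = center_d - center_s   # c = [c_x, c_y], center offsets
    return R, c
```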
The first foreground image is transformed through the alignment parameters so that the first foreground image is aligned with the second segmentation map, which lays an image-quality foundation for subsequently generating the second foreground image of the second object in the driving posture based on the second segmentation maps of the multiple preset areas and the first foreground image, and improves the image quality of the second foreground image.
On the basis of the above embodiments, this embodiment of the application adds: determining the alignment parameters according to the first segmentation map and the second segmentation map, and transforming the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 3, FIG. 3 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, the second segmentation map is generated by a first generative adversarial network, and the step of generating the second segmentation map by the first generative adversarial network includes: encoding the first segmentation map by a first encoder to obtain a first feature map; encoding the key point connection map by a second encoder to obtain a second feature map; and decoding a fusion map of the first feature map and the second feature map by a first decoder to obtain the second segmentation map.
As shown in FIG. 3, the method of this embodiment may include:
S310: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S320: encode the first segmentation map by a first encoder to obtain a first feature map.
S330: encode the key point connection map by a second encoder to obtain a second feature map.
S340: decode a fusion map of the first feature map and the second feature map by a first decoder to obtain a second segmentation map.
S350: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S360: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the second segmentation map may be generated by a first generative adversarial network, which may include a first encoder, a second encoder, and a first decoder. The first encoder is used to encode the first segmentation map, the second encoder is used to encode the key point connection map, and the first decoder is used to decode a fusion map of the encoding results of the first encoder and the second encoder, thereby obtaining a clearer and sharper second segmentation map. The terms "first encoder", "second encoder", and "first decoder" are used to distinguish the roles of the encoders and the decoder, and are not necessarily used to describe a specific order or sequence.
In some implementations, if the driving image is a video frame, the method may further include: acquiring historical second segmentation maps corresponding to a preset number of video frames preceding the current video frame; and encoding at least one historical second segmentation map by a third encoder to obtain a third feature map. Correspondingly, decoding the fusion map of the first feature map and the second feature map by the first decoder to obtain the second segmentation map includes: decoding a fusion map of the first feature map, the second feature map, and the third feature map by the first decoder to obtain the second segmentation map.
A historical second segmentation map refers to one or more second segmentation maps preceding the current video frame. The historical second segmentation maps are input into the first generative adversarial network so that the first generative adversarial network can effectively extract the relationships between different video frames, thereby improving the temporal consistency of the video.
On the basis of the above embodiments, after obtaining the third feature map, the method further includes: decoding a fusion map of the second feature map and the third feature map by a second decoder to obtain optical flow parameters and weight parameters; and after obtaining the second segmentation map, the method further includes: adjusting the second segmentation map according to the historical second segmentation map corresponding to the video frame preceding the current video frame, the optical flow parameters, and the weight parameters.
In this implementation, the second decoder is used to decode the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and the weight parameters, where an optical flow parameter refers to the instantaneous velocity of pixel motion of a moving object on the observed imaging plane.
Exemplarily, FIG. 4 is a schematic architecture diagram of a local generation network provided by an embodiment of the present application. The local generation network is used to generate an image of each region of the source person in the driving posture. The local generation network includes the first generative adversarial network, which may be a Layout generative adversarial network (GAN) with a vid2vid framework. The first generative adversarial network includes three encoders and two decoders. The first encoder E_L^l, where the superscript l identifies modules belonging to the Layout GAN network, encodes the first segmentation map L_s to obtain the first feature map. The second encoder E_K^l encodes the key point connection maps {K_d^t, K_d^{t-1}, K_d^{t-2}} concatenated along the channel dimension to obtain the second feature map, where t denotes the current moment, and t-1 and t-2 denote the two consecutive historical moments before the current moment. The third encoder E_H^l encodes the two historical second segmentation maps {L̃_d^{t-1}, L̃_d^{t-2}} generated at previous moments to obtain the third feature map. The first decoder D_L^l decodes the summed features to obtain the raw result L̂_d^t; similarly, the second decoder D_O^l decodes the summed features to obtain the optical flow parameter O and its weight parameter w, from which the final second segmentation map L̃_d^t is obtained. The Layout GAN can be formulated as:
L̂_d^t = D_L^l( E_L^l(L_s) + E_K^l({K_d^t, K_d^{t-1}, K_d^{t-2}}) + E_H^l({L̃_d^{t-1}, L̃_d^{t-2}}) )
(O, w) = D_O^l( E_K^l({K_d^t, K_d^{t-1}, K_d^{t-2}}) + E_H^l({L̃_d^{t-1}, L̃_d^{t-2}}) )
L̃_d^t = w * Warp(L̃_d^{t-1}, O) + (1 - w) * L̂_d^t
where + denotes point-to-point addition, * denotes point-to-point multiplication, and {,} denotes that the inputs inside the braces are concatenated along the channel dimension. Warp(I, O) denotes the affine transformation of the image I according to the optical flow parameter O. The optical flow and Warp operations used in this implementation are based on adjacent frames and are intended to improve the temporal consistency of the generated video; they are not based on a transformation between the source image and the driving image.
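The flow-weighted blending in the last formula can be sketched as follows; the sketch assumes PyTorch tensors, a flow map holding per-pixel offsets, and a weight map w in [0, 1] (all shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def blend_with_previous(raw_layout: torch.Tensor,
                        prev_layout: torch.Tensor,
                        flow: torch.Tensor,
                        w: torch.Tensor) -> torch.Tensor:
    """Final layout = w * Warp(previous layout, flow) + (1 - w) * raw layout.

    raw_layout, prev_layout: (N, C, H, W); flow: (N, 2, H, W) pixel offsets;
    w: (N, 1, H, W) in [0, 1]."""
    n, _, h, wid = raw_layout.shape
    # Build a normalized sampling grid shifted by the optical flow.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(wid), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(flow)  # (1,2,H,W)
    coords = base + flow
    grid = torch.stack(
        (2.0 * coords[:, 0] / (wid - 1) - 1.0,
         2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)                  # (N,H,W,2)
    warped = F.grid_sample(prev_layout, grid, align_corners=True)
    return w * warped + (1.0 - w) * raw_layout
```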
In some implementations, the training of the first generative adversarial network may include: acquiring a third segmentation map of each preset area of the first object in a sample driving image; determining a first loss between the second segmentation map corresponding to a sample source image and the third segmentation map corresponding to the sample driving image; and training the first generative adversarial network according to the first loss.
The third segmentation map refers to the segmentation map of the preset areas of the first object in the driving image.
The first generative adversarial network can be trained in advance on multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data, that is, the person in a sample driving image and the person in the corresponding sample source image are the same person; for example, a front-facing human body video frame of a video is selected as the sample source image, and the video serves as the sample driving video. The front-facing human body video frame is selected as the source image because it contains more appearance details of the source person. When training the first generative adversarial network, the first loss may be a cross-entropy loss: the cross-entropy loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image is computed, and the network parameters of the first generative adversarial network are adjusted based on the cross-entropy loss so that it gradually decreases and stabilizes until training is complete, yielding the first generative adversarial network.
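A minimal sketch of one such training step under the cross-entropy loss; layout_gan is a hypothetical module returning per-region logits of shape (N, 6, H, W), and target_labels holds the region indices derived from the sample driving image's third segmentation map:

```python
import torch.nn.functional as F

def train_step(layout_gan, optimizer, source_layout, keypoint_maps, target_labels):
    """One optimization step for the first generative adversarial network."""
    optimizer.zero_grad()
    logits = layout_gan(source_layout, keypoint_maps)   # (N, 6, H, W)
    loss = F.cross_entropy(logits, target_labels)       # target: (N, H, W)
    loss.backward()
    optimizer.step()
    return loss.item()
```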
On the basis of the above embodiments, this embodiment of the application refines the technical features of generating the second segmentation map. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 5, FIG. 5 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, the second foreground image is generated by a second generative adversarial network, and the step of generating the second foreground image by the second generative adversarial network may include: encoding the second segmentation maps of the plurality of preset areas by a fourth encoder to obtain a fourth feature map; encoding the first foreground image by a fifth encoder to obtain a fifth feature map; and decoding a fusion map of the fourth feature map and the fifth feature map by a third decoder to obtain the second foreground image.
As shown in FIG. 5, the method of this embodiment may include:
S410: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S420: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S430: encode the second segmentation maps of the plurality of preset areas by a fourth encoder to obtain a fourth feature map.
S440: encode the first foreground image by a fifth encoder to obtain a fifth feature map.
S450: decode a fusion map of the fourth feature map and the fifth feature map by a third decoder to obtain a second foreground image.
S460: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the second foreground image is generated by a second generative adversarial network, which may include a fourth encoder, a fifth encoder, and a third decoder. The fourth encoder is used to encode the second segmentation maps of the plurality of preset areas, the fifth encoder is used to encode the first foreground image, and the third decoder is used to decode the fusion map of the fourth feature map and the fifth feature map, thereby obtaining the second foreground image. The terms "fourth encoder", "fifth encoder", and "third decoder" are used to distinguish the roles of the encoders and the decoder, and are not necessarily used to describe a specific order or sequence.
On the basis of the above embodiments, if the driving image is a video frame, encoding the first foreground image by the fifth encoder to obtain the fifth feature map may include: acquiring historical second foreground images corresponding to a preset number of video frames preceding the current video frame; and encoding a fusion map of the first foreground image and at least one historical second foreground image by the fifth encoder to obtain the fifth feature map.
A historical second foreground image refers to one or more second foreground images preceding the current video frame. The historical second foreground images are input into the second generative adversarial network so that the second generative adversarial network can effectively extract the relationships between different video frames, thereby improving the temporal consistency of the video.
Exemplarily, Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application; the local generation network also includes a second generation adversarial network. The second generation adversarial network in this embodiment can be a Region with a vid2vid framework. GAN network, Region GAN is only used to generate the initial region image, so this embodiment uses a generator to generate 5 regions of the human body (no background region is generated). Doing so not only saves computing resources, but also prevents the model from overfitting. The second generative adversarial network may include a fourth encoder fifth encoder and third decoder Among them, r is the identifier of Region GAN, which is used to indicate that it belongs to the Region GAN network. via fourth encoder For the current time t, the mask of the i-th preset area Encode to obtain the fourth feature map via fifth encoder Codes spliced along the channel dimension Get the fifth feature map in, Represents the historical second foreground image before the current time t, I s,i represents the first foreground image of the i-th preset region, so the original Region GAN can be expressed as:
其中,表示第三解码器,由于源人物和驱动人物在大小以及空间位置上可能差异巨大,因此编码得到的特征图存在不对齐的情况。为此,本申请提出使用全局对齐模块(GAM)对第一前景图像FGs进行仿射变换以匹配对应的第二分割图。首先通过Ls计算人体掩膜Ms进而可以得到源图像Is的第一前景图像FGs。全局对齐模块整个流程用以下公式表示:
in, Represents the third decoder. Since the source character and the driver character may differ greatly in size and spatial position, the encoded feature map and There is a misalignment. To this end, this application proposes to use a global alignment module (GAM) to perform affine transformation on the first foreground image FG s to match the corresponding second segmentation map. First through L s and Calculate the human mask M s and Then the first foreground image FG s of the source image I s can be obtained. The entire process of the global alignment module is expressed by the following formula:
其中,表示经全局对齐模块对齐后的第一前景图像。R表示缩放参数。对分割可以得到不同预设区域的Is,i。c=[cx,cy]T表示位移参数。因此最终的Region GAN可以表述为:
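A sketch of the GAM-style alignment as an affine warp, assuming PyTorch and that R and c are given in pixels; the exact sign and normalization conventions of the affine matrix are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def global_align(fg_s: torch.Tensor, R: float, c) -> torch.Tensor:
    """Scale the source foreground (N, 3, H, W) by R and translate it by
    c = (c_x, c_y) pixels so it roughly matches the driving-pose layout."""
    n, _, h, w = fg_s.shape
    cx, cy = float(c[0]), float(c[1])
    # grid_sample uses an inverse mapping in normalized [-1, 1] coordinates.
    theta = torch.tensor([[1.0 / R, 0.0, -2.0 * cx / (w - 1)],
                          [0.0, 1.0 / R, -2.0 * cy / (h - 1)]],
                         dtype=fg_s.dtype, device=fg_s.device)
    grid = F.affine_grid(theta.unsqueeze(0).expand(n, -1, -1),
                         list(fg_s.shape), align_corners=True)
    return F.grid_sample(fg_s, grid, align_corners=True)

# One shared generator is then run once per body region (5 regions, no background).
```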
in, Represents the first foreground image aligned by the global alignment module. R represents the scaling parameter. right Segmentation can obtain I s,i of different preset areas. c=[c x , cy ] T represents the displacement parameter. Therefore, the final Region GAN can be expressed as:
On the basis of the above embodiments, the training of the second generative adversarial network may include: acquiring a third segmentation map of each preset area of the first object in a sample driving image; determining a second loss between the second foreground image corresponding to a sample source image and the foreground ground-truth map corresponding to the sample source image; determining a third loss between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the second generative adversarial network according to the second loss and the third loss.
In this embodiment, the second loss may include a reconstruction loss and a perceptual loss; the third loss may be an adversarial loss, that is, an image distribution loss.
The second generative adversarial network can be trained in advance on multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data. When training the second generative adversarial network, the reconstruction loss and the perceptual loss are computed between the second foreground image corresponding to the sample source image and the foreground ground-truth map corresponding to the sample source image, and the adversarial loss is computed between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; the network parameters of the second generative adversarial network are adjusted based on the reconstruction loss, the perceptual loss, and the adversarial loss so that they gradually decrease and stabilize until training is complete, yielding the second generative adversarial network.
Exemplarily, this embodiment adopts the L1 reconstruction loss; compared with the L2 reconstruction loss, the L1 reconstruction loss pays more attention to the subtle differences between the generated image and the real image. It is calculated as:
L_rec = || FG̃_{d,i}^t - FG_{d,i}^t ||_1
where FG̃_{d,i}^t denotes the predicted value of the second foreground image of the i-th region in the driving posture at moment t, and FG_{d,i}^t denotes the actual value of the corresponding region of the driving image in the driving posture at moment t.
The perceptual loss is used to constrain the generated image and the real image to be close in a multi-dimensional feature space. The perceptual loss includes a feature content loss and a feature style loss, and can be expressed as:
L_per = Σ_j || φ_j(FG̃_{d,i}^t) - φ_j(FG_{d,i}^t) ||_1 + Σ_j || G(φ_j(FG̃_{d,i}^t)) - G(φ_j(FG_{d,i}^t)) ||_1
where φ_j denotes the j-th layer of the pre-trained Visual Geometry Group (VGG)-19 model, and G denotes computing the Gram matrix of a feature map.
The purpose of the adversarial loss is to make the synthesized image have a distribution similar to that of the real image. To make the network pay attention to multi-scale image details, this application uses the multi-scale conditional discriminator proposed in pix2pixHD, which takes the synthesized image and the corresponding region mask as input. Its expression is:
L_adv = E[ log D(FG_{d,i}^t, L̃_{d,i}^t) ] + E[ log(1 - D(FG̃_{d,i}^t, L̃_{d,i}^t)) ]
where D denotes the multi-scale conditional discriminator.
Therefore, the loss function of the generator is as follows:
L_G = L_adv + λ_rec · L_rec + λ_per · L_per
where λ_rec and λ_per are the weights of the reconstruction loss and the perceptual loss, respectively.
In addition, the training of the first generative adversarial network and the second generative adversarial network includes training the discriminator; for the discriminator, the loss function is:
L_D = -E[ log D(FG_{d,i}^t, L̃_{d,i}^t) ] - E[ log(1 - D(FG̃_{d,i}^t, L̃_{d,i}^t)) ]
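The loss terms above can be sketched as follows; vgg_feats is a hypothetical helper returning a list of VGG-19 feature maps, the discriminator is reduced to a single scale for brevity, and the weight values are assumptions:

```python
import torch
import torch.nn.functional as F

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map: (N, C, H, W) -> (N, C, C)."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(fake, real, fake_logits, vgg_feats,
                   lam_rec=10.0, lam_per=10.0):
    l_rec = F.l1_loss(fake, real)                        # L1 reconstruction
    l_per = sum(F.l1_loss(a, b) + F.l1_loss(gram(a), gram(b))
                for a, b in zip(vgg_feats(fake), vgg_feats(real)))
    l_adv = F.binary_cross_entropy_with_logits(          # fool the discriminator
        fake_logits, torch.ones_like(fake_logits))
    return l_adv + lam_rec * l_rec + lam_per * l_per

def discriminator_loss(real_logits, fake_logits):
    return (F.binary_cross_entropy_with_logits(real_logits,
                                               torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits,
                                                 torch.zeros_like(fake_logits)))
```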
In addition, the training process of the first generative adversarial network and the second generative adversarial network includes the training of the discriminator. For the discriminator, the loss function is:
On the basis of the above embodiments, this embodiment of the application adds the detailed features of determining the second foreground image. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 6, FIG. 6 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, after generating the second foreground image of the second object in the driving posture, the method further includes: determining texture enhancement parameters according to the first foreground image and the second foreground image; and performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
As shown in FIG. 6, the method of this embodiment may include:
S510: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S520: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S530: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S540: determine texture enhancement parameters according to the first foreground image and the second foreground image.
S550: perform texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
S560: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
本实施例中,纹理增强参数是指用于增强图像纹理的调节参数,通过纹理增强参数将第一前景图的特征与第二前景图的特征进行对齐,以保留更多的细节,如衣服的纹理和身体的边缘等。In this embodiment, the texture enhancement parameters refer to adjustment parameters used to enhance the image texture. The texture enhancement parameters are used to align the features of the first foreground image with the features of the second foreground image so as to retain more details, such as the texture of clothing and the edges of the body.
在一些实施方式中,根据第一前景图与第二前景图,确定纹理增强参数,可以包括:通过第六编码器对第一前景图进行编码,得到第六特征图;通过第七编码器对第二前景图进行编码,得到第七特征图;将第六特征图和第七特征图按通道展开,分别得到第八特征图和第九特征图;将第八特征图和第九特征图的相关性矩阵,作为纹理增强参数。In some embodiments, determining the texture enhancement parameters according to the first foreground image and the second foreground image may include: encoding the first foreground image through a sixth encoder to obtain a sixth feature map; encoding the second foreground image through a seventh encoder to obtain a seventh feature map; unfolding the sixth feature map and the seventh feature map per channel to obtain an eighth feature map and a ninth feature map, respectively; and taking the correlation matrix of the eighth feature map and the ninth feature map as the texture enhancement parameter.
示例性地,图7为本申请实施例提供的一种整体合成网络的架构示意图,整体合成网络用于将局部生成网络生成的不同区域图像进行整合,生成最终的动作迁移图像,与此同时为生成的动作迁移图像添加合适的背景图像,整体合成网络的损失为生成的动作迁移图像和驱动图像之间的损失。通过第六编码器对第一前景图进行编码,得到第六特征图F1;通过第七编码器对第二前景图(包含当前时刻的第二前景图以及当前时刻之前的历史第二前景图)进行编码,得到第七特征图F2。由于第二前景图和第一前景图具有相同的纹理、不同的姿态,因此本实施例提出纹理对齐模块(Texture Alignment Module,TAM)以更好地融合特征图。如图7所示,首先将特征图F1、F2分别按通道展开为H1和H2(维度为c×hw),c表示通道数。然后使用余弦距离计算这两个特征图间的相关性矩阵(即纹理增强参数),计算公式如下:
Exemplarily, Figure 7 is a schematic architectural diagram of an overall synthesis network provided by an embodiment of this application. The overall synthesis network integrates the region images generated by the local generation network into the final action migration image and, at the same time, adds an appropriate background image to the generated action migration image; the loss of the overall synthesis network is the loss between the generated action migration image and the driving image. The first foreground image is encoded by the sixth encoder to obtain a sixth feature map F1; the second foreground images (the second foreground image at the current moment together with the historical second foreground images before the current moment) are encoded by the seventh encoder to obtain a seventh feature map F2. Since the second foreground image has the same texture as the first foreground image but a different pose, this embodiment proposes a Texture Alignment Module (TAM) to better fuse the feature maps. As shown in Figure 7, the feature maps F1 and F2 are first unfolded per channel into H1 and H2 (of dimension c×hw), where c denotes the number of channels. The cosine distance is then used to compute the correlation matrix between the two feature maps (i.e., the texture enhancement parameter), calculated as follows:
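A plausible reconstruction of this cosine-similarity correlation, consistent with the symbols H_{1,i} and H_{2,j} explained in the next sentence, is:

C_{i,j} = \frac{H_{1,i} \cdot H_{2,j}^{\top}}{\left\| H_{1,i} \right\| \left\| H_{2,j} \right\|}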
其中,H1,i表示H1在i处的特征,同样地,H2,j表示H2在j处的特征。·表示矩阵相乘。Among them, H1,i represents the feature of H1 at position i, and similarly, H2,j represents the feature of H2 at position j. · denotes matrix multiplication.
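As a minimal PyTorch sketch of the correlation computation above (the function and tensor names are illustrative assumptions, not identifiers from this application):

import torch
import torch.nn.functional as F

def texture_alignment_correlation(f1, f2):
    # f1: sixth feature map (from the source foreground), f2: seventh
    # feature map (from the generated foregrounds); both (B, c, h, w).
    b, c, h, w = f1.shape
    h1 = f1.view(b, c, h * w)        # unfold per channel: (B, c, h*w)
    h2 = f2.view(b, c, h * w)
    h1 = F.normalize(h1, dim=1)      # L2-normalize each position so that
    h2 = F.normalize(h2, dim=1)      # the dot product equals cosine similarity
    # (B, h*w, h*w) correlation matrix, entry (i, j) = cos(H1_i, H2_j)
    return torch.bmm(h1.transpose(1, 2), h2)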
在一些实施方式中,根据纹理增强参数和第一前景图,对第二前景图进行纹理增强,可以包括:根据第八特征图和纹理增强参数,确定纹理增强图;将纹理增强图和第九特征图的融合图按通道整合,得到第十特征图;通过第四解码器对第十特征图进行解码,得到纹理增强后的第二前景图。In some embodiments, performing texture enhancement on the second foreground image according to the texture enhancement parameter and the first foreground image may include: determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter; integrating the fusion of the texture enhancement map and the ninth feature map along the channel dimension to obtain a tenth feature map; and decoding the tenth feature map through a fourth decoder to obtain the texture-enhanced second foreground image.
示例性地,通过以下公式得到第十特征图
Illustratively, the tenth feature map is obtained by the following formula
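A plausible reconstruction, assuming the texture enhancement map is obtained by re-weighting the eighth feature map H1 with the correlation matrix C and concatenating it with the ninth feature map H2 along the channel dimension ([· ; ·] denotes channel-wise concatenation; these symbols are assumptions):

H_{10} = \big[\, H_1 \cdot C \;;\; H_2 \,\big]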
最后,通过第四解码器得到纹理增强后的第二前景图。因此,整个纹理增强后的第二前景图生成过程可以公式化为:
Finally, the texture-enhanced second foreground image is obtained through the fourth decoder. Therefore, the generation of the texture-enhanced second foreground image can be formulated as:
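Under the same assumptions, with D4 denoting the fourth decoder, a plausible form is:

\tilde{P} = D_4\big(\big[\, H_1 \cdot C \;;\; H_2 \,\big]\big)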
本申请实施例在上述实施例基础上,增加了纹理增强的技术特征,以实现更好地融合特征图。此外,本申请实施例与上述实施例提出的动作迁移方法属于同一构思,未在本实施例中详尽描述的技术细节可参见上述实施例,且本实施例与上述实施例具有相同的效果。Based on the above embodiments, the embodiments of the present application add the technical feature of texture enhancement to achieve better fusion of feature maps. In addition, the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept. Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
参考图8,图8为本申请实施例提供的另一种动作迁移方法的流程图,本实施例的方法可以与上述实施例提供的动作迁移方法中的多个方案结合。本实施例对所提供的动作迁移方法进行说明。将第二前景图与源图像的第一背景图进行融合,可以包括:根据第二分割图和关键点连接图,确定姿态掩膜图;根据姿态掩膜图和第一背景图,确定第二背景图;将第二前景图与第二背景图进行融合。Referring to Figure 8, Figure 8 is a flow chart of another action migration method provided by an embodiment of this application. The method of this embodiment can be combined with multiple solutions of the action migration methods provided in the above embodiments. Fusing the second foreground image with the first background image of the source image may include: determining a pose mask map based on the second segmentation map and the key point connection map; determining a second background image based on the pose mask map and the first background image; and fusing the second foreground image with the second background image.
如图8,本实施例的方法可以包括:As shown in Figure 8, the method in this embodiment may include:
S610、获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态。 S610. Obtain the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; the key point connection diagram is used to represent the driving posture of the first object.
S620、根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图。S620: Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
S630、根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图。S630. Generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
S640、根据第二分割图和关键点连接图,确定姿态掩膜图。S640. Determine a pose mask map based on the second segmentation map and the key point connection map.
S650、根据姿态掩膜图和第一背景图,确定第二背景图。S650. Determine the second background image based on the attitude mask image and the first background image.
S660、将第二前景图与第二背景图进行融合,得到动作迁移图像。S660: Fusion of the second foreground image and the second background image to obtain a motion migration image.
本实施例中,在得到驱动姿态下第二对象的第二前景图之后,需要为其添加一个合理的背景图,以得到更加真实的动作迁移图像。In this embodiment, after obtaining the second foreground image of the second object in the driving posture, it is necessary to add a reasonable background image to obtain a more realistic motion migration image.
将第二分割图和关键点连接图输入至第八编码器,得到第十一特征图,通过第五解码器对第十一特征图进行解码,得到姿态掩膜图,其中,姿态掩膜图是指包含第一对象姿态的软掩码图;通过姿态掩膜图对第一背景图对应位置进行遮蔽,以得到遮蔽了第一对象的第二背景图。The second segmentation map and the key point connection map are input to the eighth encoder to obtain an eleventh feature map, and the eleventh feature map is decoded by the fifth decoder to obtain a pose mask map, where the pose mask map refers to a soft mask map containing the pose of the first object; the corresponding position of the first background image is masked by the pose mask map to obtain a second background image in which the first object is masked.
在一些实施方式中,第五解码器对第七特征图和第十一特征图的融合图进行解码,得到姿态掩膜图。采用融合图的好处在于可以优化轮廓和姿态,得到更加准确的轮廓图。In some implementations, the fifth decoder decodes the fusion map of the seventh feature map and the eleventh feature map to obtain the pose mask map. The advantage of using fusion maps is that the outline and posture can be optimized to obtain a more accurate outline map.
示例性地,使用第八编码器对第二分割图和关键点连接图进行编码,得到特征图,其中,第二分割图包括当前时刻t的第二分割图和当前时刻t之前的两个历史第二分割图;关键点连接图包括当前时刻t的关键点连接图和当前时刻t之前的两个历史关键点连接图。将相加后的特征送入第五解码器,得到姿态掩膜图m,最后使用下面的公式得到最终的动作迁移图像:
Exemplarily, the eighth encoder is used to encode the second segmentation maps and the key point connection maps to obtain feature maps, where the second segmentation maps include the second segmentation map at the current time t and the two historical second segmentation maps before the current time t, and the key point connection maps include the key point connection map at the current time t and the two historical key point connection maps before the current time t. The summed features are fed into the fifth decoder to obtain a pose mask map m, and finally the final action migration image is obtained using the following formula:
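A plausible reconstruction of this fusion, assuming m is the pose mask, \tilde{P}_t the texture-enhanced second foreground image, \tilde{B} the second background image, and \odot element-wise multiplication (symbol names beyond m are assumptions), is:

\hat{I}_t = m \odot \tilde{P}_t + (1 - m) \odot \tilde{B}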
在上述实施例的基础上,第二对象包括虚拟对象。其中,虚拟对象可以为数字人、虚拟客服、虚拟主播。第二对象也可以为真实人物对象。On the basis of the above embodiments, the second object includes a virtual object. The virtual object may be a digital human, a virtual customer service agent, or a virtual streamer. The second object may also be a real person.
示例性地,本实施例的动作迁移方法可应用于数字人、虚拟客服、虚拟主播的动作驱动和生成,从而提高虚拟人物动作的逼真度和丰富度,提高用户的使用体验。Illustratively, the action migration method of this embodiment can be applied to the action driving and generation of digital people, virtual customer service, and virtual anchors, thereby improving the fidelity and richness of virtual character actions and improving the user experience.
此外,本申请还设计了一个专门针对人脸区域的鉴别器,以生成真实的人脸。训练过程中,将源图像和动作迁移图像的头像输入判别器,训练模型,促使生成器生成的前景人脸更真实。In addition, this application also designs a discriminator specifically for the face region to generate realistic faces. During training, the head images of the source image and the action migration image are input into the discriminator to train the model, prompting the generator to generate more realistic foreground faces.
示例性地,给定源图像I_s和驱动视频{I_t^d},t=1,…,N,其中,I_t^d表示驱动视频在t时刻的视频帧。本实施例的目标是生成一个新的视频,其中,源图像中的人物在做驱动视频中人物的动作。整个方案可以公式化为:
Exemplarily, given a source image I_s and a driving video {I_t^d}, t = 1, …, N, where I_t^d denotes the video frame of the driving video at time t, the goal of this embodiment is to generate a new video in which the person in the source image performs the actions of the person in the driving video. The entire scheme can be formulated as:
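A plausible reconstruction of the overall formulation, consistent with the symbols explained in the next sentence, is:

\{\hat{I}_t\}_{t=1}^{N} = \big\{\, F(I_s,\, I_t^{d}) \,\big\}_{t=1}^{N}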
其中,F(·,·)表示本实施例中的生成模型,N表示驱动视频帧的个数,Î_t表示生成的目标视频帧,其中源人物的姿态与驱动人物保持一致。Among them, F(·,·) represents the generation model in this embodiment, N represents the number of driving video frames, and Î_t represents the generated target video frame, in which the pose of the source person is consistent with that of the driving person.
在训练阶段,为了得到成对的训练数据作为监督信息,本实施例选择一段视频中的正向人体视频帧作为源图像,该视频作为驱动视频。选择正向人体视频帧作为源图像的原因是它包含了更多源人物的外观细节。本实施例采取分步骤训练的策略。首先,分别对Layout GAN和Region GAN进行10轮的训练。然后利用Region GAN的输出对整体合成网络进行10轮的训练。In the training phase, in order to obtain paired training data as supervision information, this embodiment selects a forward human video frame in a video as the source image, and this video is used as the driving video. The reason for choosing the forward human video frame as the source image is that it contains more appearance details of the source character. This embodiment adopts a step-by-step training strategy. First, Layout GAN and Region GAN were trained for 10 rounds respectively. Then the output of Region GAN is used to train the overall synthesis network for 10 rounds.
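A minimal sketch of this staged schedule (the model objects and their train_step/generate methods are hypothetical placeholders, not APIs from this application):

def staged_training(layout_gan, region_gan, synthesis_net, loader, epochs=10):
    # Stage 1: train Layout GAN and Region GAN for 10 epochs each.
    for _ in range(epochs):
        for batch in loader:
            layout_gan.train_step(batch)       # hypothetical per-batch update
            region_gan.train_step(batch)
    # Stage 2: train the overall synthesis network for 10 epochs
    # on the region images produced by the trained Region GAN.
    for _ in range(epochs):
        for batch in loader:
            regions = region_gan.generate(batch)   # hypothetical inference call
            synthesis_net.train_step(batch, regions)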
在推断阶段,驱动视频的选择不受限制,只要是任何单人清晰的运动视频即可。并且本实施例可以执行端到端推断。In the inference stage, the selection of the driving video is not restricted, as long as it is any clear motion video of a single person. And this embodiment can perform end-to-end inference.
本实施例提出的生成框架在两个公开数据集上(iPER和SoloDance数据集)都取得了最优或是相当的结果。两个数据集上的实验结果分别如表1和表2所示,其中结构相似性(Structural Similarity,SSIM)和峰值信噪比(Peak Signal to Noise Ratio,PSNR)是基于相似性的评价指标,其值越大表示生成图像质量越好。学习感知图像块相似性(Learned Perceptual Image Patch Similarity,LPIPS)和Fréchet Inception距离(Fréchet Inception Distance,FID)是基于特征距离的评价指标,其值越小表示生成图像质量越好。时间一致模式(Temporally Consistent Mode,TCM)用于评价生成视频的时序一致性,越大越好。The generation framework proposed in this embodiment achieves the best or comparable results on two public datasets (the iPER and SoloDance datasets). The experimental results on the two datasets are shown in Tables 1 and 2, respectively, where Structural Similarity (SSIM) and Peak Signal to Noise Ratio (PSNR) are similarity-based evaluation metrics, for which larger values indicate better generated image quality. Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) are feature-distance-based evaluation metrics, for which smaller values indicate better generated image quality. The Temporally Consistent Mode (TCM) is used to evaluate the temporal consistency of the generated video; the larger, the better.
表1 数据集iPER上的实验结果
Table 1 Experimental results on the iPER dataset
表2 数据集SoloDance上的实验结果
Table 2 Experimental results on the SoloDance dataset
从表1可以看出,本申请实施例的实验效果在iPER数据集上所有评价指标均达到了最优。从表2可以看出,虽然在SSIM和PSNR指标上没有取得最优,但是在对应的Mask-SSIM和Mask-PSNR上取得了最优值(括号里的值,即通过人体掩膜将图像的背景区域置为0后计算的SSIM和PSNR指标)。这表明本申请的方法生成的人体图像质量与C2F相比更好。As can be seen from Table 1, the method of the embodiment of this application achieves the best results on all evaluation metrics on the iPER dataset. As can be seen from Table 2, although it is not optimal on the SSIM and PSNR metrics, it achieves the optimal values on the corresponding Mask-SSIM and Mask-PSNR metrics (the values in brackets, i.e., the SSIM and PSNR metrics computed after setting the background area of the image to 0 using the human body mask). This shows that the quality of the human body images generated by the method of this application is better than that of C2F.
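A minimal sketch of the Mask-SSIM / Mask-PSNR protocol described above, using scikit-image (the array shapes and names are illustrative assumptions):

import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def masked_metrics(generated, real, body_mask):
    # generated, real: uint8 RGB frames of shape (H, W, 3);
    # body_mask: binary human-body mask of shape (H, W), 1 on the person.
    g = generated * body_mask[..., None]   # zero out the background
    r = real * body_mask[..., None]
    ssim = structural_similarity(g, r, channel_axis=-1)
    psnr = peak_signal_noise_ratio(r, g)
    return ssim, psnr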
与其他方法相比,本申请可以更好地处理姿态剧烈变化的情况,同时保留源人物的外观细节。此外,本申请生成的动作迁移图像一般有更清晰的面部细节。这是由于本申请中的渐进式生成模型,其中初始的人脸区域图像为最终的清晰人脸提供了一个重要的模板。Compared with other methods, this application can better handle the situation of drastic changes in posture while retaining the appearance details of the source character. In addition, the motion transfer images generated by this application generally have clearer facial details. This is due to the progressive generative model in this application, where the initial face region image provides an important template for the final clear face.
本申请实施例在上述实施例基础上,增加了确定第二背景图的技术细节特征。此外,本申请实施例与上述实施例提出的动作迁移方法属于同一构思,未在本实施例中详尽描述的技术细节可参见上述实施例,且本实施例与上述实施例具有相同的效果。The embodiment of the present application adds the technical details of determining the second background image based on the above embodiment. In addition, the action migration method proposed in the embodiment of the present application and the above embodiment belongs to the same concept, and the technical details not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment has the same effect as the above embodiment.
图9为本申请实施例提供的一种动作迁移装置结构示意图,本申请实施例可适用于对图像或视频中对象进行动作迁移的情况,例如人体动作迁移的情况。通过本申请提供的动作迁移装置,可实现上述实施例提供的动作迁移方法。FIG. 9 is a schematic structural diagram of a motion migration device provided by an embodiment of the present application. The embodiment of the present application may be applicable to motion migration of objects in images or videos, such as human body motion migration. Through the action migration device provided by this application, the action migration method provided in the above embodiments can be implemented.
如图9所示,本申请实施例中动作迁移装置,可以包括:As shown in Figure 9, the action migration device in the embodiment of the present application may include:
图像获取模块710,设置为获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态;第一生成模块720,设置为根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图;第二生成模块730,设置为根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图;合成模块740,设置为将第二前景图与源图像的第一背景图进行融合,得到动作迁移图像。The image acquisition module 710 is configured to acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, the key point connection map being used to characterize the driving posture of the first object; the first generation module 720 is configured to generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; the second generation module 730 is configured to generate, according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture; and the synthesis module 740 is configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
在一些实施方式中,动作迁移装置,还包括:In some embodiments, the action migration device further includes:
对齐参数确定模块,设置为根据第一分割图和第二分割图确定对齐参数;图像对齐模块,设置为根据对齐参数对第一前景图进行变换,以使第一前景图与第二分割图对齐。The alignment parameter determination module is configured to determine the alignment parameters based on the first segmentation map and the second segmentation map; the image alignment module is configured to transform the first foreground image based on the alignment parameters to align the first foreground image with the second segmentation map. .
在一些实施方式中,根据第一分割图和第二分割图确定对齐参数,包括下述至少一项:In some embodiments, the alignment parameters are determined according to the first segmentation map and the second segmentation map, including at least one of the following:
根据第一分割图和第二分割图中预设区域的尺寸,确定缩放参数;根据第一分割图和第二分割图中预设区域的中心坐标,确定位移参数。The scaling parameter is determined according to the size of the preset area in the first segmentation map and the second segmentation map; the displacement parameter is determined based on the center coordinates of the preset area in the first segmentation map and the second segmentation map.
在一些实施方式中,第一生成模块720,包括:In some implementations, the first generation module 720 includes:
第一编码单元,设置为通过第一编码器对第一分割图编码,得到第一特征图;第二编码单元,设置为通过第二编码器对关键点连接图编码,得到第二特征图;第一解码单元,设置为通过第一解码器对第一特征图和第二特征图的融合图进行解码,得到第二分割图。The first encoding unit is configured to encode the first segmentation map through the first encoder to obtain the first feature map; the second encoding unit is configured to encode the key point connection map through the second encoder to obtain the second feature map; The first decoding unit is configured to decode the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map.
在一些实施方式中,若驱动图像为视频帧,则装置还包括:In some implementations, if the driving image is a video frame, the device further includes:
历史第二分割图获取模块,设置为获取与当前视频帧的前预设数量个视频帧对应的历史第二分割图;历史分割图编码模块,设置为通过第三编码器对至少一个历史第二分割图进行编码,得到第三特征图;第一解码单元,还设置为:The historical second segmentation map acquisition module is configured to acquire the historical second segmentation map corresponding to the first preset number of video frames of the current video frame; the historical segmentation map encoding module is configured to use a third encoder to encode at least one historical second segmentation map. The segmented map is encoded to obtain the third feature map; the first decoding unit is also set to:
通过第一解码器对第一特征图、第二特征图和第三特征图的融合图进行解码,得到第二分割图。The fusion map of the first feature map, the second feature map and the third feature map is decoded by the first decoder to obtain a second segmentation map.
在一些实施方式中,装置,还设置为:In some embodiments, the device is further configured to:
通过第二解码器对第二特征图和第三特征图的融合图进行解码,得到光流参数和权重参数;根据与当前视频帧的前一视频帧对应的历史第二分割图、光流参数和权重参数,对第二分割图进行调整。The second decoder decodes the fusion map of the second feature map and the third feature map to obtain optical flow parameters and weight parameters; the second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
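As a minimal PyTorch sketch of this flow-based adjustment (the names and the grid construction are illustrative assumptions, not from this application):

import torch
import torch.nn.functional as F

def adjust_segmentation(seg, seg_prev, flow, weight):
    # seg, seg_prev: (B, C, H, W) current and previous second segmentation maps;
    # flow: (B, 2, H, W) optical flow; weight: (B, 1, H, W) blending weights.
    b, _, h, w = seg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(seg.device)  # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)          # displace by flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0            # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(seg_prev, grid, align_corners=True)   # warp previous map
    return weight * warped + (1.0 - weight) * seg                # weighted blend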
在一些实施方式中,第一生成对抗网络的训练步骤,包括:In some embodiments, the training steps of the first generative adversarial network include:
获取样本驱动图像中第一对象的每个预设区域的第三分割图;确定与样本源图像对应的第二分割图和与样本驱动图像对应的第三分割图的第一损失;根据第一损失,对第一生成对抗网络进行训练。Obtain a third segmentation map of each preset area of the first object in the sample driving image; determine a first loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and train the first generative adversarial network according to the first loss.
在一些实施方式中,第二生成模块730,包括: In some implementations, the second generation module 730 includes:
第四编码单元,设置为通过第四编码器对多个预设区域的第二分割图进行编码,得到第四特征图;第五编码单元,设置为通过第五编码器对第一前景图进行编码,得到第五特征图;第三解码单元,设置为通过第三解码器对第四特征图和第五特征图的融合图进行解码,得到第二前景图。The fourth encoding unit is configured to encode the second segmentation maps of the plurality of preset areas through the fourth encoder to obtain a fourth feature map; the fifth encoding unit is configured to encode the first foreground image through the fifth encoder to obtain a fifth feature map; and the third decoding unit is configured to decode the fusion map of the fourth feature map and the fifth feature map through the third decoder to obtain the second foreground image.
在一些实施方式中,若驱动图像为视频帧,则第五编码单元,还设置为:In some implementations, if the driving image is a video frame, the fifth coding unit is also set to:
获取与当前视频帧的前预设数量个视频帧对应的历史第二前景图;通过第五编码器对第一前景图和至少一个历史第二前景图的融合图进行编码,得到第五特征图。Obtain the historical second foreground image corresponding to a preset number of video frames before the current video frame; use the fifth encoder to encode the fusion image of the first foreground image and at least one historical second foreground image to obtain a fifth feature map .
在一些实施方式中,第二生成对抗网络的训练步骤,包括:In some embodiments, the training steps of the second generative adversarial network include:
获取样本驱动图像中第一对象的每个预设区域的第三分割图;确定与样本源图像对应的第二前景图,和与样本源图像对应的前景真值图之间的第二损失;确定与样本源图像对应的第二前景图,和与样本驱动图像对应的第三分割图的第三损失;根据第二损失和第三损失,对第二生成对抗网络进行训练。Obtain a third segmentation map of each preset area of the first object in the sample-driven image; determine a second loss between the second foreground map corresponding to the sample source image and the foreground true value map corresponding to the sample source image; Determine the second foreground image corresponding to the sample source image and the third loss of the third segmentation image corresponding to the sample driven image; train the second generative adversarial network based on the second loss and the third loss.
在一些实施方式中,装置,还包括:In some embodiments, the device further includes:
纹理增强参数确定模块,设置为根据第一前景图与第二前景图,确定纹理增强参数;纹理增强模块,设置为根据纹理增强参数和第一前景图,对第二前景图进行纹理增强。The texture enhancement parameter determination module is configured to determine the texture enhancement parameters based on the first foreground image and the second foreground image; the texture enhancement module is configured to perform texture enhancement on the second foreground image based on the texture enhancement parameters and the first foreground image.
在一些实施方式中,纹理增强参数确定模块,设置为:In some implementations, the texture enhancement parameter determination module is set to:
通过第六编码器对第一前景图进行编码,得到第六特征图;通过第七编码器对第二前景图进行编码,得到第七特征图;将第六特征图和第七特征图按通道展开,分别得到第八特征图和第九特征图;将第八特征图和第九特征图的相关性矩阵,作为纹理增强参数。The first foreground image is encoded by the sixth encoder to obtain a sixth feature map; the second foreground image is encoded by the seventh encoder to obtain a seventh feature map; the sixth feature map and the seventh feature map are unfolded per channel to obtain an eighth feature map and a ninth feature map, respectively; and the correlation matrix of the eighth feature map and the ninth feature map is taken as the texture enhancement parameter.
在一些实施方式中,纹理增强模块,设置为:In some implementations, the texture enhancement module is configured to:
根据第八特征图和纹理增强参数,确定纹理增强图;将纹理增强图和第九特征图的融合图按通道整合,得到第十特征图;通过第四解码器对第十特征图进行解码,得到纹理增强后的第二前景图。The texture enhancement map is determined according to the eighth feature map and the texture enhancement parameter; the fusion map of the texture enhancement map and the ninth feature map is integrated along the channel dimension to obtain a tenth feature map; and the tenth feature map is decoded through the fourth decoder to obtain the texture-enhanced second foreground image.
在一些实施方式中,合成模块740,设置为:In some embodiments, synthesis module 740 is configured to:
根据第二分割图和关键点连接图,确定姿态掩膜图;根据姿态掩膜图和第一背景图,确定第二背景图;将第二前景图与第二背景图进行融合。The pose mask map is determined according to the second segmentation map and the key point connection map; the second background image is determined according to the pose mask map and the first background image; and the second foreground image is fused with the second background image.
在一些实施方式中,第二对象包括虚拟对象。In some implementations, the second object includes a virtual object.
本申请实施例提供的动作迁移装置,与上述实施例提供的动作迁移方法属于同一构思,未在本申请实施例中详尽描述的技术细节可参见上述实施例,并且本申请实施例与上述实施例具有相同的效果。The action migration device provided by the embodiments of this application belongs to the same concept as the action migration method provided by the above embodiments. Technical details that are not described in detail in the embodiments of this application can be found in the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
图10为本申请实施例提供的一种终端设备的硬件结构示意图。本申请实施例中的终端设备900可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、个人数字助理(Personal Digital Assistant,PDA)、平板电脑(Portable Android Device,PAD)、便携式多媒体播放器(Portable Media Player,PMP)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字电视(Television,TV)、台式计算机等等的固定终端。图10示出的终端设备900仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。FIG10 is a schematic diagram of the hardware structure of a terminal device provided in an embodiment of the present application. The terminal device 900 in the embodiment of the present application may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (Portable Media Players, PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital televisions (TVs), desktop computers, etc. The terminal device 900 shown in FIG10 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
如图10所示,终端设备900可以包括处理装置(例如中央处理器、图形处理器等)901,其可以根据存储在只读存储器(Read-Only Memory,ROM)902中的程序或者从存储装置908加载到随机访问存储器(Random Access Memory,RAM)903中的程序而执行多种适当的动作和处理。在RAM903中,还存储有终端设备900操作所需的多种程序和数据。处理装置901、ROM 902以及RAM 903通过总线904彼此相连。输入/输出(Input/Output,I/O)接口905也连接至总线904。As shown in FIG. 10 , the terminal device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform a variety of appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 to a random access memory (RAM) 903. In the RAM 903, a variety of programs and data required for the operation of the terminal device 900 are also stored. The processing device 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
通常,以下装置可以连接至I/O接口905:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置906;包括例如液晶显示器(Liquid Crystal Display,LCD)、扬声器、振动器等的输出装置907;包括例如磁带、硬盘等的存储装置908;以及通信装置909。通信装置909可以允许终端设备900与其他设备进行无线或有线通信以交换数据。虽然图10示出了具有多种装置的终端设备900,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。Generally, the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; including, for example, a Liquid Crystal Display (LCD) , an output device 907 such as a speaker, a vibrator, etc.; a storage device 908 including a magnetic tape, a hard disk, etc.; and a communication device 909. The communication device 909 may allow the terminal device 900 to communicate wirelessly or wiredly with other devices to exchange data. Although FIG. 10 shows the terminal device 900 having various means, it is not required to implement or have all the illustrated means. More or fewer means may alternatively be implemented or provided.
根据本申请的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置909从网络上被下载和安装,或者从存储装置908被安装,或者从ROM 902被安装。在该计算机程序被处理装置901执行时,执行本申请实施例提供的动作迁移方法中限定的上述功能。According to embodiments of this application, the process described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of this application include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-described functions defined in the action migration method provided by the embodiments of this application are performed.
本申请实施例提供的终端与上述实施例提供的动作迁移方法属于同一构思,未在本申请实施例中详尽描述的技术细节可参见上述实施例,并且本申请实施例与上述实施例具有相同的效果。The terminal provided by the embodiments of this application and the action migration method provided by the above embodiments belong to the same concept. Technical details that are not described in detail in the embodiments of this application can be found in the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提供的动作迁移方法。Embodiments of this application provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the action migration method provided by the above embodiments is implemented.
本申请实施例上述的计算机可读存储介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、RAM、ROM、可擦式可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)或闪存(FLASH)、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请实施例中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请实施例中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、射频(Radio Frequency,RF)等等,或者上述的任意合适的组合。The computer-readable storage medium mentioned above in the embodiments of this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. Examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, RAM, ROM, an Erasable Programmable Read-Only Memory (EPROM) or flash memory (FLASH), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of this application, the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device. In the embodiments of this application, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
在一些实施方式中,客户端、服务器可以利用诸如超文本传输协议(HyperText Transfer Protocol,HTTP)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(Local Area Network,LAN)、广域网(Wide Area Network,WAN)、网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。In some embodiments, the client and the server can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
上述计算机可读存储介质可以是上述终端设备中所包含的,也可以是单独存在,而未装配入该终端设备中的。The computer-readable storage medium may be included in the terminal device, or may exist independently without being installed in the terminal device.
上述终端设备存储承载有一个或者多个程序,当上述一个或者多个程序被该终端设备执行时,使得该终端设备:The terminal device stores and carries one or more programs. When the one or more programs are executed by the terminal device, the terminal device:
获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态;根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图;根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图;将第二前景图与源图像的第一背景图进行融合,得到动作迁移图像。Obtain a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, the key point connection map being used to characterize the driving posture of the first object; generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; generate, according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture; and fuse the second foreground image with the first background image of the source image to obtain an action migration image.
可以以一种或多种程序设计语言或其组合来编写用于执行本申请的操作的计算机程序代码,上述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括LAN或WAN—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional Procedural programming language—such as "C" or a similar programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In situations involving remote computers, the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, through the Internet using an Internet service provider).
附图中的流程图和框图,图示了按照本申请多种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in the flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
描述于本申请实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在一种情况下并不构成对该单元本身的限定。The units involved in the embodiments of this application can be implemented in software or hardware. Among them, the name of a unit does not constitute a limitation on the unit itself.
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范样式的硬件逻辑部件包括:现场可编程门阵列(Field Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用标准产品(Application Specific Standard Parts,ASSP)、片上系统(System on Chip,SOC)、复杂可编程逻辑设备(Complex Programming Logic Device,CPLD)等等。 The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary hardware logic components that may be used include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.

Claims (16)

  1. 一种动作迁移方法,包括:An action migration method, including:
    获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;其中,所述关键点连接图用于表征所述第一对象的驱动姿态;Obtain the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; wherein the key point connection diagram is used to characterize the driving posture of the first object ;
    根据所述关键点连接图和所述第一分割图,生成所述每个预设区域的符合所述驱动姿态的第二分割图;Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map;
    根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图;Generate a second foreground image of the second object in the driving posture according to the second segmentation map of the plurality of preset areas and the first foreground image of the source image;
    将所述第二前景图与所述源图像的第一背景图进行融合,得到动作迁移图像。The second foreground image is fused with the first background image of the source image to obtain an action migration image.
  2. 根据权利要求1所述的方法,在所述生成所述每个预设区域的符合所述驱动姿态的第二分割图之后,还包括:The method according to claim 1, after generating the second segmentation map of each preset area that conforms to the driving posture, further comprises:
    根据所述第一分割图和所述第二分割图确定对齐参数;Determine alignment parameters according to the first segmentation map and the second segmentation map;
    在所述根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图之前,还包括:Before generating the second foreground image of the second object in the driving posture according to the second segmentation images of the plurality of preset areas and the first foreground image of the source image, the method further includes:
    根据所述对齐参数对所述第一前景图进行变换,以使所述第一前景图与所述第二分割图对齐。The first foreground image is transformed according to the alignment parameter to align the first foreground image with the second segmentation image.
  3. 根据权利要求2所述的方法,其中,所述根据所述第一分割图和所述第二分割图确定对齐参数,包括下述至少一项:The method according to claim 2, wherein the determining alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following:
    根据所述第一分割图和所述第二分割图中所述每个预设区域的尺寸,确定缩放参数;Determine scaling parameters according to the size of each preset area in the first segmentation map and the second segmentation map;
    根据所述第一分割图和所述第二分割图中所述每个预设区域的中心坐标,确定位移参数。The displacement parameter is determined according to the center coordinates of each preset area in the first segmentation map and the second segmentation map.
  4. 根据权利要求1所述的方法,其中,所述第二分割图通过第一生成对抗网络生成,且通过所述第一生成对抗网络生成所述第二分割图,包括:The method of claim 1, wherein the second segmentation map is generated by a first generative adversarial network, and the second segmentation map is generated by the first generative adversarial network, including:
    通过第一编码器对所述第一分割图编码,得到第一特征图;The first segmentation map is encoded by a first encoder to obtain a first feature map;
    通过第二编码器对所述关键点连接图编码,得到第二特征图;Encode the key point connection map through the second encoder to obtain a second feature map;
    通过第一解码器对所述第一特征图和所述第二特征图的融合图进行解码,得到所述第二分割图。The first decoder decodes the fusion map of the first feature map and the second feature map to obtain the second segmentation map.
  5. 根据权利要求4中所述的方法,其中,在所述驱动图像为视频帧的情况下,所述方法还包括:The method according to claim 4, wherein, in a case where the driving image is a video frame, the method further comprises:
    获取与当前视频帧的前预设数量个视频帧对应的历史第二分割图;Obtain the historical second segmentation map corresponding to the previous preset number of video frames of the current video frame;
    通过第三编码器对至少一个历史第二分割图进行编码,得到第三特征图;Encode at least one historical second segmentation map through a third encoder to obtain a third feature map;
    所述通过第一解码器对所述第一特征图和所述第二特征图的融合图进行解码,得到所述第二分割图,包括:Decoding the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map includes:
    通过所述第一解码器对所述第一特征图、所述第二特征图和所述第三特征图的融合图进行解码,得到所述第二分割图。The first decoder decodes a fusion map of the first feature map, the second feature map, and the third feature map to obtain the second segmentation map.
  6. 根据权利要求5中所述的方法,在所述得到第三特征图之后,还包括:The method according to claim 5, after obtaining the third feature map, further comprising:
    通过第二解码器对所述第二特征图和所述第三特征图的融合图进行解码,得到光流参数和权重参数;The second decoder decodes the fusion map of the second feature map and the third feature map to obtain optical flow parameters and weight parameters;
    在所述得到第二分割图之后,还包括:After the second segmentation map is obtained, the method further includes:
    根据与所述当前视频帧的前一视频帧对应的历史第二分割图、所述光流参数和所述权重参数,对所述第二分割图进行调整。The second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameter and the weight parameter.
  7. 根据权利要求4中所述的方法,其中,所述第一生成对抗网络的训练方式,包括:The method according to claim 4, wherein the training method of the first generative adversarial network includes:
    获取样本驱动图像中第一对象的每个预设区域的第三分割图;Obtaining a third segmentation map of each preset area of the first object in the sample-driven image;
    确定与样本源图像对应的第二分割图,和与所述样本驱动图像对应的第三分割图的第一损失;determining a second segmentation map corresponding to the sample source image and a first loss of a third segmentation map corresponding to the sample driving image;
    根据所述第一损失,对所述第一生成对抗网络进行训练。The first generative adversarial network is trained according to the first loss.
  8. 根据权利要求1所述的方法,其中,所述第二前景图通过第二生成对抗网络生成,且通过所述第二生成对抗网络生成所述第二前景图,包括:The method of claim 1, wherein the second foreground image is generated by a second generative adversarial network, and the second foreground image is generated by the second generative adversarial network, including:
    通过第四编码器对所述多个预设区域的第二分割图进行编码,得到第四特征图;Encode the second segmentation maps of the plurality of preset areas through a fourth encoder to obtain a fourth feature map;
    通过第五编码器对所述第一前景图进行编码,得到第五特征图;The first foreground image is encoded by a fifth encoder to obtain a fifth feature map;
    通过第三解码器对所述第四特征图和所述第五特征图的融合图进行解码,得到所述第二前景图。The third decoder decodes the fused image of the fourth feature map and the fifth feature map to obtain the second foreground image.
  9. 根据权利要求8中所述的方法,其中,在所述驱动图像为视频帧的情况下,所述通过第五编码器对所述第一前景图进行编码,得到第五特征图,包括:The method according to claim 8, wherein when the driving image is a video frame, the first foreground image is encoded by a fifth encoder to obtain a fifth feature map, including:
    获取与当前视频帧的前预设数量个视频帧对应的历史第二前景图;Obtain the historical second foreground image corresponding to the previous preset number of video frames of the current video frame;
    通过所述第五编码器对所述第一前景图和至少一个历史第二前景图的融合图进行编码,得到第五特征图。The fusion map of the first foreground image and at least one historical second foreground image is encoded by the fifth encoder to obtain a fifth feature map.
  10. 根据权利要求8所述的方法,其中,所述第二生成对抗网络的训练方式,包括:The method according to claim 8, wherein the training method of the second generative adversarial network includes:
    获取样本驱动图像中第一对象的每个预设区域的第三分割图;Obtaining a third segmentation map of each preset area of the first object in the sample-driven image;
    确定与样本源图像对应的第二前景图,和与所述样本源图像对应的前景真值图之间的第二损失;Determine a second loss between the second foreground image corresponding to the sample source image and the foreground ground truth map corresponding to the sample source image;
    确定与样本源图像对应的第二前景图,和与所述样本驱动图像对应的第三分割图的第三损失;Determining a second foreground image corresponding to the sample source image, and a third loss of a third segmentation map corresponding to the sample driven image;
    根据所述第二损失和所述第三损失,对所述第二生成对抗网络进行训练。The second generative adversarial network is trained according to the second loss and the third loss.
  11. 根据权利要求1所述的方法,在所述生成所述驱动姿态下所述第二对象的第二前景图之后,还包括:The method according to claim 1, after generating the second foreground image of the second object in the driving posture, further comprising:
    根据所述第一前景图与所述第二前景图,确定纹理增强参数;Determine texture enhancement parameters according to the first foreground image and the second foreground image;
    根据所述纹理增强参数和所述第一前景图,对所述第二前景图进行纹理增强。Texture enhancement is performed on the second foreground image according to the texture enhancement parameter and the first foreground image.
  12. 根据权利要求1所述的方法,其中,所述将所述第二前景图与所述源图像的第一背景图进行融合,包括:The method according to claim 1, wherein said fusing the second foreground image with the first background image of the source image includes:
    根据所述第二分割图和所述关键点连接图,确定姿态掩膜图;A pose mask map is determined according to the second segmentation map and the key point connection map;
    根据所述姿态掩膜图和所述第一背景图,确定第二背景图;Determine a second background image according to the posture mask image and the first background image;
    将所述第二前景图与所述第二背景图进行融合。Fusion of the second foreground image and the second background image.
  13. 根据权利要求1-12中任一所述的方法,其中,所述第二对象包括虚拟对象。The method of any one of claims 1-12, wherein the second object includes a virtual object.
  14. 一种动作迁移装置,包括:A motion transfer device including:
    图像获取模块,设置为获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;其中,所述关键点连接图用于表征所述第一对象的驱动姿态;The image acquisition module is configured to acquire the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; wherein the key point connection diagram is used to characterize the The driving posture of the first object;
    第一生成模块,设置为根据所述关键点连接图和所述第一分割图,生成所述每个预设区域的符合所述驱动姿态的第二分割图;A first generation module configured to generate a second segmentation map of each preset area that conforms to the driving posture based on the key point connection map and the first segmentation map;
    第二生成模块,设置为根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图;A second generation module configured to generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image;
    合成模块,设置为将所述第二前景图与所述源图像的第一背景图进行融合,得到动作迁移图像。a synthesis module configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  15. 一种终端设备,包括:A terminal device including:
    至少一个处理器;at least one processor;
    存储器,设置为存储至少一个程序;a memory configured to store at least one program;
    当所述至少一个程序被所述至少一个处理器执行,使得所述至少一个处理器实现如权利要求1-13中任一所述的动作迁移方法。When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the action migration method as described in any one of claims 1-13.
  16. 一种计算机可读存储介质,存储有计算机程序,所述程序被处理器执行时实现如权利要求1-13中任一所述的动作迁移方法。 A computer-readable storage medium stores a computer program. When the program is executed by a processor, the action migration method as described in any one of claims 1-13 is implemented.
PCT/CN2023/097712 2022-09-21 2023-06-01 Action migration method and apparatus, and terminal device and storage medium WO2024060669A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211154081.1 2022-09-21
CN202211154081.1A CN115471658A (en) 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024060669A1 true WO2024060669A1 (en) 2024-03-28

Family

ID=84335422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097712 WO2024060669A1 (en) 2022-09-21 2023-06-01 Action migration method and apparatus, and terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN115471658A (en)
WO (1) WO2024060669A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471658A (en) * 2022-09-21 2022-12-13 北京京东尚科信息技术有限公司 Action migration method and device, terminal equipment and storage medium
CN116664603B (en) * 2023-07-31 2023-12-12 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN115471658A (en) * 2022-09-21 2022-12-13 北京京东尚科信息技术有限公司 Action migration method and device, terminal equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUI LI, LI TENG: "Pose-guided scene-preserving person video generation algorithm", JOURNAL OF GRAPHICS, vol. 41, no. 4, 22 July 2020 (2020-07-22), pages 539 - 547, XP093148839, ISSN: 2095-302X, DOI: 10.11996/JG.j.2095-302X.2020040539 *
李桂(LI, GUI): "高质量任意人体姿态图像视频生成研究 (Non-official translation: Research on High-Quality Arbitrary Human Body Pose Image/Video Generation)", 中国优秀硕士学位论文全文数据库信息科技辑 (INFORMATION AND TECHNOLOGY, CHINA MASTER'S THESES FULL-TEXT DATABASE), no. 7, 15 July 2020 (2020-07-15), pages 8 - 43, XP09314883, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN115471658A (en) 2022-12-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23866966

Country of ref document: EP

Kind code of ref document: A1