WO2024060669A1 - Action migration method and apparatus, and terminal device and storage medium - Google Patents

Action migration method and apparatus, and terminal device and storage medium

Info

Publication number
WO2024060669A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
map
segmentation
foreground image
foreground
Prior art date
Application number
PCT/CN2023/097712
Other languages
French (fr)
Chinese (zh)
Inventor
LIU Xinchen (刘鑫辰)
LIU Wu (刘武)
YANG Quanwei (杨权威)
MEI Tao (梅涛)
Original Assignee
Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)
Publication of WO2024060669A1 publication Critical patent/WO2024060669A1/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Definitions

• This application relates to the field of image processing technology, and for example to an action migration method and apparatus, a terminal device, and a storage medium.
• Action transfer refers to generating a new video based on a source image and a driving video.
• The new video contains the character from the source image, and that character performs the same actions as the character in the driving video.
• In the related art, an affine transformation (which may be called a warp operation) is usually performed on the source image, or on its encoded feature map, according to the driving video to generate the new video.
  • This application provides action migration methods, devices, terminal equipment and storage media, which can better adapt to scenes with drastically different postures, ensure the authenticity of the characters in the generated videos, and improve the user's visual experience.
• this application provides an action migration method, including:
• the key point connection map is used to characterize the driving posture of the first object;
• the second foreground image is fused with the first background image of the source image to obtain an action migration image.
  • this application provides an action migration device, including:
• the image acquisition module is configured to acquire the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image; the key point connection map is used to characterize the driving posture of the first object;
  • a first generation module configured to generate a second segmentation map of each preset area that conforms to the driving posture based on the key point connection map and the first segmentation map;
  • a second generation module configured to generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image;
  • a synthesis module configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  • this application provides a terminal device, including:
• one or more processors;
• a memory configured to store one or more programs;
• when the one or more programs are executed by the one or more processors, the one or more processors implement the above-mentioned action migration method.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
• When the program is executed by a processor, the above-mentioned action migration method is implemented.
  • Figure 1 is a flow chart of an action migration method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 3 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 4 is a schematic architectural diagram of a local generation network provided by an embodiment of the present application.
  • Figure 5 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 6 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of the architecture of an overall synthesis network provided by an embodiment of the present application.
  • Figure 8 is a flow chart of another action migration method provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of an action migration device provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
  • Figure 1 is a flow chart of an action migration method provided by an embodiment of the present application.
  • the action migration method provided by an embodiment of the present application can be applied to the situation of migrating the posture of objects in images and/or videos, such as the situation of human body movement migration.
  • the method can be executed by an action migration device, which is implemented in software and/or hardware, for example, configured in a terminal device, such as a computer device.
  • the action migration method provided in the embodiment of this application may include the following steps:
  • action migration refers to the process of generating a new image based on the source image and the driving image.
• the new image contains the second object from the source image, and the second object performs the same action as the first object in the driving image.
  • the driving image refers to an image with a driving posture.
  • the driving images may be video frames in the driving video.
  • the first object in the driving image refers to a person or other area of interest, which is not limited here.
  • the key point connection diagram refers to a posture connection diagram in which multiple key points of the first object are connected in a predefined connection manner.
  • the source image refers to an image with a source pose of the second object
  • the second object in the source image refers to a person or other area of interest.
  • the first segmentation map refers to the segmentation map of each preset area of the second object, which may include but is not limited to the body part area segmentation map and the background area segmentation map.
• The body parts of the second object may include, but are not limited to, the head, tops, bottoms, shoes, and limbs, which are not limited here.
  • the number of channels of the first segmentation map can be determined according to the number of divided body parts, such as 18 channels, 6 channels, 5 channels, etc., which are not limited here.
• The terms first object and second object are only used to distinguish the object in the driving image from the object in the source image, and are not necessarily used to describe a specific order or sequence.
  • the first object may be a driving character
  • the second object may be a source character
  • the preset area may be a body part of the character or other areas of interest.
• the human body key point detection model OpenPose can be used to predict the driving image and obtain the two-dimensional key points of the driving character.
• OpenPose is an open-source two-dimensional human body key point detection model; the two-dimensional key points of the driving character are connected according to the predefined connection method to obtain the key point connection map of the driving character.
• the key point connection map may be a Red-Green-Blue (RGB) key point connection map, and H × W represents the resolution of the image.
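• To make this step concrete, the following minimal sketch (illustrative only, not taken from the application) rasterises OpenPose-style 2D keypoints into an H × W RGB connection map; the limb pairs, colours, and confidence threshold are assumptions, since the application only states that a predefined connection scheme is used:

```python
import cv2
import numpy as np

# Illustrative subset of OpenPose-style limb pairs (indices into the
# detected keypoint array); the application's actual "predefined
# connection method" is not enumerated, so this list is an assumption.
LIMB_PAIRS = [(0, 1), (1, 2), (2, 3), (3, 4),    # head and right arm
              (1, 5), (5, 6), (6, 7),            # left arm
              (1, 8), (8, 9), (9, 10),           # right leg
              (1, 11), (11, 12), (12, 13)]       # left leg

def draw_keypoint_connection_map(keypoints, h, w):
    """Rasterise (K, 3) keypoints of (x, y, confidence) into an
    H x W RGB key point connection map."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for idx, (a, b) in enumerate(LIMB_PAIRS):
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca < 0.1 or cb < 0.1:  # skip limbs with low-confidence joints
            continue
        # one distinct colour per limb, sampled from an HSV colormap
        color = cv2.applyColorMap(
            np.uint8([[idx * 255 // len(LIMB_PAIRS)]]),
            cv2.COLORMAP_HSV)[0, 0]
        cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)),
                 tuple(int(c) for c in color), thickness=3)
    return canvas
```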
• In an embodiment, this application uses the Self-Correction for Human Parsing (SCHP) model to obtain an 18-channel semantic segmentation map of the source image; taking into account the texture characteristics of different parts of the human body, the 18 channels of the semantic segmentation map are merged into 6 channels, namely head, tops, bottoms, shoes, limbs, and background, thereby obtaining the first segmentation map of the preset areas of the source character in the source image.
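• The 18-to-6 channel merge can be implemented as a simple channel-group summation. The sketch below assumes a one-hot (or soft) parsing tensor; the exact label-to-group assignment used by the application is not disclosed, so the grouping here is hypothetical:

```python
import torch

# Hypothetical grouping of the 18 SCHP semantic labels into the six
# coarse regions named in the text (head, tops, bottoms, shoes, limbs,
# background); the real label ids may differ.
GROUPS = {
    0: [1, 2, 4, 13],        # head: hat, hair, sunglasses, face
    1: [5, 7, 10, 11],       # tops: upper clothes, coat, jumpsuit, scarf
    2: [6, 8, 9, 12],        # bottoms: dress, socks, pants, skirt
    3: [14, 15],             # shoes: left / right shoe
    4: [3, 16, 17],          # limbs: arms and legs
    5: [0],                  # background
}

def merge_parsing_channels(parsing18: torch.Tensor) -> torch.Tensor:
    """(18, H, W) semantic segmentation map -> (6, H, W) merged map."""
    merged = torch.zeros(6, *parsing18.shape[1:], dtype=parsing18.dtype)
    for out_ch, in_chs in GROUPS.items():
        merged[out_ch] = parsing18[in_chs].sum(dim=0)
    return merged
```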
  • the second segmentation map refers to a preset area analysis map of the second object in the driving posture.
  • the second segmentation map includes a preset area of the second object that conforms to the driving posture.
  • the key point connection map and the first segmentation map can be input into a pre-trained first neural network model to obtain a second segmentation map, thereby realizing the transformation of the segmentation map corresponding to the second object from the source pose to the driving pose.
  • the network structure of the first neural network model is not limited here, for example, the network structure can be composed of at least one encoder and at least one decoder.
  • the second foreground image refers to a preset area foreground image of the second object in the driving posture, where the foreground image refers to the object area in the image excluding the background.
  • the second foreground image is composed of a plurality of preset area foreground images of the second object that conform to the driving posture.
• the second segmentation maps of multiple preset areas and the first foreground image of the source image can be input into the pre-trained second neural network model to obtain the foreground images of the preset areas of the second object under the driving posture, and the preset-area foreground images under the driving posture are combined to obtain the second foreground image, which realizes the transformation of the foreground image corresponding to the second object from the source posture to the driving posture.
  • the network structure of the second neural network model is not limited here.
  • the network structure may consist of at least one encoder and at least one decoder.
  • the motion transfer image refers to an image with the source image background as the background and including the second object in the driving posture. That is, the motion transfer image refers to an image in which the first object's posture is transferred to the second object.
  • the foreground in the second foreground image can be embedded into the corresponding position of the first background image of the source image to achieve image fusion.
  • the first foreground image and the second foreground image can also be texture aligned, and the foreground in the texture-aligned second foreground image can be embedded into the corresponding position of the first background image of the source image to achieve image fusion.
  • the image fusion method is not limited here.
• In the action migration method provided by this application, the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image are used to obtain a second segmentation map of each preset area that conforms to the driving posture, which realizes the transformation of the segmentation map corresponding to the second object from the source posture to the driving posture; according to the second segmentation maps of multiple preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture is generated, which realizes the transformation of the foreground image corresponding to the second object from the source posture to the driving posture and assigns the texture of the second object of the source image to the second segmentation map.
• The second foreground image is then fused with the first background image of the source image to obtain a realistic action migration image. This application abandons the conventional warp operation between the source pose and the driving pose, so it can better adapt to scenarios with drastically different postures, ensure the authenticity of the characters in the generated video, and improve the user's visual experience.
  • Figure 2 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, after generating the second segmentation map of each preset area that conforms to the driving posture, the method further includes: determining alignment parameters based on the first segmentation map and the second segmentation map. Before generating the second foreground image of the second object in the driving posture, the method further includes: transforming the first foreground image according to the alignment parameters, so that the first foreground image is aligned with the second segmentation map.
  • the method of this embodiment may include:
  • S220 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
• the alignment parameters refer to parameters used to align the first foreground image and the second segmentation map. Aligning the first foreground image with the second segmentation map through the alignment parameters compensates for large differences in size and spatial position between the source character and the driving character.
• In an embodiment, determining the alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following: determining a scaling parameter according to the sizes of the preset areas in the first segmentation map and the second segmentation map; and determining a displacement parameter according to the center coordinates of the preset areas in the first segmentation map and the second segmentation map.
  • the scaling parameter refers to the parameter that controls the zoom size of the image.
  • the first mask height is determined based on the first segmentation map, and the second mask height is determined based on the second segmentation map; the scaling parameter is determined based on the first mask height and the second mask height.
  • the displacement parameter is a parameter that characterizes the position offset of the image.
  • the first mask center coordinate is determined based on the first segmentation map, and the second mask center coordinate is determined based on the second segmentation map; the displacement parameter is determined based on the first mask center coordinate and the second mask center coordinate.
• In an embodiment, the scaling parameter can be written as R = H_d / H_s, where R represents the scaling parameter, H_s represents the height of the human-body mask in the first segmentation map, and H_d represents the height of the human-body mask in the second segmentation map. The displacement parameter can be written as c = (c_x, c_y), where c represents the displacement parameter, c_x represents the horizontal difference between the first mask center coordinate and the second mask center coordinate, and c_y represents the vertical difference between the first mask center coordinate and the second mask center coordinate.
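• A minimal sketch of computing these alignment parameters from two binary human-body masks follows; whether R is defined as H_d / H_s or its inverse is a convention the application does not spell out, so the direction below is an assumption:

```python
import torch

def alignment_params(mask_src: torch.Tensor, mask_drv: torch.Tensor):
    """Scaling and displacement parameters from two (H, W) boolean
    human-body masks taken from the first / second segmentation map."""
    ys_s, xs_s = torch.nonzero(mask_src, as_tuple=True)
    ys_d, xs_d = torch.nonzero(mask_drv, as_tuple=True)
    h_s = (ys_s.max() - ys_s.min()).float()   # first mask height H_s
    h_d = (ys_d.max() - ys_d.min()).float()   # second mask height H_d
    r = h_d / h_s                             # scaling parameter R (assumed ratio)
    c_x = xs_d.float().mean() - xs_s.float().mean()  # horizontal centre offset
    c_y = ys_d.float().mean() - ys_s.float().mean()  # vertical centre offset
    return r, (c_x, c_y)
```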
• The first foreground image is transformed through the alignment parameters so that it is aligned with the second segmentation map, which lays the foundation for subsequently generating the second foreground image of the second object in the driving posture from the second segmentation maps of multiple preset areas and the first foreground image, and improves the image quality of the second foreground image.
• On the basis of the above embodiments, the embodiment of the present application adds: "determine the alignment parameters according to the first segmentation map and the second segmentation map; transform the first foreground image according to the alignment parameters, so that the first foreground image and the second segmentation map are aligned".
  • the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept.
  • Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
  • FIG 3 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, the second segmentation map is generated through a first generative adversarial network, and the step of generating the second segmentation map through the first generative adversarial network includes: encoding the first segmentation map through a first encoder to obtain a first feature map; encoding the key point connection map through a second encoder to obtain a second feature map; and decoding the fusion map of the first feature map and the second feature map through a first decoder to obtain the second segmentation map.
  • the method in this embodiment may include:
  • S330 Encode the key point connection map through the second encoder to obtain the second feature map.
  • S340 Use the first decoder to decode the fusion map of the first feature map and the second feature map to obtain a second segmentation map.
  • the second segmentation map may be generated by a first generative adversarial network
  • the first generative adversarial network may include a first encoder, a second encoder, and a first decoder.
  • the first encoder is used to encode the first segmentation graph
  • the second encoder is used to encode the key point connection graph
• the first decoder is used to decode the fusion map of the encoding results of the first encoder and the second encoder, which results in a clearer and sharper second segmentation map.
  • the first encoder, the second encoder and the first decoder are used to distinguish the different functions of the encoder or decoder, and are not necessarily used to describe a specific order or sequence.
• In an embodiment, the method may further include: obtaining the historical second segmentation maps corresponding to a first preset number of video frames preceding the current video frame; and encoding at least one historical second segmentation map through a third encoder to obtain a third feature map.
• Correspondingly, decoding the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map includes: decoding the fusion map of the first feature map, the second feature map, and the third feature map through the first decoder to obtain the second segmentation map.
  • the historical second segmentation map refers to one or more second segmentation images before the current video frame.
  • the historical second segmentation map is input to the first generative adversarial network, so that the first generative adversarial network can effectively extract the relationship between different video frames, thereby improving the temporal consistency of the video.
• In an embodiment, the method further includes: decoding the fusion map of the second feature map and the third feature map through a second decoder to obtain optical flow parameters and weight parameters; after obtaining the second segmentation map, the method further includes: adjusting the second segmentation map according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
• The second decoder is used to decode the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and weight parameters, where the optical flow parameters refer to the instantaneous velocity of pixel movement of a moving object on the observation imaging plane.
  • Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application.
  • the local generation network is used to generate images of each area of the source character in the driving posture; the local generation network includes a first generative adversarial network.
  • the first generative adversarial network can be a Layout generative adversarial network (GAN) with a vid2vid framework.
• the first generative adversarial network includes three encoders and two decoders.
• The first encoder may be an encoder E_1^l that encodes the first segmentation map to obtain the first feature map, where l is the identifier of Layout GAN and indicates that the first encoder belongs to the Layout GAN network.
• The second encoder may be an encoder E_2^l used to encode multiple key point connection maps stitched along the channel dimension to obtain the second feature map, where t represents the current moment and t-1 and t-2 represent the two consecutive historical moments before the current moment. The third encoder may be an encoder E_3^l used to encode the two historical second segmentation maps generated at the previous moments to obtain the third feature map. The first decoder D_1^l decodes the added features to obtain the raw result; similarly, the second decoder D_2^l decodes the added features to obtain the optical flow parameter O and its weight parameter w. The final second segmentation map of Layout GAN can then be formulated as: final layout = w ⊙ Warp(previous layout, O) + (1 − w) ⊙ raw layout.
  • Warp(I,O) represents the affine transformation of image I based on the optical flow parameter O.
• the optical flow and warp operations used in this implementation are based on adjacent frames, and their purpose is to improve the temporal consistency of the generated video; they are not a transformation between the source image and the driving image.
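• The adjacent-frame fusion above can be sketched with a standard backward warp; this is a generic PyTorch implementation under the assumption that the optical flow is expressed in pixels, not the application's actual code:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (N, C, H, W) by pixel-space optical flow
    (N, 2, H, W) using bilinear sampling."""
    n, _, h, w = image.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((gx, gy), dim=0).float().to(image.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # displaced pixel coords
    # normalise to [-1, 1] as required by grid_sample
    gx_n = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy_n = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx_n, gy_n), dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def fuse_layout(raw_layout, prev_layout, flow, weight):
    # final layout = w * Warp(prev, O) + (1 - w) * raw, per the formula above
    return weight * warp(prev_layout, flow) + (1.0 - weight) * raw_layout
```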
  • the training step of the first generative adversarial network may include: obtaining a third segmentation map of each preset area of the first object in the sample driving image; determining a first loss of the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the first generative adversarial network according to the first loss.
  • the third segmentation map refers to a segmentation map driving a preset area of the first object in the image.
• the first generative adversarial network can be trained in advance through multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data, that is, the character in the sample driving image and the character in the sample source image are the same.
• For example, a forward-facing human video frame in a video is selected as the sample source image, and the video itself is used as the sample driving video.
• The forward-facing human body video frame is selected as the source image because it contains more appearance details of the source character.
• In an embodiment, the first loss may be a cross-entropy loss: the cross-entropy loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image is calculated, and the network parameters of the first generative adversarial network are adjusted based on this loss so that the cross-entropy loss gradually decreases and becomes stable, until network training is completed and the first generative adversarial network is obtained.
  • the embodiments of the present application refine the technical features of generating the second segmentation map.
  • the embodiments of the present application and the action migration method proposed in the above embodiments belong to the same concept.
• Technical details that are not described in detail in this embodiment can be referred to the above embodiments, and this embodiment has the same effect as the above embodiments.
  • FIG. 5 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, the second foreground image is generated through a second generative adversarial network, and the step of generating the second foreground image through the second generative adversarial network may include: encoding the second segmentation maps of the plurality of preset areas through a fourth encoder to obtain a fourth feature map; encoding the first foreground image through a fifth encoder to obtain a fifth feature map; and decoding the fusion map of the fourth feature map and the fifth feature map through a third decoder to obtain the second foreground image.
  • the method in this embodiment may include:
  • S420 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S430 Use the fourth encoder to encode the second segmentation maps of the plurality of preset areas to obtain a fourth feature map.
  • S440 Encode the first foreground image through the fifth encoder to obtain the fifth feature map.
  • S450 Decode the fusion image of the fourth feature map and the fifth feature map through a third decoder to obtain a second foreground image.
  • S460 Fusing the second foreground image with the first background image of the source image to obtain an action migration image.
  • the second foreground image is generated by a second generative adversarial network.
• In this embodiment, the second generative adversarial network may include a fourth encoder, a fifth encoder, and a third decoder, where the fourth encoder is used to encode the second segmentation maps of multiple preset areas, the fifth encoder is used to encode the first foreground image, and the third decoder is used to decode the fusion map of the fourth feature map and the fifth feature map to obtain the second foreground image.
  • the fourth encoder, the fifth encoder and the third decoder are used to distinguish between encoders or decoders with different functions, and are not necessarily used to describe a specific order or sequence.
• In an embodiment, encoding the first foreground image through the fifth encoder to obtain the fifth feature map may include: obtaining the historical second foreground images corresponding to a preset number of video frames preceding the current video frame; and encoding, through the fifth encoder, the fusion map of the first foreground image and at least one historical second foreground image to obtain the fifth feature map.
  • the historical second foreground image refers to one or more second foreground images before the current video frame.
• The historical second foreground images are input into the second generative adversarial network so that it can effectively extract the relationship between different video frames, thereby improving the temporal consistency of the video.
  • Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application; the local generation network also includes a second generation adversarial network.
• the second generative adversarial network in this embodiment can be a Region GAN with a vid2vid framework.
• The Region GAN is only used to generate the initial region images, so this embodiment uses one generator to generate the 5 regions of the human body (no background region is generated). Doing so not only saves computing resources, but also prevents the model from overfitting.
• In an embodiment, the second generative adversarial network may include a fourth encoder E_1^r, a fifth encoder E_2^r, and a third decoder D^r, where r is the identifier of Region GAN and indicates that they belong to the Region GAN network.
• The fourth encoder E_1^r encodes the mask of the i-th preset area to obtain the fourth feature map, and the fifth encoder E_2^r encodes the maps spliced along the channel dimension to obtain the fifth feature map, where the spliced input consists of the historical second foreground images of the i-th preset area before the current time t together with I_{s,i}, the first foreground image of the i-th preset area. The original Region GAN output for the i-th preset area can thus be expressed as the result of decoding, with D^r, the sum of the fourth feature map and the fifth feature map.
• this application proposes to use a global alignment module (GAM) to perform an affine transformation on the first foreground image FG_s so that it matches the corresponding second segmentation map. First, the human-body masks are calculated from the first segmentation map L_s and the second segmentation map; the first foreground image FG_s of the source image I_s can then be obtained as FG_s = I_s ⊙ M_s, where M_s is the human-body mask of the source image.
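• The GAM step reduces to masking out the source foreground and applying a scale-and-shift affine warp. The sketch below uses the alignment convention assumed earlier and is not the application's own implementation; the sign and scale conventions of `theta` are assumptions:

```python
import torch
import torch.nn.functional as F

def extract_foreground(image, mask):
    """FG_s = I_s * M_s: keep only the human-body pixels (N, C, H, W)."""
    return image * mask.unsqueeze(1)

def global_align(fg_src, r, c_x, c_y, height, width):
    """Scale-and-shift affine transform of the source foreground so it
    matches the driving layout."""
    # affine_grid works in normalised coordinates, so convert pixel offsets
    theta = torch.tensor([[1.0 / r, 0.0, -2.0 * c_x / width],
                          [0.0, 1.0 / r, -2.0 * c_y / height]],
                         dtype=fg_src.dtype, device=fg_src.device)
    theta = theta.unsqueeze(0).repeat(fg_src.size(0), 1, 1)  # (N, 2, 3)
    grid = F.affine_grid(theta, fg_src.shape, align_corners=True)
    return F.grid_sample(fg_src, grid, align_corners=True)
```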
• In an embodiment, the training step of the second generative adversarial network may include: obtaining a third segmentation map of each preset area of the first object in the sample driving image; determining a second loss between the second foreground image corresponding to the sample source image and the foreground ground-truth image corresponding to the sample source image; determining a third loss between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the second generative adversarial network according to the second loss and the third loss.
  • the second loss may include reconstruction loss and perceptual loss;
  • the third loss may be adversarial loss, that is, image distribution loss.
  • the second generative adversarial network can be pre-trained using a plurality of sample driving images and sample source images, wherein the sample driving images and sample source images can be paired training data.
• The reconstruction loss and perceptual loss are calculated between the second foreground image corresponding to the sample source image and the foreground ground-truth image corresponding to the sample source image, and the adversarial loss is calculated between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image.
  • the network parameters of the second generative adversarial network are adjusted so that the reconstruction loss, perceptual loss and adversarial loss gradually decrease and tend to be stable until the network training is completed, thereby obtaining the second generative adversarial network.
• this embodiment uses the L1 reconstruction loss. Compared with the L2 reconstruction loss, the L1 reconstruction loss pays more attention to the subtle differences between the generated image and the real image.
• The calculation formula is L_rec = ||F_gen − F_real||_1, where F_gen denotes the generated second foreground image and F_real the corresponding ground truth.
• the perceptual loss is used to constrain the generated image and the real image to be close in a multi-dimensional feature space.
• The perceptual loss includes a feature content loss and a feature style loss, which can be expressed as L_per = Σ_j ( ||φ_j(F_gen) − φ_j(F_real)||_1 + ||G(φ_j(F_gen)) − G(φ_j(F_real))||_1 ),
• where j represents the j-th layer of the pre-trained Visual Geometry Group (VGG)-19 model
• and G represents the Gram matrix calculated from the feature map.
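• A compact PyTorch sketch of these two losses follows; the VGG-19 tap layers and the use of L1 distance inside the perceptual terms are assumptions, as the application only names the loss families:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class ReconPerceptualLoss(nn.Module):
    """L1 reconstruction + VGG-19 content / Gram-style perceptual loss."""

    def __init__(self, taps=(3, 8, 17, 26)):   # assumed VGG-19 tap layers
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps = set(taps)
        self.l1 = nn.L1Loss()

    @staticmethod
    def gram(feat):
        n, c, h, w = feat.shape
        f = feat.reshape(n, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)  # Gram matrix G

    def forward(self, fake, real):
        loss = self.l1(fake, real)                  # L1 reconstruction loss
        x, y = fake, real
        for j, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if j in self.taps:
                loss = loss + self.l1(x, y)                         # content term
                loss = loss + self.l1(self.gram(x), self.gram(y))   # style term
        return loss
```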
• For the adversarial loss, this application uses the multi-scale conditional discriminator proposed in pix2pixHD, which takes the image and the corresponding area mask together as its input.
• The overall generator objective combines the three losses and can be written as L = L_adv + λ_rec · L_rec + λ_per · L_per, where λ_rec and λ_per are the weights of the reconstruction loss and the perceptual loss respectively.
• the training process of the first generative adversarial network and the second generative adversarial network also includes training the corresponding discriminators, whose loss function is the standard adversarial discriminator loss.
  • the embodiment of the present application adds detailed features for determining the second foreground image.
  • the action migration method proposed in the embodiments of the present application and the above-mentioned embodiments belong to the same concept.
• Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiments, and this embodiment has the same effect as the above-mentioned embodiments.
  • Figure 6 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, after generating the second foreground image of the second object in the driving posture, the method further includes: determining texture enhancement parameters according to the first foreground image and the second foreground image; and performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
  • the method in this embodiment may include:
  • S520 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S530 Generate a second foreground image of the second object in the driving posture according to the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
  • S550 Perform texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
  • S560 Fusion of the second foreground image and the first background image of the source image to obtain a motion migration image.
  • the texture enhancement parameters refer to adjustment parameters used to enhance the image texture.
• the texture enhancement parameters are used to align the features of the first foreground image with the features of the second foreground image to retain more details, such as clothing texture and body edges.
• In an embodiment, determining the texture enhancement parameters according to the first foreground image and the second foreground image may include: encoding the first foreground image through a sixth encoder to obtain a sixth feature map; encoding the second foreground image through a seventh encoder to obtain a seventh feature map; expanding the sixth feature map and the seventh feature map along the channel dimension to obtain an eighth feature map and a ninth feature map respectively; and using the correlation matrix of the eighth feature map and the ninth feature map as the texture enhancement parameter.
  • Figure 7 is a schematic architectural diagram of an overall synthesis network provided by an embodiment of the present application.
• The overall synthesis network is used to integrate the images of the different regions generated by the local generation network to produce the final action migration image; at the same time, it generates an appropriate background for the action migration image. The loss of the overall synthesis network is the loss between the generated action migration image and the driving image.
• this embodiment proposes a texture alignment module (TAM) to better fuse the feature maps.
• The correlation matrix between the eighth feature map H_1 and the ninth feature map H_2 is computed from their position-wise features, where H_{1,i} represents the feature of H_1 at position i and H_{2,j} represents the feature of H_2 at position j.
• In an embodiment, performing texture enhancement on the second foreground image according to the texture enhancement parameter and the first foreground image may include: determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter; integrating the fusion map of the texture enhancement map and the ninth feature map along the channel dimension to obtain a tenth feature map; and decoding the tenth feature map through a fourth decoder to obtain the texture-enhanced second foreground image.
• The tenth feature map is obtained by applying the texture enhancement parameter to the eighth feature map to produce the texture enhancement map, and then integrating that map with the ninth feature map along the channel dimension; the texture-enhanced second foreground image is then obtained by decoding the tenth feature map with the fourth decoder.
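• An illustrative sketch of this texture alignment computation follows, with flattened feature maps, a dot-product correlation matrix, and a 1x1-convolution channel integration; the softmax normalisation over source positions is an assumption:

```python
import torch
import torch.nn as nn

def texture_align(h1, h2, integrate):
    """h1: eighth feature map (source), h2: ninth feature map (generated),
    both (N, C, H, W); `integrate` is a 1x1 conv acting as the channel
    integration that yields the tenth feature map."""
    n, c, hh, ww = h1.shape
    f1 = h1.reshape(n, c, hh * ww)               # H1, flattened over positions
    f2 = h2.reshape(n, c, hh * ww)               # H2, flattened over positions
    corr = torch.einsum("nci,ncj->nij", f1, f2)  # A[i, j] = H1_i . H2_j
    attn = corr.softmax(dim=1)                   # weights over source positions i
    enhanced = torch.einsum("nci,nij->ncj", f1, attn)  # texture enhancement map
    enhanced = enhanced.reshape(n, c, hh, ww)
    return integrate(torch.cat([enhanced, h2], dim=1))  # tenth feature map

# usage sketch (the channel count 256 is arbitrary)
# integrate = nn.Conv2d(2 * 256, 256, kernel_size=1)
```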
  • the embodiments of the present application add the technical feature of texture enhancement to achieve better fusion of feature maps.
  • the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept.
  • Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
  • FIG 8 is a flow chart of another action migration method provided by an embodiment of the present application.
  • the method of this embodiment can be combined with multiple solutions of the action migration method provided in the above embodiments.
  • the action migration method provided in this embodiment is explained.
• In this embodiment, fusing the second foreground image with the first background image of the source image may include: determining a pose mask map based on the second segmentation map and the key point connection map; determining a second background image based on the pose mask map and the first background image; and fusing the second foreground image with the second background image.
  • the method in this embodiment may include:
  • S620 Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
  • S630 Generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
• The pose mask map refers to a soft mask image containing the posture of the first object; the corresponding position of the first background image is masked through the pose mask map to obtain a second background image in which the region of the first object is covered.
  • the fifth decoder decodes the fusion map of the seventh feature map and the eleventh feature map to obtain the pose mask map.
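• The final composition can then be sketched as straightforward soft-mask blending:

```python
def compose(fg, bg_src, pose_mask):
    """fg: generated second foreground image, bg_src: first background
    image, pose_mask: soft mask in [0, 1] with shape (N, 1, H, W).
    A minimal sketch of the fusion described above."""
    bg = bg_src * (1.0 - pose_mask)   # second background image
    return fg * pose_mask + bg        # action migration image
```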
  • the second object includes a virtual object.
  • the virtual objects can be digital people, virtual customer service, and virtual anchors.
  • the second object may also be a real person object.
  • the action migration method of this embodiment can be applied to the action driving and generation of digital people, virtual customer service, and virtual anchors, thereby improving the fidelity and richness of virtual character actions and improving the user experience.
• this application also designs a discriminator specifically for the face area to generate realistic faces.
• The face regions of the source image and the action migration image are input into this discriminator to train the model, making the foreground faces generated by the generator more realistic.
• Given a source image I_s and a driving video {I_d^t}, where I_d^t represents the video frame of the driving video at time t,
• the goal of this embodiment is to generate a new video in which
• the person in the source image performs the actions of the person in the driving video.
• The entire scheme can be formulated as {Î_t} = {F(I_s, I_d^t)}, t = 1, ..., N,
• where F(·,·) represents the generation model in this embodiment
• and N represents the number of driving video frames.
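• Framed as code, the scheme is a per-frame loop; `model` below stands in for the whole pipeline (Layout GAN, Region GAN, then the overall synthesis network) and is an assumed callable, not an API from the application:

```python
def transfer_video(model, source_image, driving_frames):
    """{I_hat_t} = {F(I_s, I_d^t)} for t = 1..N."""
    return [model(source_image, frame) for frame in driving_frames]
```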
• this embodiment selects a forward-facing human video frame in a video as the source image, and this video is used as the driving video.
• The forward-facing human video frame is chosen as the source image because it contains more appearance details of the source character.
• This embodiment adopts a step-by-step training strategy: first, the Layout GAN and the Region GAN are each trained for 10 epochs; then the output of the Region GAN is used to train the overall synthesis network for 10 epochs.
• the selection of the driving video is not restricted, as long as it is a clear motion video of a single person, and this embodiment can perform end-to-end inference.
  • the generation framework proposed in this embodiment has achieved the best or equivalent results on two public datasets (iPER and SoloDance datasets).
  • the experimental results on the two datasets are shown in Tables 1 and 2, respectively, where Structural Similarity (SSIM) and Peak Signal to Noise Ratio (PSNR) are evaluation indicators based on similarity, and the larger the value, the better the quality of the generated image.
• Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) are evaluation indicators based on feature distance, and the smaller the value, the better the quality of the generated image.
• The Temporally Consistent Mode (TCM) indicator evaluates the temporal consistency of the generated video.
  • the experimental results of the embodiment of the present application have achieved the best results in all evaluation indicators on the iPER dataset.
• In Table 2, although the SSIM and PSNR indicators are not optimal, the corresponding Mask-SSIM and Mask-PSNR indicators are optimal (the values in brackets are obtained by setting the background area of the image to 0 through the human body mask and then calculating the SSIM and PSNR indicators). This shows that the quality of the human body images generated by this method is better than that of C2F.
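• The bracketed Mask-SSIM and Mask-PSNR values can be reproduced with the procedure just described: zero the background with the human-body mask, then compute the ordinary indicators. A sketch using scikit-image (an assumed tooling choice, not the application's evaluation code):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def masked_psnr_ssim(generated, reference, body_mask):
    """generated / reference: (H, W, 3) arrays in the 0-255 range;
    body_mask: (H, W) array of 0/1 selecting the human body."""
    g = generated * body_mask[..., None]   # background set to 0
    r = reference * body_mask[..., None]
    psnr = peak_signal_noise_ratio(r, g, data_range=255)
    ssim = structural_similarity(r, g, channel_axis=-1, data_range=255)
    return psnr, ssim
```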
  • this application can better handle the situation of drastic changes in posture while retaining the appearance details of the source character.
  • the motion transfer images generated by this application generally have clearer facial details. This is due to the progressive generative model in this application, where the initial face region image provides an important template for the final clear face.
  • the embodiment of the present application adds the technical details of determining the second background image based on the above embodiment.
  • the action migration method proposed in the embodiment of the present application and the above embodiment belongs to the same concept, and the technical details not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment has the same effect as the above embodiment.
  • FIG. 9 is a schematic structural diagram of a motion migration device provided by an embodiment of the present application.
  • the embodiment of the present application may be applicable to motion migration of objects in images or videos, such as human body motion migration.
• The action migration device provided by this application can implement the action migration method provided in any of the above embodiments.
  • the action migration device in the embodiment of the present application may include:
• the image acquisition module 710 is configured to acquire the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image, the key point connection map being used to characterize the driving posture of the first object; the first generation module 720 is configured to generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map; the second generation module 730 is configured to generate a second foreground image of the second object under the driving posture according to the second segmentation maps of the multiple preset areas and the first foreground image of the source image; the synthesis module 740 is configured to fuse the second foreground image with the first background image of the source image to obtain the action migration image.
  • the action migration device further includes:
• the alignment parameter determination module is configured to determine the alignment parameters based on the first segmentation map and the second segmentation map; the image alignment module is configured to transform the first foreground image based on the alignment parameters to align the first foreground image with the second segmentation map.
  • the alignment parameters are determined according to the first segmentation map and the second segmentation map, including at least one of the following:
  • the scaling parameter is determined according to the size of the preset area in the first segmentation map and the second segmentation map; the displacement parameter is determined based on the center coordinates of the preset area in the first segmentation map and the second segmentation map.
  • the first generation module 720 includes:
• the first encoding unit is configured to encode the first segmentation map through the first encoder to obtain the first feature map; the second encoding unit is configured to encode the key point connection map through the second encoder to obtain the second feature map; the first decoding unit is configured to decode the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map.
  • the device further includes:
• the historical second segmentation map acquisition module is configured to acquire the historical second segmentation maps corresponding to a first preset number of video frames preceding the current video frame;
• the historical segmentation map encoding module is configured to encode at least one historical second segmentation map through a third encoder to obtain a third feature map;
  • the first decoding unit is also set to:
  • the fusion map of the first feature map, the second feature map and the third feature map is decoded by the first decoder to obtain a second segmentation map.
  • the device is further configured to:
• the second decoder decodes the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and weight parameters; and the second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
  • the training steps of the first generative adversarial network include:
  • the second generation module 730 includes:
• the fourth encoding unit is configured to encode the second segmentation maps of the plurality of preset areas through the fourth encoder to obtain the fourth feature map; the fifth encoding unit is configured to encode the first foreground image through the fifth encoder to obtain the fifth feature map; the third decoding unit is configured to decode the fusion map of the fourth feature map and the fifth feature map through the third decoder to obtain the second foreground image.
  • the fifth coding unit is also set to:
  • the training steps of the second generative adversarial network include:
  • the device further includes:
  • the texture enhancement parameter determination module is configured to determine the texture enhancement parameters based on the first foreground image and the second foreground image; the texture enhancement module is configured to perform texture enhancement on the second foreground image based on the texture enhancement parameters and the first foreground image.
  • the texture enhancement parameter determination module is set to:
• the first foreground image is encoded by the sixth encoder to obtain the sixth feature map; the second foreground image is encoded by the seventh encoder to obtain the seventh feature map; the sixth feature map and the seventh feature map are expanded along the channel dimension to obtain the eighth feature map and the ninth feature map respectively; and the correlation matrix of the eighth feature map and the ninth feature map is used as the texture enhancement parameter.
  • the texture enhancement module is configured to:
• the texture enhancement map is determined according to the eighth feature map and the texture enhancement parameters; the fusion map of the texture enhancement map and the ninth feature map is integrated along the channel dimension to obtain the tenth feature map; and the tenth feature map is decoded through the fourth decoder to obtain the texture-enhanced second foreground image.
  • synthesis module 740 is configured to:
• the pose mask map is determined based on the second segmentation map and the key point connection map; the second background image is determined based on the pose mask map and the first background image; and the second foreground image and the second background image are fused.
  • the second object includes a virtual object.
  • the action migration device provided by the embodiments of this application belongs to the same concept as the action migration method provided by the above embodiments.
• Technical details that are not described in detail in the embodiments of this application can be referred to the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
  • FIG10 is a schematic diagram of the hardware structure of a terminal device provided in an embodiment of the present application.
  • the terminal device 900 in the embodiment of the present application may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (Portable Media Players, PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital televisions (TVs), desktop computers, etc.
  • the terminal device 900 shown in FIG10 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
  • the terminal device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform a variety of appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 to a random access memory (RAM) 903.
  • the processing device 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904.
• the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; storage devices 908 including a magnetic tape, a hard disk, etc.; and a communication device 909.
  • the communication device 909 may allow the terminal device 900 to communicate wirelessly or wiredly with other devices to exchange data.
• Although FIG. 10 shows the terminal device 900 having various means, it is not required to implement or have all the illustrated means; more or fewer means may alternatively be implemented or provided.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present application include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 909, or from storage device 908, or from ROM 902.
• When the computer program is executed by the processing device 901, the above-mentioned functions provided by the embodiments of the present application or defined in the action migration method are performed.
• The terminal provided by the embodiment of the present application and the action migration method provided by the above embodiments belong to the same concept; technical details that are not described in detail in this embodiment can be referred to the above embodiments, and this embodiment has the same effect as the above embodiments.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored.
• When the program is executed by a processor, the action migration method provided by the above embodiments is implemented.
  • the computer-readable storage medium mentioned above in the embodiment of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • Examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM) or flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code contained on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communications network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the computer-readable storage medium may be included in the terminal device, or may exist independently without being installed in the terminal device.
  • the computer-readable medium carries one or more programs; when the one or more programs are executed by the terminal device, the terminal device is caused to: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object; generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; generate a second foreground image of the second object in the driving posture based on the second segmentation maps of the plurality of preset areas and the first foreground image of the source image; and fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or can be implemented using a combination of specialized hardware and computer instructions.
  • the units involved in the embodiments of this application can be implemented in software or hardware, where the name of a unit does not constitute a limitation on the unit itself.
  • exemplary hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

Provided in the present application are an action migration method and apparatus, and a terminal device and a storage medium. The method comprises: acquiring a key-point connection graph of a first object in a drive image and a first segmentation image of each preset area of a second object in a source image, wherein the key-point connection graph is used for representing a drive posture of the first object; according to the key-point connection graph and the first segmentation image, generating a second segmentation image, which conforms to the drive posture, of each preset area; generating a second foreground image of the second object in the drive posture according to the second segmentation images of a plurality of preset areas and a first foreground image of the source image; and fusing the second foreground image with a first background image of the source image to obtain an action migration image. By means of the technical solution, a realistic action migration image is obtained.

Description

Action migration method, apparatus, terminal device and storage medium
This application claims priority to Chinese patent application No. 202211154081.1, filed with the China Patent Office on September 21, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and for example to an action migration method and apparatus, a terminal device, and a storage medium.
Background
Action migration refers to generating a new video based on a source image and a driving video, where the new video contains the person from the source image performing the same actions as the person in the driving video.
In the related art, an affine transformation (which may be called a Warp operation) is usually performed on the source image or its encoded feature map according to the driving video to generate the new video.
The related art has at least the following technical problem:
when the postures of the person in the source image and the person in the driving video differ drastically, the affine transformation cannot be performed accurately, so the person in the generated video looks unrealistic, which seriously degrades the user's visual experience.
Summary
This application provides an action migration method and apparatus, a terminal device, and a storage medium, which can better adapt to scenes with drastic posture differences, ensure the realism of the person in the generated video, and improve the user's visual experience.
In a first aspect, this application provides an action migration method, including:
acquiring a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object;
generating, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture;
generating a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image; and
fusing the second foreground image with a first background image of the source image to obtain an action migration image.
In a second aspect, this application provides an action migration apparatus, including:
an image acquisition module configured to acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object;
a first generation module configured to generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture;
a second generation module configured to generate a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image; and
a synthesis module configured to fuse the second foreground image with a first background image of the source image to obtain an action migration image.
In a third aspect, this application provides a terminal device, including:
one or more processors; and
a memory configured to store one or more programs;
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the above action migration method.
In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above action migration method.
Brief Description of the Drawings
FIG. 1 is a flowchart of an action migration method provided by an embodiment of the present application;
FIG. 2 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 3 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 4 is a schematic architecture diagram of a local generation network provided by an embodiment of the present application;
FIG. 5 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 6 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 7 is a schematic architecture diagram of an overall synthesis network provided by an embodiment of the present application;
FIG. 8 is a flowchart of another action migration method provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an action migration apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
Detailed Description
The technical solutions of this application are described below through implementations with reference to the drawings in the embodiments of this application; the described embodiments are some of the embodiments of this application. The acquisition, storage, use, and processing of data in the technical solutions of this application comply with the relevant provisions of national laws and regulations.
FIG. 1 is a flowchart of an action migration method provided by an embodiment of the present application. The action migration method provided by this embodiment is applicable to migrating the posture of an object in images and/or videos, for example, human body action migration. The method may be executed by an action migration apparatus, which is implemented in software and/or hardware and, for example, configured in a terminal device such as a computer device.
As shown in FIG. 1, the action migration method provided in this embodiment of the application may include the following steps:
S110: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
In the embodiments of this application, action migration refers to the process of generating a new image based on a source image and a driving image, where the new image contains the second object from the source image performing the same action as the first object in the driving image. The driving image refers to an image with a driving posture. There are multiple driving images, and the actions or postures of the object in the multiple driving images change in association according to a preset order; for example, the driving images may be video frames of a driving video. The first object in the driving image refers to a person or another region of interest, which is not limited here. The key point connection map is a posture graph in which multiple key points of the first object are connected in a predefined manner, and it can be used to characterize the driving posture of the first object, where the key points may correspond to body parts of the first object. The source image refers to an image with the source posture of the second object, and the second object in the source image refers to a person or another region of interest. The first segmentation map refers to the segmentation map of each preset area of the second object, which may include, but is not limited to, body part area segmentation maps and a background area segmentation map; for example, the body parts of the second object may include, but are not limited to, the head, top, bottom, shoes, and limbs, which is not limited here. The number of channels of the first segmentation map can be determined according to the number of divided body parts, for example 18 channels, 6 channels, or 5 channels, which is not limited here.
The terms "first object" and "second object" are used to distinguish the objects in the source image and the driving image, and are not necessarily used to describe a specific order or sequence.
Exemplarily, the first object may be a driving person, the second object may be a source person, and a preset area may be a body part of a person or another region of interest. To represent the posture of the human body, the human body key point detection model OpenPose, an open-source two-dimensional human key point detection model, can be used to predict the driving image and obtain the two-dimensional key points of the driving person; the two-dimensional key points of the driving person are then connected according to the predefined connection scheme to obtain the key point connection map of the driving person, where the key point connection map may be a Red-Green-Blue (RGB) key point connection map and H×W denotes the resolution of the image. To represent the human body layout, this application uses the Self Correction for Human Parsing (SCHP) model to obtain an 18-channel semantic segmentation map of the source image; considering the texture characteristics of different parts of the human body, the 18 channels of the semantic segmentation map are merged into 6 channels, namely head, top, bottom, shoes, limbs, and background, thereby obtaining the first segmentation map of the preset areas of the source person in the source image.
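As a non-authoritative illustration of the channel-merging step, the sketch below collapses an 18-channel one-hot parsing map into the 6 region channels named above; the concrete grouping of parsing labels into regions is an assumption and depends on the label set of the parsing model actually used.

```python
import numpy as np

def merge_parsing_channels(parsing_18: np.ndarray, groups: dict) -> np.ndarray:
    """Merge an 18-channel one-hot parsing map of shape (18, H, W) into
    one channel per region by taking the union of the grouped channels."""
    h, w = parsing_18.shape[1:]
    merged = np.zeros((len(groups), h, w), dtype=parsing_18.dtype)
    for k, channel_ids in enumerate(groups.values()):
        merged[k] = parsing_18[channel_ids].max(axis=0)
    return merged

# Illustrative grouping only; the real channel indices are model-specific.
groups = {
    "head": [1, 2, 3], "top": [4, 5, 6], "bottom": [7, 8],
    "shoes": [9, 10], "limbs": [11, 12, 13, 14], "background": [0],
}
# first_segmentation = merge_parsing_channels(parsing_18, groups)  # (6, H, W)
```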
S120: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
In this embodiment, the second segmentation map refers to the preset-area parsing map of the second object in the driving posture. In other words, the second segmentation map contains the preset areas of the second object that conform to the driving posture.
The key point connection map and the first segmentation map can be input into a pre-trained first neural network model to obtain the second segmentation map, thereby realizing the transformation of the segmentation map of the second object from the source posture to the driving posture. The network structure of the first neural network model is not limited here; for example, the network structure may consist of at least one encoder and at least one decoder.
S130: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of a plurality of preset areas and a first foreground image of the source image.
In this embodiment, the second foreground image refers to the preset-area foreground image of the second object in the driving posture, where a foreground image refers to the object area of an image excluding the background. In other words, the second foreground image is composed of multiple preset-area foreground images of the second object that conform to the driving posture.
The second segmentation maps of multiple preset areas and the first foreground image of the source image can be input into a pre-trained second neural network model to obtain the preset-area foreground images of the second object in the driving posture, and the preset-area foreground images in the driving posture are combined to obtain the second foreground image, thereby realizing the transformation of the foreground image of the second object from the source posture to the driving posture. The network structure of the second neural network model is not limited here; for example, the network structure may consist of at least one encoder and at least one decoder.
S140: fuse the second foreground image with a first background image of the source image to obtain an action migration image.
In this embodiment, the action migration image refers to an image that takes the background of the source image as its background and contains the second object in the driving posture; that is, the action migration image is an image in which the posture of the first object has been migrated to the second object.
In some embodiments, the foreground of the second foreground image can be embedded at the corresponding position of the first background image of the source image to achieve image fusion. In some embodiments, the first foreground image and the second foreground image can also be texture-aligned first, and the foreground of the texture-aligned second foreground image can then be embedded at the corresponding position of the first background image of the source image to achieve image fusion; the image fusion method is not limited here.
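A minimal compositing sketch of this fusion step, assuming a soft foreground mask in [0, 1] is available alongside the generated foreground:

```python
import numpy as np

def fuse_foreground_background(fg: np.ndarray, fg_mask: np.ndarray,
                               bg: np.ndarray) -> np.ndarray:
    """Embed the generated foreground into the source background.

    fg and bg are (H, W, 3) float images; fg_mask is (H, W) in [0, 1].
    Mask-weighted blending places the foreground at its corresponding
    position and keeps the source background elsewhere."""
    m = fg_mask[..., None]  # broadcast the mask over the color channels
    return m * fg + (1.0 - m) * bg
```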
The action migration method provided by the embodiments of this application uses the key point connection map of the first object in the driving image and the first segmentation map of each preset area of the second object in the source image to obtain a second segmentation map of each preset area that conforms to the driving posture, realizing the transformation of the segmentation map of the second object from the source posture to the driving posture; it generates the second foreground image of the second object in the driving posture according to the second segmentation maps of multiple preset areas and the first foreground image of the source image, realizing the transformation of the foreground image of the second object from the source posture to the driving posture and assigning the texture of the second object in the source image to the second segmentation map; and it fuses the second foreground image with the first background image of the source image to obtain a realistic action migration image. Moreover, this application abandons the original Warp operation, can better adapt to scenes with posture differences, ensures the realism of the person in the generated video, and improves the user's visual experience.
Referring to FIG. 2, FIG. 2 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, after generating the second segmentation map of each preset area that conforms to the driving posture, the method further includes: determining alignment parameters according to the first segmentation map and the second segmentation map; and before generating the second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, the method further includes: transforming the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map.
As shown in FIG. 2, the method of this embodiment may include:
S210: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S220: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S230: determine alignment parameters according to the first segmentation map and the second segmentation map.
S240: transform the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map.
S250: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S260: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the alignment parameters refer to parameters used to align the first foreground image with the second segmentation map. Aligning the first foreground image with the second segmentation map through the alignment parameters can avoid situations in which the source person and the driving person differ greatly in size and spatial position.
In some implementations, determining the alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following: determining a scaling parameter according to the sizes of the preset areas in the first segmentation map and the second segmentation map; and determining a displacement parameter according to the center coordinates of the preset areas in the first segmentation map and the second segmentation map.
The scaling parameter is a parameter that controls how much the image is scaled. A first mask height is determined based on the first segmentation map, and a second mask height is determined based on the second segmentation map; the scaling parameter is determined based on the first mask height and the second mask height. The displacement parameter is a parameter that characterizes the positional offset of the image. A first mask center coordinate is determined based on the first segmentation map, and a second mask center coordinate is determined based on the second segmentation map; the displacement parameter is determined based on the first mask center coordinate and the second mask center coordinate.
Exemplarily, the scaling parameter can be calculated by the following formula:
R = H̃_d / H_s
where R denotes the scaling parameter, H̃_d denotes the height of the human body mask in the second segmentation map, and H_s denotes the height of the human body mask in the first segmentation map.
Similarly, the displacement parameter is:
c = [c_x, c_y]^T
where c denotes the displacement parameter, c_x denotes the horizontal difference between the center coordinate of the first mask and the center coordinate of the second mask, and c_y denotes the vertical difference between the center coordinate of the first mask and the center coordinate of the second mask.
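A sketch of how both parameters could be computed from the two person masks, assuming the mask height is taken from the bounding box and the mask center is the bounding-box center (both are assumptions for illustration):

```python
import numpy as np

def alignment_params(src_mask: np.ndarray, drv_mask: np.ndarray):
    """Return the scaling parameter R and displacement parameter c from
    two binary person masks of shape (H, W)."""
    def height_and_center(mask):
        ys, xs = np.nonzero(mask)
        height = ys.max() - ys.min() + 1
        center = np.array([(xs.min() + xs.max()) / 2.0,
                           (ys.min() + ys.max()) / 2.0])
        return height, center

    h_s, center_s = height_and_center(src_mask)
    h_d, center_d = height_and_center(drv_mask)
    R = h_d / h_s             # ratio of the two mask heights
    c = center_d - center_s   # c = [c_x, c_y], center offsets
    return R, c
```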
The first foreground image is transformed through the alignment parameters so that the first foreground image is aligned with the second segmentation map, which lays an image-quality foundation for subsequently generating the second foreground image of the second object in the driving posture based on the second segmentation maps of the multiple preset areas and the first foreground image, and improves the image quality of the second foreground image.
On the basis of the above embodiments, this embodiment of the application adds: determining the alignment parameters according to the first segmentation map and the second segmentation map, and transforming the first foreground image according to the alignment parameters so that the first foreground image is aligned with the second segmentation map. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 3, FIG. 3 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, the second segmentation map is generated by a first generative adversarial network, and the step of generating the second segmentation map by the first generative adversarial network includes: encoding the first segmentation map by a first encoder to obtain a first feature map; encoding the key point connection map by a second encoder to obtain a second feature map; and decoding a fusion map of the first feature map and the second feature map by a first decoder to obtain the second segmentation map.
As shown in FIG. 3, the method of this embodiment may include:
S310: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S320: encode the first segmentation map by a first encoder to obtain a first feature map.
S330: encode the key point connection map by a second encoder to obtain a second feature map.
S340: decode a fusion map of the first feature map and the second feature map by a first decoder to obtain a second segmentation map.
S350: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S360: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the second segmentation map may be generated by a first generative adversarial network, which may include a first encoder, a second encoder, and a first decoder. The first encoder is used to encode the first segmentation map, the second encoder is used to encode the key point connection map, and the first decoder is used to decode a fusion map of the encoding results of the first encoder and the second encoder, thereby obtaining a clearer and sharper second segmentation map. The terms "first encoder", "second encoder", and "first decoder" are used to distinguish the roles of the encoders and the decoder, and are not necessarily used to describe a specific order or sequence.
In some implementations, if the driving image is a video frame, the method may further include: acquiring historical second segmentation maps corresponding to a preset number of video frames preceding the current video frame; and encoding at least one historical second segmentation map by a third encoder to obtain a third feature map. Correspondingly, decoding the fusion map of the first feature map and the second feature map by the first decoder to obtain the second segmentation map includes: decoding a fusion map of the first feature map, the second feature map, and the third feature map by the first decoder to obtain the second segmentation map.
A historical second segmentation map refers to one or more second segmentation maps preceding the current video frame. The historical second segmentation maps are input into the first generative adversarial network so that the first generative adversarial network can effectively extract the relationships between different video frames, thereby improving the temporal consistency of the video.
On the basis of the above embodiments, after obtaining the third feature map, the method further includes: decoding a fusion map of the second feature map and the third feature map by a second decoder to obtain optical flow parameters and weight parameters; and after obtaining the second segmentation map, the method further includes: adjusting the second segmentation map according to the historical second segmentation map corresponding to the video frame preceding the current video frame, the optical flow parameters, and the weight parameters.
In this implementation, the second decoder is used to decode the fusion map of the second feature map and the third feature map to obtain the optical flow parameters and the weight parameters, where an optical flow parameter refers to the instantaneous velocity of pixel motion of a moving object on the observed imaging plane.
Exemplarily, FIG. 4 is a schematic architecture diagram of a local generation network provided by an embodiment of the present application. The local generation network is used to generate an image of each region of the source person in the driving posture. The local generation network includes the first generative adversarial network, which may be a Layout generative adversarial network (GAN) with a vid2vid framework. The first generative adversarial network includes three encoders and two decoders. The first encoder E_L^l, where the superscript l identifies modules belonging to the Layout GAN network, encodes the first segmentation map L_s to obtain the first feature map. The second encoder E_K^l encodes the key point connection maps {K_d^t, K_d^{t-1}, K_d^{t-2}} concatenated along the channel dimension to obtain the second feature map, where t denotes the current moment, and t-1 and t-2 denote the two consecutive historical moments before the current moment. The third encoder E_H^l encodes the two historical second segmentation maps {L̃_d^{t-1}, L̃_d^{t-2}} generated at previous moments to obtain the third feature map. The first decoder D_L^l decodes the summed features to obtain the raw result L̂_d^t; similarly, the second decoder D_O^l decodes the summed features to obtain the optical flow parameter O and its weight parameter w, from which the final second segmentation map L̃_d^t is obtained. The Layout GAN can be formulated as:
L̂_d^t = D_L^l( E_L^l(L_s) + E_K^l({K_d^t, K_d^{t-1}, K_d^{t-2}}) + E_H^l({L̃_d^{t-1}, L̃_d^{t-2}}) )
(O, w) = D_O^l( E_K^l({K_d^t, K_d^{t-1}, K_d^{t-2}}) + E_H^l({L̃_d^{t-1}, L̃_d^{t-2}}) )
L̃_d^t = w * Warp(L̃_d^{t-1}, O) + (1 - w) * L̂_d^t
where + denotes point-to-point addition, * denotes point-to-point multiplication, and {,} denotes that the inputs inside the braces are concatenated along the channel dimension. Warp(I, O) denotes the affine transformation of the image I according to the optical flow parameter O. The optical flow and Warp operations used in this implementation are based on adjacent frames and are intended to improve the temporal consistency of the generated video; they are not based on a transformation between the source image and the driving image.
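The flow-weighted blending in the last formula can be sketched as follows; the sketch assumes PyTorch tensors, a flow map holding per-pixel offsets, and a weight map w in [0, 1] (all shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def blend_with_previous(raw_layout: torch.Tensor,
                        prev_layout: torch.Tensor,
                        flow: torch.Tensor,
                        w: torch.Tensor) -> torch.Tensor:
    """Final layout = w * Warp(previous layout, flow) + (1 - w) * raw layout.

    raw_layout, prev_layout: (N, C, H, W); flow: (N, 2, H, W) pixel offsets;
    w: (N, 1, H, W) in [0, 1]."""
    n, _, h, wid = raw_layout.shape
    # Build a normalized sampling grid shifted by the optical flow.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(wid), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(flow)  # (1,2,H,W)
    coords = base + flow
    grid = torch.stack(
        (2.0 * coords[:, 0] / (wid - 1) - 1.0,
         2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)                  # (N,H,W,2)
    warped = F.grid_sample(prev_layout, grid, align_corners=True)
    return w * warped + (1.0 - w) * raw_layout
```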
In some implementations, the training of the first generative adversarial network may include: acquiring a third segmentation map of each preset area of the first object in a sample driving image; determining a first loss between the second segmentation map corresponding to a sample source image and the third segmentation map corresponding to the sample driving image; and training the first generative adversarial network according to the first loss.
The third segmentation map refers to the segmentation map of the preset areas of the first object in the driving image.
The first generative adversarial network can be trained in advance on multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data, that is, the person in a sample driving image and the person in the corresponding sample source image are the same person; for example, a front-facing human body video frame of a video is selected as the sample source image, and the video serves as the sample driving video. The front-facing human body video frame is selected as the source image because it contains more appearance details of the source person. When training the first generative adversarial network, the first loss may be a cross-entropy loss: the cross-entropy loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image is computed, and the network parameters of the first generative adversarial network are adjusted based on the cross-entropy loss so that it gradually decreases and stabilizes until training is complete, yielding the first generative adversarial network.
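A minimal sketch of one such training step under the cross-entropy loss; layout_gan is a hypothetical module returning per-region logits of shape (N, 6, H, W), and target_labels holds the region indices derived from the sample driving image's third segmentation map:

```python
import torch.nn.functional as F

def train_step(layout_gan, optimizer, source_layout, keypoint_maps, target_labels):
    """One optimization step for the first generative adversarial network."""
    optimizer.zero_grad()
    logits = layout_gan(source_layout, keypoint_maps)   # (N, 6, H, W)
    loss = F.cross_entropy(logits, target_labels)       # target: (N, H, W)
    loss.backward()
    optimizer.step()
    return loss.item()
```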
On the basis of the above embodiments, this embodiment of the application refines the technical features of generating the second segmentation map. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 5, FIG. 5 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, the second foreground image is generated by a second generative adversarial network, and the step of generating the second foreground image by the second generative adversarial network may include: encoding the second segmentation maps of the plurality of preset areas by a fourth encoder to obtain a fourth feature map; encoding the first foreground image by a fifth encoder to obtain a fifth feature map; and decoding a fusion map of the fourth feature map and the fifth feature map by a third decoder to obtain the second foreground image.
As shown in FIG. 5, the method of this embodiment may include:
S410: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S420: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S430: encode the second segmentation maps of the plurality of preset areas by a fourth encoder to obtain a fourth feature map.
S440: encode the first foreground image by a fifth encoder to obtain a fifth feature map.
S450: decode a fusion map of the fourth feature map and the fifth feature map by a third decoder to obtain a second foreground image.
S460: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
In this embodiment, the second foreground image is generated by a second generative adversarial network, which may include a fourth encoder, a fifth encoder, and a third decoder. The fourth encoder is used to encode the second segmentation maps of the plurality of preset areas, the fifth encoder is used to encode the first foreground image, and the third decoder is used to decode the fusion map of the fourth feature map and the fifth feature map, thereby obtaining the second foreground image. The terms "fourth encoder", "fifth encoder", and "third decoder" are used to distinguish the roles of the encoders and the decoder, and are not necessarily used to describe a specific order or sequence.
On the basis of the above embodiments, if the driving image is a video frame, encoding the first foreground image by the fifth encoder to obtain the fifth feature map may include: acquiring historical second foreground images corresponding to a preset number of video frames preceding the current video frame; and encoding a fusion map of the first foreground image and at least one historical second foreground image by the fifth encoder to obtain the fifth feature map.
A historical second foreground image refers to one or more second foreground images preceding the current video frame. The historical second foreground images are input into the second generative adversarial network so that the second generative adversarial network can effectively extract the relationships between different video frames, thereby improving the temporal consistency of the video.
Exemplarily, Figure 4 is an architectural schematic diagram of a local generation network provided by an embodiment of the present application; the local generation network also includes a second generation adversarial network. The second generation adversarial network in this embodiment can be a Region with a vid2vid framework. GAN network, Region GAN is only used to generate the initial region image, so this embodiment uses a generator to generate 5 regions of the human body (no background region is generated). Doing so not only saves computing resources, but also prevents the model from overfitting. The second generative adversarial network may include a fourth encoder fifth encoder and third decoder Among them, r is the identifier of Region GAN, which is used to indicate that it belongs to the Region GAN network. via fourth encoder For the current time t, the mask of the i-th preset area Encode to obtain the fourth feature map via fifth encoder Codes spliced along the channel dimension Get the fifth feature map in, Represents the historical second foreground image before the current time t, I s,i represents the first foreground image of the i-th preset region, so the original Region GAN can be expressed as:
其中,表示第三解码器,由于源人物和驱动人物在大小以及空间位置上可能差异巨大,因此编码得到的特征图存在不对齐的情况。为此,本申请提出使用全局对齐模块(GAM)对第一前景图像FGs进行仿射变换以匹配对应的第二分割图。首先通过Ls计算人体掩膜Ms进而可以得到源图像Is的第一前景图像FGs。全局对齐模块整个流程用以下公式表示:
in, Represents the third decoder. Since the source character and the driver character may differ greatly in size and spatial position, the encoded feature map and There is a misalignment. To this end, this application proposes to use a global alignment module (GAM) to perform affine transformation on the first foreground image FG s to match the corresponding second segmentation map. First through L s and Calculate the human mask M s and Then the first foreground image FG s of the source image I s can be obtained. The entire process of the global alignment module is expressed by the following formula:
其中,表示经全局对齐模块对齐后的第一前景图像。R表示缩放参数。对分割可以得到不同预设区域的Is,i。c=[cx,cy]T表示位移参数。因此最终的Region GAN可以表述为:
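A sketch of the GAM-style alignment as an affine warp, assuming PyTorch and that R and c are given in pixels; the exact sign and normalization conventions of the affine matrix are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def global_align(fg_s: torch.Tensor, R: float, c) -> torch.Tensor:
    """Scale the source foreground (N, 3, H, W) by R and translate it by
    c = (c_x, c_y) pixels so it roughly matches the driving-pose layout."""
    n, _, h, w = fg_s.shape
    cx, cy = float(c[0]), float(c[1])
    # grid_sample uses an inverse mapping in normalized [-1, 1] coordinates.
    theta = torch.tensor([[1.0 / R, 0.0, -2.0 * cx / (w - 1)],
                          [0.0, 1.0 / R, -2.0 * cy / (h - 1)]],
                         dtype=fg_s.dtype, device=fg_s.device)
    grid = F.affine_grid(theta.unsqueeze(0).expand(n, -1, -1),
                         list(fg_s.shape), align_corners=True)
    return F.grid_sample(fg_s, grid, align_corners=True)

# One shared generator is then run once per body region (5 regions, no background).
```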
in, Represents the first foreground image aligned by the global alignment module. R represents the scaling parameter. right Segmentation can obtain I s,i of different preset areas. c=[c x , cy ] T represents the displacement parameter. Therefore, the final Region GAN can be expressed as:
On the basis of the above embodiments, the training of the second generative adversarial network may include: acquiring a third segmentation map of each preset area of the first object in a sample driving image; determining a second loss between the second foreground image corresponding to a sample source image and the foreground ground-truth map corresponding to the sample source image; determining a third loss between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and training the second generative adversarial network according to the second loss and the third loss.
In this embodiment, the second loss may include a reconstruction loss and a perceptual loss; the third loss may be an adversarial loss, that is, an image distribution loss.
The second generative adversarial network can be trained in advance on multiple sample driving images and sample source images, where the sample driving images and sample source images can be paired training data. When training the second generative adversarial network, the reconstruction loss and the perceptual loss are computed between the second foreground image corresponding to the sample source image and the foreground ground-truth map corresponding to the sample source image, and the adversarial loss is computed between the second foreground image corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; the network parameters of the second generative adversarial network are adjusted based on the reconstruction loss, the perceptual loss, and the adversarial loss so that they gradually decrease and stabilize until training is complete, yielding the second generative adversarial network.
Exemplarily, this embodiment adopts the L1 reconstruction loss; compared with the L2 reconstruction loss, the L1 reconstruction loss pays more attention to the subtle differences between the generated image and the real image. It is calculated as:
L_rec = || FG̃_{d,i}^t - FG_{d,i}^t ||_1
where FG̃_{d,i}^t denotes the predicted value of the second foreground image of the i-th region in the driving posture at moment t, and FG_{d,i}^t denotes the actual value of the corresponding region of the driving image in the driving posture at moment t.
The perceptual loss is used to constrain the generated image and the real image to be close in a multi-dimensional feature space. The perceptual loss includes a feature content loss and a feature style loss, and can be expressed as:
L_per = Σ_j || φ_j(FG̃_{d,i}^t) - φ_j(FG_{d,i}^t) ||_1 + Σ_j || G(φ_j(FG̃_{d,i}^t)) - G(φ_j(FG_{d,i}^t)) ||_1
where φ_j denotes the j-th layer of the pre-trained Visual Geometry Group (VGG)-19 model, and G denotes computing the Gram matrix of a feature map.
The purpose of the adversarial loss is to make the synthesized image have a distribution similar to that of the real image. To make the network pay attention to multi-scale image details, this application uses the multi-scale conditional discriminator proposed in pix2pixHD, which takes the synthesized image and the corresponding region mask as input. Its expression is:
L_adv = E[ log D(FG_{d,i}^t, L̃_{d,i}^t) ] + E[ log(1 - D(FG̃_{d,i}^t, L̃_{d,i}^t)) ]
where D denotes the multi-scale conditional discriminator.
Therefore, the loss function of the generator is as follows:
L_G = L_adv + λ_rec · L_rec + λ_per · L_per
where λ_rec and λ_per are the weights of the reconstruction loss and the perceptual loss, respectively.
In addition, the training of the first generative adversarial network and the second generative adversarial network includes training the discriminator; for the discriminator, the loss function is:
L_D = -E[ log D(FG_{d,i}^t, L̃_{d,i}^t) ] - E[ log(1 - D(FG̃_{d,i}^t, L̃_{d,i}^t)) ]
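The loss terms above can be sketched as follows; vgg_feats is a hypothetical helper returning a list of VGG-19 feature maps, the discriminator is reduced to a single scale for brevity, and the weight values are assumptions:

```python
import torch
import torch.nn.functional as F

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map: (N, C, H, W) -> (N, C, C)."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(fake, real, fake_logits, vgg_feats,
                   lam_rec=10.0, lam_per=10.0):
    l_rec = F.l1_loss(fake, real)                        # L1 reconstruction
    l_per = sum(F.l1_loss(a, b) + F.l1_loss(gram(a), gram(b))
                for a, b in zip(vgg_feats(fake), vgg_feats(real)))
    l_adv = F.binary_cross_entropy_with_logits(          # fool the discriminator
        fake_logits, torch.ones_like(fake_logits))
    return l_adv + lam_rec * l_rec + lam_per * l_per

def discriminator_loss(real_logits, fake_logits):
    return (F.binary_cross_entropy_with_logits(real_logits,
                                               torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits,
                                                 torch.zeros_like(fake_logits)))
```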
In addition, the training process of the first generative adversarial network and the second generative adversarial network includes the training of the discriminator. For the discriminator, the loss function is:
On the basis of the above embodiments, this embodiment of the application adds the detailed features of determining the second foreground image. In addition, this embodiment of the application and the action migration methods proposed in the above embodiments belong to the same concept; technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Referring to FIG. 6, FIG. 6 is a flowchart of another action migration method provided by an embodiment of the present application. The method of this embodiment can be combined with the solutions of the action migration methods provided in the above embodiments. In the action migration method provided by this embodiment, after generating the second foreground image of the second object in the driving posture, the method further includes: determining texture enhancement parameters according to the first foreground image and the second foreground image; and performing texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
As shown in FIG. 6, the method of this embodiment may include:
S510: acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, where the key point connection map is used to characterize a driving posture of the first object.
S520: generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture.
S530: generate a second foreground image of the second object in the driving posture according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image.
S540: determine texture enhancement parameters according to the first foreground image and the second foreground image.
S550: perform texture enhancement on the second foreground image according to the texture enhancement parameters and the first foreground image.
S560: fuse the second foreground image with the first background image of the source image to obtain an action migration image.
本实施例中,纹理增强参数是指用于增强图像纹理的调节参数,通过纹理增强参数将第一前景图的特征与第二前景图的特征进行对齐,以保留更多的细节,如衣服的纹理和身体的边缘等。In this embodiment, the texture enhancement parameters refer to adjustment parameters used to enhance the image texture. The texture enhancement parameters are used to align the features of the first foreground image with the features of the second foreground image so as to retain more details, such as the texture of clothing and the edges of the body.
在一些实施方式中,根据第一前景图与第二前景图,确定纹理增强参数,可以包括:通过第六编码器对第一前景图进行编码,得到第六特征图;通过第七编码器对第二前景图进行编码,得到第七特征图;将第六特征图和第七特征图按通道展开,分别得到第八特征图和第九特征图;将第八特征图和第九特征图的相关性矩阵,作为纹理增强参数。In some embodiments, determining the texture enhancement parameters according to the first foreground image and the second foreground image may include: encoding the first foreground image through a sixth encoder to obtain a sixth feature map; encoding the second foreground image through a seventh encoder to obtain a seventh feature map; unfolding the sixth feature map and the seventh feature map per channel to obtain an eighth feature map and a ninth feature map, respectively; and taking the correlation matrix of the eighth feature map and the ninth feature map as the texture enhancement parameter.
示例性地,图7为本申请实施例提供的一种整体合成网络的架构示意图,整体合成网络用于将局部生成网络生成的不同区域图像进行整合,生成最终的动作迁移图像,与此同时为生成的动作迁移图像添加合适的背景图像,整体合成网络的损失为生成的动作迁移图像和驱动图像之间的损失。通过第六编码器对第一前景图进行编码,得到第六特征图F1;通过第七编码器对第二前景图(包含当前时刻的第二前景图以及当前时刻之前的历史第二前景图)进行编码,得到第七特征图F2。由于第二前景图和第一前景图具有相同的纹理、不同的姿态,因此本实施例提出纹理对齐模块(Texture Alignment Module,TAM)以更好地融合特征图。如图7所示,首先将特征图F1、F2分别按通道展开为H1和H2(维度为c×hw),c表示通道数。然后使用余弦距离计算这两个特征图间的相关性矩阵(即纹理增强参数),计算公式如下:
Exemplarily, Figure 7 is a schematic architectural diagram of an overall synthesis network provided by an embodiment of this application. The overall synthesis network integrates the region images generated by the local generation network into the final action migration image and, at the same time, adds an appropriate background image to the generated action migration image; the loss of the overall synthesis network is the loss between the generated action migration image and the driving image. The first foreground image is encoded by the sixth encoder to obtain a sixth feature map F1; the second foreground images (the second foreground image at the current moment together with the historical second foreground images before the current moment) are encoded by the seventh encoder to obtain a seventh feature map F2. Since the second foreground image has the same texture as the first foreground image but a different pose, this embodiment proposes a Texture Alignment Module (TAM) to better fuse the feature maps. As shown in Figure 7, the feature maps F1 and F2 are first unfolded per channel into H1 and H2 (of dimension c×hw), where c denotes the number of channels. The cosine distance is then used to compute the correlation matrix between the two feature maps (i.e., the texture enhancement parameter), calculated as follows:
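A plausible reconstruction of this cosine-similarity correlation, consistent with the symbols H_{1,i} and H_{2,j} explained in the next sentence, is:

C_{i,j} = \frac{H_{1,i} \cdot H_{2,j}^{\top}}{\left\| H_{1,i} \right\| \left\| H_{2,j} \right\|}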
其中,H1,i表示H1在i处的特征,同样地,H2,j表示H2在j处的特征。·表示矩阵相乘。Among them, H1,i represents the feature of H1 at position i, and similarly, H2,j represents the feature of H2 at position j. · denotes matrix multiplication.
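As a minimal PyTorch sketch of the correlation computation above (the function and tensor names are illustrative assumptions, not identifiers from this application):

import torch
import torch.nn.functional as F

def texture_alignment_correlation(f1, f2):
    # f1: sixth feature map (from the source foreground), f2: seventh
    # feature map (from the generated foregrounds); both (B, c, h, w).
    b, c, h, w = f1.shape
    h1 = f1.view(b, c, h * w)        # unfold per channel: (B, c, h*w)
    h2 = f2.view(b, c, h * w)
    h1 = F.normalize(h1, dim=1)      # L2-normalize each position so that
    h2 = F.normalize(h2, dim=1)      # the dot product equals cosine similarity
    # (B, h*w, h*w) correlation matrix, entry (i, j) = cos(H1_i, H2_j)
    return torch.bmm(h1.transpose(1, 2), h2)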
在一些实施方式中,根据纹理增强参数和第一前景图,对第二前景图进行纹理增强,可以包括:根据第八特征图和纹理增强参数,确定纹理增强图;将纹理增强图和第九特征图的融合图按通道整合,得到第十特征图;通过第四解码器对第十特征图进行解码,得到纹理增强后的第二前景图。In some embodiments, performing texture enhancement on the second foreground image according to the texture enhancement parameter and the first foreground image may include: determining a texture enhancement map according to the eighth feature map and the texture enhancement parameter; integrating the fusion of the texture enhancement map and the ninth feature map along the channel dimension to obtain a tenth feature map; and decoding the tenth feature map through a fourth decoder to obtain the texture-enhanced second foreground image.
示例性地,通过以下公式得到第十特征图
Illustratively, the tenth feature map is obtained by the following formula
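A plausible reconstruction, assuming the texture enhancement map is obtained by re-weighting the eighth feature map H1 with the correlation matrix C and concatenating it with the ninth feature map H2 along the channel dimension ([· ; ·] denotes channel-wise concatenation; these symbols are assumptions):

H_{10} = \big[\, H_1 \cdot C \;;\; H_2 \,\big]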
最后,通过第四解码器得到纹理增强后的第二前景图。因此,整个纹理增强后的第二前景图生成过程可以公式化为:
Finally, the texture-enhanced second foreground image is obtained through the fourth decoder. Therefore, the generation of the texture-enhanced second foreground image can be formulated as:
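Under the same assumptions, with D4 denoting the fourth decoder, a plausible form is:

\tilde{P} = D_4\big(\big[\, H_1 \cdot C \;;\; H_2 \,\big]\big)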
本申请实施例在上述实施例基础上,增加了纹理增强的技术特征,以实现更好地融合特征图。此外,本申请实施例与上述实施例提出的动作迁移方法属于同一构思,未在本实施例中详尽描述的技术细节可参见上述实施例,且本实施例与上述实施例具有相同的效果。Based on the above embodiments, the embodiments of the present application add the technical feature of texture enhancement to achieve better fusion of feature maps. In addition, the action migration method proposed in the embodiment of the present application and the above-mentioned embodiment belong to the same concept. Technical details that are not described in detail in this embodiment can be referred to the above-mentioned embodiment, and this embodiment has the same effect as the above-mentioned embodiment.
参考图8,图8为本申请实施例提供的另一种动作迁移方法的流程图,本实施例的方法可以与上述实施例提供的动作迁移方法中的多个方案结合。本实施例对所提供的动作迁移方法进行说明。将第二前景图与源图像的第一背景图进行融合,可以包括:根据第二分割图和关键点连接图,确定姿态掩膜图;根据姿态掩膜图和第一背景图,确定第二背景图;将第二前景图与第二背景图进行融合。Referring to Figure 8, Figure 8 is a flow chart of another action migration method provided by an embodiment of this application. The method of this embodiment can be combined with multiple solutions of the action migration methods provided in the above embodiments. Fusing the second foreground image with the first background image of the source image may include: determining a pose mask map based on the second segmentation map and the key point connection map; determining a second background image based on the pose mask map and the first background image; and fusing the second foreground image with the second background image.
如图8,本实施例的方法可以包括:As shown in Figure 8, the method in this embodiment may include:
S610、获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态。 S610. Obtain the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; the key point connection diagram is used to represent the driving posture of the first object.
S620、根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图。S620: Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map.
S630、根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图。S630. Generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image.
S640、根据第二分割图和关键点连接图,确定姿态掩膜图。S640. Determine a pose mask map based on the second segmentation map and the key point connection map.
S650、根据姿态掩膜图和第一背景图,确定第二背景图。S650. Determine the second background image based on the attitude mask image and the first background image.
S660、将第二前景图与第二背景图进行融合,得到动作迁移图像。S660: Fusion of the second foreground image and the second background image to obtain a motion migration image.
本实施例中,在得到驱动姿态下第二对象的第二前景图之后,需要为其添加一个合理的背景图,以得到更加真实的动作迁移图像。In this embodiment, after obtaining the second foreground image of the second object in the driving posture, it is necessary to add a reasonable background image to obtain a more realistic motion migration image.
将第二分割图和关键点连接图输入至第八编码器,得到第十一特征图,通过第五解码器对第十一特征图进行解码,得到姿态掩膜图,其中,姿态掩膜图是指包含第一对象姿态的软掩码图;通过姿态掩膜图对第一背景图对应位置进行遮蔽,以得到遮蔽了第一对象的第二背景图。The second segmentation map and the key point connection map are input to the eighth encoder to obtain an eleventh feature map, and the eleventh feature map is decoded by the fifth decoder to obtain a pose mask map, where the pose mask map refers to a soft mask map containing the pose of the first object; the corresponding position of the first background image is masked by the pose mask map to obtain a second background image in which the first object is masked.
在一些实施方式中,第五解码器对第七特征图和第十一特征图的融合图进行解码,得到姿态掩膜图。采用融合图的好处在于可以优化轮廓和姿态,得到更加准确的轮廓图。In some implementations, the fifth decoder decodes the fusion map of the seventh feature map and the eleventh feature map to obtain the pose mask map. The advantage of using fusion maps is that the outline and posture can be optimized to obtain a more accurate outline map.
示例性地,使用第八编码器对第二分割图和关键点连接图进行编码,得到特征图,其中,第二分割图包括当前时刻t的第二分割图和当前时刻t之前的两个历史第二分割图;关键点连接图包括当前时刻t的关键点连接图和当前时刻t之前的两个历史关键点连接图。将相加后的特征送入第五解码器,得到姿态掩膜图m,最后使用下面的公式得到最终的动作迁移图像:
Exemplarily, the eighth encoder is used to encode the second segmentation maps and the key point connection maps to obtain feature maps, where the second segmentation maps include the second segmentation map at the current time t and the two historical second segmentation maps before the current time t, and the key point connection maps include the key point connection map at the current time t and the two historical key point connection maps before the current time t. The summed features are fed into the fifth decoder to obtain a pose mask map m, and finally the final action migration image is obtained using the following formula:
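A plausible reconstruction of this fusion, assuming m is the pose mask, \tilde{P}_t the texture-enhanced second foreground image, \tilde{B} the second background image, and \odot element-wise multiplication (symbol names beyond m are assumptions), is:

\hat{I}_t = m \odot \tilde{P}_t + (1 - m) \odot \tilde{B}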
在上述实施例的基础上,第二对象包括虚拟对象。其中,虚拟对象可以为数字人、虚拟客服、虚拟主播。第二对象也可以为真实人物对象。On the basis of the above embodiments, the second object includes a virtual object. The virtual object may be a digital human, a virtual customer service agent, or a virtual streamer. The second object may also be a real person.
示例性地,本实施例的动作迁移方法可应用于数字人、虚拟客服、虚拟主播的动作驱动和生成,从而提高虚拟人物动作的逼真度和丰富度,提高用户的使用体验。Illustratively, the action migration method of this embodiment can be applied to the action driving and generation of digital people, virtual customer service, and virtual anchors, thereby improving the fidelity and richness of virtual character actions and improving the user experience.
此外,本申请还设计了一个专门针对人脸区域的鉴别器,以生成真实的人脸。训练过程中,将源图像和动作迁移图像的头像输入判别器,训练模型,促使生成器生成的前景人脸更真实。In addition, this application also designs a discriminator specifically for the face region to generate realistic faces. During training, the head images of the source image and the action migration image are input into the discriminator to train the model, prompting the generator to generate more realistic foreground faces.
示例性地,给定源图像I_s和驱动视频{I_t^d},t=1,…,N,其中,I_t^d表示驱动视频在t时刻的视频帧。本实施例的目标是生成一个新的视频,其中,源图像中的人物在做驱动视频中人物的动作。整个方案可以公式化为:
Exemplarily, given a source image I_s and a driving video {I_t^d}, t = 1, …, N, where I_t^d denotes the video frame of the driving video at time t, the goal of this embodiment is to generate a new video in which the person in the source image performs the actions of the person in the driving video. The entire scheme can be formulated as:
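A plausible reconstruction of the overall formulation, consistent with the symbols explained in the next sentence, is:

\{\hat{I}_t\}_{t=1}^{N} = \big\{\, F(I_s,\, I_t^{d}) \,\big\}_{t=1}^{N}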
其中,F(·,·)表示本实施例中的生成模型,N表示驱动视频帧的个数,Î_t表示生成的目标视频帧,其中源人物的姿态与驱动人物保持一致。Among them, F(·,·) represents the generation model in this embodiment, N represents the number of driving video frames, and Î_t represents the generated target video frame, in which the pose of the source person is consistent with that of the driving person.
在训练阶段,为了得到成对的训练数据作为监督信息,本实施例选择一段视频中的正向人体视频帧作为源图像,该视频作为驱动视频。选择正向人体视频帧作为源图像的原因是它包含了更多源人物的外观细节。本实施例采取分步骤训练的策略。首先,分别对Layout GAN和Region GAN进行10轮的训练。然后利用Region GAN的输出对整体合成网络进行10轮的训练。In the training phase, in order to obtain paired training data as supervision information, this embodiment selects a forward human video frame in a video as the source image, and this video is used as the driving video. The reason for choosing the forward human video frame as the source image is that it contains more appearance details of the source character. This embodiment adopts a step-by-step training strategy. First, Layout GAN and Region GAN were trained for 10 rounds respectively. Then the output of Region GAN is used to train the overall synthesis network for 10 rounds.
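A minimal sketch of this staged schedule (the model objects and their train_step/generate methods are hypothetical placeholders, not APIs from this application):

def staged_training(layout_gan, region_gan, synthesis_net, loader, epochs=10):
    # Stage 1: train Layout GAN and Region GAN for 10 epochs each.
    for _ in range(epochs):
        for batch in loader:
            layout_gan.train_step(batch)       # hypothetical per-batch update
            region_gan.train_step(batch)
    # Stage 2: train the overall synthesis network for 10 epochs
    # on the region images produced by the trained Region GAN.
    for _ in range(epochs):
        for batch in loader:
            regions = region_gan.generate(batch)   # hypothetical inference call
            synthesis_net.train_step(batch, regions)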
在推断阶段,驱动视频的选择不受限制,只要是任何单人清晰的运动视频即可。并且本实施例可以执行端到端推断。In the inference stage, the selection of the driving video is not restricted, as long as it is any clear motion video of a single person. And this embodiment can perform end-to-end inference.
本实施例提出的生成框架在两个公开数据集上(iPER和SoloDance数据集)都取得了最优或是相当的结果。两个数据集上的实验结果分别如表1和表2所示,其中结构相似性(Structural Similarity,SSIM)和峰值信噪比(Peak Signal to Noise Ratio,PSNR)是基于相似性的评价指标,其值越大表示生成图像质量越好。学习感知图像块相似性(Learned Perceptual Image Patch Similarity,LPIPS)和Fréchet Inception距离(Fréchet Inception Distance,FID)是基于特征距离的评价指标,其值越小表示生成图像质量越好。时间一致模式(Temporally Consistent Mode,TCM)用于评价生成视频的时序一致性,越大越好。The generation framework proposed in this embodiment achieves the best or comparable results on two public datasets (the iPER and SoloDance datasets). The experimental results on the two datasets are shown in Tables 1 and 2, respectively, where Structural Similarity (SSIM) and Peak Signal to Noise Ratio (PSNR) are similarity-based evaluation metrics, for which larger values indicate better generated image quality. Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) are feature-distance-based evaluation metrics, for which smaller values indicate better generated image quality. The Temporally Consistent Mode (TCM) is used to evaluate the temporal consistency of the generated video; the larger, the better.
表1 数据集iPER上的实验结果
Table 1 Experimental results on the iPER dataset
表2 数据集SoloDance上的实验结果
Table 2 Experimental results on the SoloDance dataset
从表1可以看出,本申请实施例的实验效果在iPER数据集上所有评价指标均达到了最优。从表2可以看出,虽然在SSIM和PSNR指标上没有取得最优,但是在对应的Mask-SSIM和Mask-PSNR上取得了最优值(括号里的值,即通过人体掩膜将图像的背景区域置为0后计算的SSIM和PSNR指标)。这表明本申请的方法生成的人体图像质量与C2F相比更好。As can be seen from Table 1, the method of the embodiment of this application achieves the best results on all evaluation metrics on the iPER dataset. As can be seen from Table 2, although it is not optimal on the SSIM and PSNR metrics, it achieves the optimal values on the corresponding Mask-SSIM and Mask-PSNR metrics (the values in brackets, i.e., the SSIM and PSNR metrics computed after setting the background area of the image to 0 using the human body mask). This shows that the quality of the human body images generated by the method of this application is better than that of C2F.
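A minimal sketch of the Mask-SSIM / Mask-PSNR protocol described above, using scikit-image (the array shapes and names are illustrative assumptions):

import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def masked_metrics(generated, real, body_mask):
    # generated, real: uint8 RGB frames of shape (H, W, 3);
    # body_mask: binary human-body mask of shape (H, W), 1 on the person.
    g = generated * body_mask[..., None]   # zero out the background
    r = real * body_mask[..., None]
    ssim = structural_similarity(g, r, channel_axis=-1)
    psnr = peak_signal_noise_ratio(r, g)
    return ssim, psnr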
与其他方法相比,本申请可以更好地处理姿态剧烈变化的情况,同时保留源人物的外观细节。此外,本申请生成的动作迁移图像一般有更清晰的面部细节。这是由于本申请中的渐进式生成模型,其中初始的人脸区域图像为最终的清晰人脸提供了一个重要的模板。Compared with other methods, this application can better handle the situation of drastic changes in posture while retaining the appearance details of the source character. In addition, the motion transfer images generated by this application generally have clearer facial details. This is due to the progressive generative model in this application, where the initial face region image provides an important template for the final clear face.
本申请实施例在上述实施例基础上,增加了确定第二背景图的技术细节特征。此外,本申请实施例与上述实施例提出的动作迁移方法属于同一构思,未在本实施例中详尽描述的技术细节可参见上述实施例,且本实施例与上述实施例具有相同的效果。The embodiment of the present application adds the technical details of determining the second background image based on the above embodiment. In addition, the action migration method proposed in the embodiment of the present application and the above embodiment belongs to the same concept, and the technical details not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment has the same effect as the above embodiment.
图9为本申请实施例提供的一种动作迁移装置结构示意图,本申请实施例可适用于对图像或视频中对象进行动作迁移的情况,例如人体动作迁移的情况。通过本申请提供的动作迁移装置,可实现上述实施例提供的动作迁移方法。FIG. 9 is a schematic structural diagram of a motion migration device provided by an embodiment of the present application. The embodiment of the present application may be applicable to motion migration of objects in images or videos, such as human body motion migration. Through the action migration device provided by this application, the action migration method provided in the above embodiments can be implemented.
如图9所示,本申请实施例中动作迁移装置,可以包括:As shown in Figure 9, the action migration device in the embodiment of the present application may include:
图像获取模块710,设置为获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态;第一生成模块720,设置为根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图;第二生成模块730,设置为根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图;合成模块740,设置为将第二前景图与源图像的第一背景图进行融合,得到动作迁移图像。The image acquisition module 710 is configured to acquire a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, the key point connection map being used to characterize the driving posture of the first object; the first generation module 720 is configured to generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; the second generation module 730 is configured to generate, according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture; and the synthesis module 740 is configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
在一些实施方式中,动作迁移装置,还包括:In some embodiments, the action migration device further includes:
对齐参数确定模块,设置为根据第一分割图和第二分割图确定对齐参数;图像对齐模块,设置为根据对齐参数对第一前景图进行变换,以使第一前景图与第二分割图对齐。The alignment parameter determination module is configured to determine the alignment parameters based on the first segmentation map and the second segmentation map; the image alignment module is configured to transform the first foreground image based on the alignment parameters to align the first foreground image with the second segmentation map. .
在一些实施方式中,根据第一分割图和第二分割图确定对齐参数,包括下述至少一项:In some embodiments, the alignment parameters are determined according to the first segmentation map and the second segmentation map, including at least one of the following:
根据第一分割图和第二分割图中预设区域的尺寸,确定缩放参数;根据第一分割图和第二分割图中预设区域的中心坐标,确定位移参数。The scaling parameter is determined according to the size of the preset area in the first segmentation map and the second segmentation map; the displacement parameter is determined based on the center coordinates of the preset area in the first segmentation map and the second segmentation map.
在一些实施方式中,第一生成模块720,包括:In some implementations, the first generation module 720 includes:
第一编码单元,设置为通过第一编码器对第一分割图编码,得到第一特征图;第二编码单元,设置为通过第二编码器对关键点连接图编码,得到第二特征图;第一解码单元,设置为通过第一解码器对第一特征图和第二特征图的融合图进行解码,得到第二分割图。The first encoding unit is configured to encode the first segmentation map through the first encoder to obtain the first feature map; the second encoding unit is configured to encode the key point connection map through the second encoder to obtain the second feature map; The first decoding unit is configured to decode the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map.
在一些实施方式中,若驱动图像为视频帧,则装置还包括:In some implementations, if the driving image is a video frame, the device further includes:
历史第二分割图获取模块,设置为获取与当前视频帧的前预设数量个视频帧对应的历史第二分割图;历史分割图编码模块,设置为通过第三编码器对至少一个历史第二分割图进行编码,得到第三特征图;第一解码单元,还设置为:The historical second segmentation map acquisition module is configured to acquire the historical second segmentation map corresponding to the first preset number of video frames of the current video frame; the historical segmentation map encoding module is configured to use a third encoder to encode at least one historical second segmentation map. The segmented map is encoded to obtain the third feature map; the first decoding unit is also set to:
通过第一解码器对第一特征图、第二特征图和第三特征图的融合图进行解码,得到第二分割图。The fusion map of the first feature map, the second feature map and the third feature map is decoded by the first decoder to obtain a second segmentation map.
在一些实施方式中,装置,还设置为:In some embodiments, the device is further configured to:
通过第二解码器对第二特征图和第三特征图的融合图进行解码,得到光流参数和权重参数;根据与当前视频帧的前一视频帧对应的历史第二分割图、光流参数和权重参数,对第二分割图进行调整。The second decoder decodes the fusion map of the second feature map and the third feature map to obtain optical flow parameters and weight parameters; the second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameters, and the weight parameters.
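As a minimal PyTorch sketch of this flow-based adjustment (the names and the grid construction are illustrative assumptions, not from this application):

import torch
import torch.nn.functional as F

def adjust_segmentation(seg, seg_prev, flow, weight):
    # seg, seg_prev: (B, C, H, W) current and previous second segmentation maps;
    # flow: (B, 2, H, W) optical flow; weight: (B, 1, H, W) blending weights.
    b, _, h, w = seg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(seg.device)  # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)          # displace by flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0            # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(seg_prev, grid, align_corners=True)   # warp previous map
    return weight * warped + (1.0 - weight) * seg                # weighted blend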
在一些实施方式中,第一生成对抗网络的训练步骤,包括:In some embodiments, the training steps of the first generative adversarial network include:
获取样本驱动图像中第一对象的每个预设区域的第三分割图;确定与样本源图像对应的第二分割图和与样本驱动图像对应的第三分割图的第一损失;根据第一损失,对第一生成对抗网络进行训练。Obtain a third segmentation map of each preset area of the first object in the sample driving image; determine a first loss between the second segmentation map corresponding to the sample source image and the third segmentation map corresponding to the sample driving image; and train the first generative adversarial network according to the first loss.
在一些实施方式中,第二生成模块730,包括: In some implementations, the second generation module 730 includes:
第四编码单元,设置为通过第四编码器对多个预设区域的第二分割图进行编码,得到第四特征图;第五编码单元,设置为通过第五编码器对第一前景图进行编码,得到第五特征图;第三解码单元,设置为通过第三解码器对第四特征图和第五特征图的融合图进行解码,得到第二前景图。The fourth encoding unit is configured to encode the second segmentation maps of the plurality of preset areas through the fourth encoder to obtain a fourth feature map; the fifth encoding unit is configured to encode the first foreground image through the fifth encoder to obtain a fifth feature map; and the third decoding unit is configured to decode the fusion map of the fourth feature map and the fifth feature map through the third decoder to obtain the second foreground image.
在一些实施方式中,若驱动图像为视频帧,则第五编码单元,还设置为:In some implementations, if the driving image is a video frame, the fifth coding unit is also set to:
获取与当前视频帧的前预设数量个视频帧对应的历史第二前景图;通过第五编码器对第一前景图和至少一个历史第二前景图的融合图进行编码,得到第五特征图。Obtain the historical second foreground image corresponding to a preset number of video frames before the current video frame; use the fifth encoder to encode the fusion image of the first foreground image and at least one historical second foreground image to obtain a fifth feature map .
在一些实施方式中,第二生成对抗网络的训练步骤,包括:In some embodiments, the training steps of the second generative adversarial network include:
获取样本驱动图像中第一对象的每个预设区域的第三分割图;确定与样本源图像对应的第二前景图,和与样本源图像对应的前景真值图之间的第二损失;确定与样本源图像对应的第二前景图,和与样本驱动图像对应的第三分割图的第三损失;根据第二损失和第三损失,对第二生成对抗网络进行训练。Obtain a third segmentation map of each preset area of the first object in the sample-driven image; determine a second loss between the second foreground map corresponding to the sample source image and the foreground true value map corresponding to the sample source image; Determine the second foreground image corresponding to the sample source image and the third loss of the third segmentation image corresponding to the sample driven image; train the second generative adversarial network based on the second loss and the third loss.
在一些实施方式中,装置,还包括:In some embodiments, the device further includes:
纹理增强参数确定模块,设置为根据第一前景图与第二前景图,确定纹理增强参数;纹理增强模块,设置为根据纹理增强参数和第一前景图,对第二前景图进行纹理增强。The texture enhancement parameter determination module is configured to determine the texture enhancement parameters based on the first foreground image and the second foreground image; the texture enhancement module is configured to perform texture enhancement on the second foreground image based on the texture enhancement parameters and the first foreground image.
在一些实施方式中,纹理增强参数确定模块,设置为:In some implementations, the texture enhancement parameter determination module is set to:
通过第六编码器对第一前景图进行编码,得到第六特征图;通过第七编码器对第二前景图进行编码,得到第七特征图;将第六特征图和第七特征图按通道展开,分别得到第八特征图和第九特征图;将第八特征图和第九特征图的相关性矩阵,作为纹理增强参数。The first foreground image is encoded by the sixth encoder to obtain a sixth feature map; the second foreground image is encoded by the seventh encoder to obtain a seventh feature map; the sixth feature map and the seventh feature map are unfolded per channel to obtain an eighth feature map and a ninth feature map, respectively; and the correlation matrix of the eighth feature map and the ninth feature map is taken as the texture enhancement parameter.
在一些实施方式中,纹理增强模块,设置为:In some implementations, the texture enhancement module is configured to:
根据第八特征图和纹理增强参数,确定纹理增强图;将纹理增强图和第九特征图的融合图按通道整合,得到第十特征图;通过第四解码器对第十特征图进行解码,得到纹理增强后的第二前景图。The texture enhancement map is determined according to the eighth feature map and the texture enhancement parameter; the fusion map of the texture enhancement map and the ninth feature map is integrated along the channel dimension to obtain a tenth feature map; and the tenth feature map is decoded through the fourth decoder to obtain the texture-enhanced second foreground image.
在一些实施方式中,合成模块740,设置为:In some embodiments, synthesis module 740 is configured to:
根据第二分割图和关键点连接图,确定姿态掩膜图;根据姿态掩膜图和第一背景图,确定第二背景图;将第二前景图与第二背景图进行融合。The pose mask map is determined according to the second segmentation map and the key point connection map; the second background image is determined according to the pose mask map and the first background image; and the second foreground image is fused with the second background image.
在一些实施方式中,第二对象包括虚拟对象。In some implementations, the second object includes a virtual object.
本申请实施例提供的动作迁移装置,与上述实施例提供的动作迁移方法属于同一构思,未在本申请实施例中详尽描述的技术细节可参见上述实施例,并且本申请实施例与上述实施例具有相同的效果。The action migration device provided by the embodiments of this application belongs to the same concept as the action migration method provided by the above embodiments. Technical details that are not described in detail in the embodiments of this application can be found in the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
图10为本申请实施例提供的一种终端设备的硬件结构示意图。本申请实施例中的终端设备900可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、个人数字助理(Personal Digital Assistant,PDA)、平板电脑(Portable Android Device,PAD)、便携式多媒体播放器(Portable Media Player,PMP)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字电视(Television,TV)、台式计算机等等的固定终端。图10示出的终端设备900仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。FIG10 is a schematic diagram of the hardware structure of a terminal device provided in an embodiment of the present application. The terminal device 900 in the embodiment of the present application may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (Portable Media Players, PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital televisions (TVs), desktop computers, etc. The terminal device 900 shown in FIG10 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present application.
如图10所示,终端设备900可以包括处理装置(例如中央处理器、图形处理器等)901,其可以根据存储在只读存储器(Read-Only Memory,ROM)902中的程序或者从存储装置908加载到随机访问存储器(Random Access Memory,RAM)903中的程序而执行多种适当的动作和处理。在RAM903中,还存储有终端设备900操作所需的多种程序和数据。处理装置901、ROM 902以及RAM 903通过总线904彼此相连。输入/输出(Input/Output,I/O)接口905也连接至总线904。As shown in FIG. 10 , the terminal device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform a variety of appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 to a random access memory (RAM) 903. In the RAM 903, a variety of programs and data required for the operation of the terminal device 900 are also stored. The processing device 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
通常,以下装置可以连接至I/O接口905:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置906;包括例如液晶显示器(Liquid Crystal Display,LCD)、扬声器、振动器等的输出装置907;包括例如磁带、硬盘等的存储装置908;以及通信装置909。通信装置909可以允许终端设备900与其他设备进行无线或有线通信以交换数据。虽然图10示出了具有多种装置的终端设备900,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。Generally, the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; including, for example, a Liquid Crystal Display (LCD) , an output device 907 such as a speaker, a vibrator, etc.; a storage device 908 including a magnetic tape, a hard disk, etc.; and a communication device 909. The communication device 909 may allow the terminal device 900 to communicate wirelessly or wiredly with other devices to exchange data. Although FIG. 10 shows the terminal device 900 having various means, it is not required to implement or have all the illustrated means. More or fewer means may alternatively be implemented or provided.
根据本申请的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置909从网络上被下载和安装,或者从存储装置908被安装,或者从ROM 902被安装。在该计算机程序被处理装置901执行时,执行本申请实施例提供的动作迁移方法中限定的上述功能。According to embodiments of this application, the process described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of this application include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-described functions defined in the action migration method provided by the embodiments of this application are performed.
本申请实施例提供的终端与上述实施例提供的动作迁移方法属于同一构思,未在本申请实施例中详尽描述的技术细节可参见上述实施例,并且本申请实施例与上述实施例具有相同的效果。The terminal provided by the embodiments of this application and the action migration method provided by the above embodiments belong to the same concept. Technical details that are not described in detail in the embodiments of this application can be found in the above embodiments, and the embodiments of this application have the same effect as the above embodiments.
本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提供的动作迁移方法。Embodiments of this application provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the action migration method provided by the above embodiments is implemented.
本申请实施例上述的计算机可读存储介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、RAM、ROM、可擦式可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)或闪存(FLASH)、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请实施例中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请实施例中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、射频(Radio Frequency,RF)等等,或者上述的任意合适的组合。The computer-readable storage medium mentioned above in the embodiments of this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. Examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, RAM, ROM, an Erasable Programmable Read-Only Memory (EPROM) or flash memory (FLASH), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of this application, the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device. In the embodiments of this application, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
在一些实施方式中,客户端、服务器可以利用诸如超文本传输协议(HyperText Transfer Protocol,HTTP)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(Local Area Network,LAN)、广域网(Wide Area Network,WAN)、网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。In some embodiments, the client and the server can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
上述计算机可读存储介质可以是上述终端设备中所包含的,也可以是单独存在,而未装配入该终端设备中的。The computer-readable storage medium may be included in the terminal device, or may exist independently without being installed in the terminal device.
上述终端设备存储承载有一个或者多个程序,当上述一个或者多个程序被该终端设备执行时,使得该终端设备:The terminal device stores and carries one or more programs. When the one or more programs are executed by the terminal device, the terminal device:
获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;关键点连接图用于表征第一对象的驱动姿态;根据关键点连接图和第一分割图,生成每个预设区域的符合驱动姿态的第二分割图;根据多个预设区域的第二分割图和源图像的第一前景图,生成驱动姿态下第二对象的第二前景图;将第二前景图与源图像的第一背景图进行融合,得到动作迁移图像。Obtain a key point connection map of a first object in a driving image and a first segmentation map of each preset area of a second object in a source image, the key point connection map being used to characterize the driving posture of the first object; generate, according to the key point connection map and the first segmentation map, a second segmentation map of each preset area that conforms to the driving posture; generate, according to the second segmentation maps of the plurality of preset areas and the first foreground image of the source image, a second foreground image of the second object in the driving posture; and fuse the second foreground image with the first background image of the source image to obtain an action migration image.
可以以一种或多种程序设计语言或其组合来编写用于执行本申请的操作的计算机程序代码,上述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括LAN或WAN—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional Procedural programming language—such as "C" or a similar programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In situations involving remote computers, the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, through the Internet using an Internet service provider).
附图中的流程图和框图,图示了按照本申请多种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in the flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
描述于本申请实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在一种情况下并不构成对该单元本身的限定。The units involved in the embodiments of this application can be implemented in software or hardware. Among them, the name of a unit does not constitute a limitation on the unit itself.
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范样式的硬件逻辑部件包括:现场可编程门阵列(Field Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用标准产品(Application Specific Standard Parts,ASSP)、片上系统(System on Chip,SOC)、复杂可编程逻辑设备(Complex Programming Logic Device,CPLD)等等。 The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary hardware logic components that may be used include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.

Claims (16)

  1. 一种动作迁移方法,包括:An action migration method, including:
    获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;其中,所述关键点连接图用于表征所述第一对象的驱动姿态;Obtain the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; wherein the key point connection diagram is used to characterize the driving posture of the first object ;
    根据所述关键点连接图和所述第一分割图,生成所述每个预设区域的符合所述驱动姿态的第二分割图;Generate a second segmentation map of each preset area that conforms to the driving posture according to the key point connection map and the first segmentation map;
    根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图;Generate a second foreground image of the second object in the driving posture according to the second segmentation map of the plurality of preset areas and the first foreground image of the source image;
    将所述第二前景图与所述源图像的第一背景图进行融合,得到动作迁移图像。The second foreground image is fused with the first background image of the source image to obtain an action migration image.
  2. 根据权利要求1所述的方法,在所述生成所述每个预设区域的符合所述驱动姿态的第二分割图之后,还包括:The method according to claim 1, after generating the second segmentation map of each preset area that conforms to the driving posture, further comprises:
    根据所述第一分割图和所述第二分割图确定对齐参数;Determine alignment parameters according to the first segmentation map and the second segmentation map;
    在所述根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图之前,还包括:Before generating the second foreground image of the second object in the driving posture according to the second segmentation images of the plurality of preset areas and the first foreground image of the source image, the method further includes:
    根据所述对齐参数对所述第一前景图进行变换,以使所述第一前景图与所述第二分割图对齐。The first foreground image is transformed according to the alignment parameter to align the first foreground image with the second segmentation image.
  3. 根据权利要求2所述的方法,其中,所述根据所述第一分割图和所述第二分割图确定对齐参数,包括下述至少一项:The method according to claim 2, wherein the determining alignment parameters according to the first segmentation map and the second segmentation map includes at least one of the following:
    根据所述第一分割图和所述第二分割图中所述每个预设区域的尺寸,确定缩放参数;Determine scaling parameters according to the size of each preset area in the first segmentation map and the second segmentation map;
    根据所述第一分割图和所述第二分割图中所述每个预设区域的中心坐标,确定位移参数。The displacement parameter is determined according to the center coordinates of each preset area in the first segmentation map and the second segmentation map.
  4. 根据权利要求1所述的方法,其中,所述第二分割图通过第一生成对抗网络生成,且通过所述第一生成对抗网络生成所述第二分割图,包括:The method of claim 1, wherein the second segmentation map is generated by a first generative adversarial network, and the second segmentation map is generated by the first generative adversarial network, including:
    通过第一编码器对所述第一分割图编码,得到第一特征图;The first segmentation map is encoded by a first encoder to obtain a first feature map;
    通过第二编码器对所述关键点连接图编码,得到第二特征图;Encode the key point connection map through the second encoder to obtain a second feature map;
    通过第一解码器对所述第一特征图和所述第二特征图的融合图进行解码,得到所述第二分割图。The first decoder decodes the fusion map of the first feature map and the second feature map to obtain the second segmentation map.
  5. 根据权利要求4中所述的方法,其中,在所述驱动图像为视频帧的情况下,所述方法还包括:The method according to claim 4, wherein, in a case where the driving image is a video frame, the method further comprises:
    获取与当前视频帧的前预设数量个视频帧对应的历史第二分割图;Obtain the historical second segmentation map corresponding to the previous preset number of video frames of the current video frame;
    通过第三编码器对至少一个历史第二分割图进行编码,得到第三特征图;Encode at least one historical second segmentation map through a third encoder to obtain a third feature map;
    所述通过第一解码器对所述第一特征图和所述第二特征图的融合图进行解码,得到所述第二分割图,包括:Decoding the fusion map of the first feature map and the second feature map through the first decoder to obtain the second segmentation map includes:
    通过所述第一解码器对所述第一特征图、所述第二特征图和所述第三特征图的融合图进行解码,得到所述第二分割图。The first decoder decodes a fusion map of the first feature map, the second feature map, and the third feature map to obtain the second segmentation map.
  6. 根据权利要求5中所述的方法,在所述得到第三特征图之后,还包括:The method according to claim 5, after obtaining the third feature map, further comprising:
    通过第二解码器对所述第二特征图和所述第三特征图的融合图进行解码,得到光流参数和权重参数;The second decoder decodes the fusion map of the second feature map and the third feature map to obtain optical flow parameters and weight parameters;
    在所述得到第二分割图之后,还包括:After the second segmentation map is obtained, the method further includes:
    根据与所述当前视频帧的前一视频帧对应的历史第二分割图、所述光流参数和所述权重参数,对所述第二分割图进行调整。The second segmentation map is adjusted according to the historical second segmentation map corresponding to the previous video frame of the current video frame, the optical flow parameter and the weight parameter.
  7. 根据权利要求4中所述的方法,其中,所述第一生成对抗网络的训练方式,包括:The method according to claim 4, wherein the training method of the first generative adversarial network includes:
    获取样本驱动图像中第一对象的每个预设区域的第三分割图;Obtaining a third segmentation map of each preset area of the first object in the sample-driven image;
    确定与样本源图像对应的第二分割图,和与所述样本驱动图像对应的第三分割图的第一损失;determining a second segmentation map corresponding to the sample source image and a first loss of a third segmentation map corresponding to the sample driving image;
    根据所述第一损失,对所述第一生成对抗网络进行训练。The first generative adversarial network is trained according to the first loss.
  8. 根据权利要求1所述的方法,其中,所述第二前景图通过第二生成对抗网络生成,且通过所述第二生成对抗网络生成所述第二前景图,包括:The method of claim 1, wherein the second foreground image is generated by a second generative adversarial network, and the second foreground image is generated by the second generative adversarial network, including:
    通过第四编码器对所述多个预设区域的第二分割图进行编码,得到第四特征图;Encode the second segmentation maps of the plurality of preset areas through a fourth encoder to obtain a fourth feature map;
    通过第五编码器对所述第一前景图进行编码,得到第五特征图;The first foreground image is encoded by a fifth encoder to obtain a fifth feature map;
    通过第三解码器对所述第四特征图和所述第五特征图的融合图进行解码,得到所述第二前景图。The third decoder decodes the fused image of the fourth feature map and the fifth feature map to obtain the second foreground image.
  9. 根据权利要求8中所述的方法,其中,在所述驱动图像为视频帧的情况下,所述通过第五编码器对所述第一前景图进行编码,得到第五特征图,包括:The method according to claim 8, wherein when the driving image is a video frame, the first foreground image is encoded by a fifth encoder to obtain a fifth feature map, including:
    获取与当前视频帧的前预设数量个视频帧对应的历史第二前景图;Obtain the historical second foreground image corresponding to the previous preset number of video frames of the current video frame;
    通过所述第五编码器对所述第一前景图和至少一个历史第二前景图的融合图进行编码,得到第五特征图。The fusion map of the first foreground image and at least one historical second foreground image is encoded by the fifth encoder to obtain a fifth feature map.
  10. 根据权利要求8所述的方法,其中,所述第二生成对抗网络的训练方式,包括:The method according to claim 8, wherein the training method of the second generative adversarial network includes:
    获取样本驱动图像中第一对象的每个预设区域的第三分割图;Obtaining a third segmentation map of each preset area of the first object in the sample-driven image;
    确定与样本源图像对应的第二前景图,和与所述样本源图像对应的前景真值图之间的第二损失;Determine a second loss between the second foreground image corresponding to the sample source image and the foreground ground truth map corresponding to the sample source image;
    确定与样本源图像对应的第二前景图,和与所述样本驱动图像对应的第三分割图的第三损失;Determining a second foreground image corresponding to the sample source image, and a third loss of a third segmentation map corresponding to the sample driven image;
    根据所述第二损失和所述第三损失,对所述第二生成对抗网络进行训练。The second generative adversarial network is trained according to the second loss and the third loss.
  11. 根据权利要求1所述的方法,在所述生成所述驱动姿态下所述第二对象的第二前景图之后,还包括:The method according to claim 1, after generating the second foreground image of the second object in the driving posture, further comprising:
    根据所述第一前景图与所述第二前景图,确定纹理增强参数;Determine texture enhancement parameters according to the first foreground image and the second foreground image;
    根据所述纹理增强参数和所述第一前景图,对所述第二前景图进行纹理增强。Texture enhancement is performed on the second foreground image according to the texture enhancement parameter and the first foreground image.
  12. 根据权利要求1所述的方法,其中,所述将所述第二前景图与所述源图像的第一背景图进行融合,包括:The method according to claim 1, wherein said fusing the second foreground image with the first background image of the source image includes:
    根据所述第二分割图和所述关键点连接图,确定姿态掩膜图;A pose mask map is determined according to the second segmentation map and the key point connection map;
    根据所述姿态掩膜图和所述第一背景图,确定第二背景图;Determine a second background image according to the posture mask image and the first background image;
    将所述第二前景图与所述第二背景图进行融合。Fusion of the second foreground image and the second background image.
  13. 根据权利要求1-12中任一所述的方法,其中,所述第二对象包括虚拟对象。The method of any one of claims 1-12, wherein the second object includes a virtual object.
  14. 一种动作迁移装置,包括:A motion transfer device including:
    图像获取模块,设置为获取驱动图像中第一对象的关键点连接图和源图像中第二对象的每个预设区域的第一分割图;其中,所述关键点连接图用于表征所述第一对象的驱动姿态;The image acquisition module is configured to acquire the key point connection diagram of the first object in the driving image and the first segmentation diagram of each preset area of the second object in the source image; wherein the key point connection diagram is used to characterize the The driving posture of the first object;
    第一生成模块,设置为根据所述关键点连接图和所述第一分割图,生成所述每个预设区域的符合所述驱动姿态的第二分割图;A first generation module configured to generate a second segmentation map of each preset area that conforms to the driving posture based on the key point connection map and the first segmentation map;
    第二生成模块,设置为根据多个预设区域的第二分割图和所述源图像的第一前景图,生成所述驱动姿态下所述第二对象的第二前景图;A second generation module configured to generate a second foreground image of the second object in the driving posture based on the second segmentation images of the plurality of preset areas and the first foreground image of the source image;
    合成模块,设置为将所述第二前景图与所述源图像的第一背景图进行融合,得到动作迁移图像。a synthesis module configured to fuse the second foreground image with the first background image of the source image to obtain an action migration image.
  15. 一种终端设备,包括:A terminal device including:
    至少一个处理器;at least one processor;
    存储器,设置为存储至少一个程序;a memory configured to store at least one program;
    当所述至少一个程序被所述至少一个处理器执行,使得所述至少一个处理器实现如权利要求1-13中任一所述的动作迁移方法。When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the action migration method as described in any one of claims 1-13.
  16. 一种计算机可读存储介质,存储有计算机程序,所述程序被处理器执行时实现如权利要求1-13中任一所述的动作迁移方法。 A computer-readable storage medium stores a computer program. When the program is executed by a processor, the action migration method as described in any one of claims 1-13 is implemented.
PCT/CN2023/097712 2022-09-21 2023-06-01 Action migration method and apparatus, and terminal device and storage medium WO2024060669A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211154081.1 2022-09-21
CN202211154081.1A CN115471658A (en) 2022-09-21 2022-09-21 Action migration method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024060669A1 true WO2024060669A1 (en) 2024-03-28

Family

ID=84335422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097712 WO2024060669A1 (en) 2022-09-21 2023-06-01 Action migration method and apparatus, and terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN115471658A (en)
WO (1) WO2024060669A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471658A (en) * 2022-09-21 2022-12-13 北京京东尚科信息技术有限公司 Action migration method and device, terminal equipment and storage medium
CN116664603B (en) * 2023-07-31 2023-12-12 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN115471658A (en) * 2022-09-21 2022-12-13 北京京东尚科信息技术有限公司 Action migration method and device, terminal equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUI LI, LI TENG: "Pose-guided scene-preserving person video generation algorithm", JOURNAL OF GRAPHICS, vol. 41, no. 4, 22 July 2020 (2020-07-22), pages 539 - 547, XP093148839, ISSN: 2095-302X, DOI: 10.11996/JG.j.2095-302X.2020040539 *
李桂(LI, GUI): "高质量任意人体姿态图像视频生成研究 (Non-official translation: Research on High-Quality Arbitrary Human Body Pose Image/Video Generation)", 中国优秀硕士学位论文全文数据库信息科技辑 (INFORMATION AND TECHNOLOGY, CHINA MASTER'S THESES FULL-TEXT DATABASE), no. 7, 15 July 2020 (2020-07-15), pages 8 - 43, XP09314883, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN115471658A (en) 2022-12-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23866966

Country of ref document: EP

Kind code of ref document: A1