WO2022002032A1 - Image-driven model training and image generation

Image-driven model training and image generation

Info

Publication number
WO2022002032A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
initial
pixel
affine transformation
Prior art date
Application number
PCT/CN2021/103042
Other languages
English (en)
French (fr)
Inventor
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京灵汐科技有限公司
Publication of WO2022002032A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Definitions

  • the embodiments of the present disclosure relate to the field of artificial intelligence, and in particular, to an image-driven model training method, an image generation method, and a corresponding apparatus, device, and medium.
  • A single target face image and a driving video can be used, so that the target face simulates the expressions or actions of the person in the driving video.
  • A pose estimation algorithm can be used to extract the key point information of the driving video.
  • Model training can be realized through a Generative Adversarial Network (GAN), so that the trained model can make the target face simulate the expressions and movements of the person in the driving video.
  • Embodiments of the present disclosure provide an image-driven model training method, an image generation method, and a corresponding device, equipment, and medium, which can improve the accuracy of the human body occlusion relationship of the characters in the generated image, and improve the authenticity of the generated image.
  • Embodiments of the present disclosure provide an image-driven model training method, including: acquiring a first image frame and a second image frame; extracting an initial pose feature from the first image frame, extracting a target pose feature from the second image frame, and generating a local affine transformation matrix from the initial pose feature to the target pose feature; generating pixel motion data and pixel occlusion data according to the local affine transformation matrix and the first image frame; and training an image-driven model based on a deep learning model according to the first image frame, the pixel motion data, and the pixel occlusion data.
  • An embodiment of the present disclosure provides an image generation method, including: acquiring a person image; acquiring a target video frame in a specified video; and inputting the person image and the target video frame into a pre-trained image-driven model to obtain the character-driven image output by the image-driven model.
  • The image-driven model is generated by training with the image-driven model training method according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides an image-driven model training apparatus, including: an image acquisition module, configured to acquire a first image frame and a second image frame; a feature extraction module, configured to extract an initial pose feature from the first image frame, extract a target pose feature from the second image frame, and generate a local affine transformation matrix from the initial pose feature to the target pose feature; a data generation module, configured to generate pixel motion data and pixel occlusion data according to the local affine transformation matrix and the first image frame; and a model training module, configured to train an image-driven model based on a deep learning model according to the first image frame, the pixel motion data and the pixel occlusion data.
  • An embodiment of the present disclosure further provides an image generation device, including: a person image acquisition module, configured to acquire a person image; a target video frame acquisition module, configured to acquire a target video frame in a specified video; and a character-driven image generation module, configured to input the person image and the target video frame into a pre-trained image-driven model and acquire the character-driven image output by the image-driven model.
  • The image-driven model is generated by training with the image-driven model training method according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the image-driven model training method or the image generation method described in any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the image-driven model training method or the image generation method described in any one of the embodiments of the present disclosure is implemented.
  • The embodiments of the present disclosure use the first image frame together with the pixel motion data and pixel occlusion data associated with the driving information as training samples to train the image-driven model, so that the image-driven model can automatically learn occlusion features, thereby effectively improving the accuracy of the human body occlusion relationship of the character in the character-driven image output by the trained image-driven model and improving the authenticity of the character-driven image.
  • FIG. 1 is a flowchart of an image-driven model training method according to Embodiment 1 of the present disclosure.
  • FIG. 2A is a flowchart of an image-driven model training method according to Embodiment 2 of the present disclosure.
  • FIG. 2B is a schematic diagram of a local affine transformation matrix according to an embodiment of the present disclosure.
  • FIG. 3A is a flowchart of an image-driven model training method according to Embodiment 3 of the present disclosure.
  • FIG. 3B is a schematic diagram of a first image frame according to an embodiment of the present disclosure.
  • FIG. 3C is a schematic diagram of a second image frame according to an embodiment of the present disclosure.
  • FIG. 3D is a schematic diagram of an optical flow information graph according to an embodiment of the present disclosure.
  • FIG. 3E is a schematic diagram of a shadow map according to an embodiment of the present disclosure.
  • FIG. 3F is a schematic diagram of an application scenario for training an image-driven model according to Embodiment 3 of the present disclosure.
  • FIG. 4A is a flowchart of an image generation method according to Embodiment 4 of the present disclosure.
  • FIG. 4B is a schematic diagram of a character-driven image according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an image-driven model training apparatus according to Embodiment 5 of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an image generation apparatus according to Embodiment 6 of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a computer device according to Embodiment 7 of the present disclosure.
  • FIG. 1 is a flowchart of an image-driven model training method according to Embodiment 1 of the present disclosure.
  • This embodiment is applicable to training an image-driven model, and the image-driven model is used to make a character in a character image simulate the facial expressions and/or body movements of the character included in a specified video, that is, the character is driven to perform actions and/or expressions that match the specified video.
  • the method may be performed by the image-driven model training apparatus provided by the embodiment of the present disclosure.
  • the apparatus can be implemented in software and/or hardware, and can generally be integrated in computer equipment. As shown in FIG. 1 , the method of this embodiment specifically includes the following steps.
  • the first image frame and the second image frame may be obtained from the driving video.
  • the driving video includes a plurality of video frames that are consecutive in time sequence, and the video frames in the driving video are images reflecting the movement of the characters.
  • two still images may be acquired, each of which is an image reflecting the movement of a person.
  • a static image can be acquired as the first image frame, and the second image frame can be acquired from the driving video.
  • the person in the first image frame may be different from the person in the second image frame. To make it easier to train an image-driven model, the person in the first image frame can also be the same as the person in the second image frame.
  • the first image frame may be used as the initial character image
  • the second image frame may be used as the target character image to be simulated for the initial character image.
  • the first image frame includes a person image specifying a person
  • the second image frame includes a person image specifying a human pose.
  • the human body pose of the character included in the second image frame may be used as the target human body pose to be simulated by the character in the initial character image.
  • the second image frame may be any video frame in the driving video.
  • The first image frame and the second image frame are required to be not exactly the same.
  • the time stamps corresponding to the first image frame and the second image frame are separated by at least a set period of time, such as 1 minute.
  • the similarity value between the first image frame and the second image frame is smaller than the set threshold, that is, there is a certain difference between the first image frame and the second image frame.
  • the initial pose feature is used to characterize the feature of a person in the first image frame, and may include facial feature data and/or body feature data.
  • the target pose feature is used to characterize the feature of a person in the second image frame, and may include facial feature data and/or body feature data.
  • Affine transformation matrices are used to spatially transform one pixel matrix to form another pixel matrix.
  • the affine transformation matrix may be used to spatially transform a matrix including human pixels to form another matrix of human pixels.
  • the spatial transformation includes at least one of the following: linear transformation, rotational transformation, and translation transformation.
  • the local affine transformation matrix can perform affine transformation on the pixel matrix of the local area of the character.
  • the local area of a person represents a local area of a certain person, such as a left arm area, a right leg area or a head area, etc., and may even be a combination of multiple human body local areas of a certain person.
  • The local affine transformation matrix from the initial pose feature to the target pose feature is used to transform the human pixel matrix in the first image frame (hereinafter also referred to as the initial human pixel matrix) through affine transformation to form a target person pixel matrix matching the second image frame.
  • the affine transformation matrix may be determined according to the initial human pixel matrix and the matched target human pixel matrix, wherein the human pixels may be pixels representing key points of the human body.
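  • As an illustrative sketch that is not part of the original disclosure, the following Python/NumPy example shows one way such an affine transformation matrix could be estimated by least squares from matched initial and target human key point coordinates and then applied to initial character pixels; the point values are hypothetical.
```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares fit of a 2x3 affine matrix A such that dst ~= A @ [x, y, 1]^T.

    src_pts, dst_pts: (K, 2) arrays of matched key point coordinates.
    """
    ones = np.ones((src_pts.shape[0], 1))
    src_h = np.hstack([src_pts, ones])            # (K, 3) homogeneous source points
    # Solve src_h @ A.T = dst_pts for A (shape 2x3) in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(src_h, dst_pts, rcond=None)
    return A_T.T                                  # (2, 3) affine matrix

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to (N, 2) pixel coordinates."""
    ones = np.ones((pts.shape[0], 1))
    return np.hstack([pts, ones]) @ A.T

# Example: map key points of the initial pose onto the target pose (hypothetical values).
initial_kps = np.array([[10.0, 20.0], [30.0, 25.0], [22.0, 40.0]])
target_kps  = np.array([[12.0, 18.0], [33.0, 24.0], [25.0, 38.0]])
A = estimate_affine(initial_kps, target_kps)
moved = apply_affine(A, initial_kps)              # approximately equals target_kps
```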
  • the pixel motion data may represent the motion of a pixel (eg, a pixel associated with a person) to a specified pixel location (eg, a pixel location associated with a specified human pose).
  • the pixel occlusion data may represent the occlusion relationship between multiple pixels moving to the same pixel position during the process of moving the pixel to the specified pixel position.
  • The motion direction (transformation vector) of the human pixels in the first image frame can be determined as the pixel motion data associated with the driving information, and the front-to-back occlusion order of the pixels moving to the same pixel position can be determined as the pixel occlusion data associated with the driving information.
  • the image driving model is used to drive the character to make a specified human posture, which can be understood as moving the pixels associated with the character to the specified pixel position to form the specified human posture of the character.
  • the moving direction and the moving distance need to be determined.
  • the pixel motion data may include moving direction and/or moving distance, and the like.
  • the pixel occlusion data may include occlusion relationships of key points.
  • the image-driven model can accurately adjust the person in the first image frame to the human pose specified by the second image frame according to the target pose feature in the second image frame, and generate corresponding character-driven imagery.
  • The image-driven model based on deep learning is trained so that the image-driven model can learn the pixel motion data and pixel occlusion data from the target pose feature and, from the process of generating a character-driven image simulating the human body posture specified by the second image frame according to the pixel motion data, the pixel occlusion data and the first image frame, automatically learn how to generate, from a character image, a character-driven image that simulates the human body posture specified by a driving video.
  • The trained image-driven model is an end-to-end model, which can avoid image preprocessing operations, greatly simplify the model training process, and reduce the errors introduced by multi-step image processing, thereby improving the accuracy of the character-driven images generated by using the trained image-driven model.
  • The image-driven model can be trained by using the first image frame and the pixel motion data and pixel occlusion data associated with the driving information as training samples, so that the image-driven model can automatically learn occlusion features, thereby effectively improving the accuracy of the human body occlusion relationship of the character in the character-driven image output by the trained image-driven model and improving the authenticity of the character-driven image.
  • FIG. 2A is a flowchart of an image-driven model training method according to Embodiment 2 of the present disclosure. This embodiment is embodied on the basis of the above-mentioned embodiment. The method of this embodiment specifically includes:
  • S202 Acquire a first image frame and a second image frame in the driving video.
  • the first image frame and the second image frame are different video frames
  • the person image included in the first image frame may be referred to as an initial character image
  • The human body pose of the character image included in the second image frame may be referred to as a specified human body pose.
  • S203 Input the first image frame into a key point detection model, and obtain a plurality of initial character key points output by the key point detection model and a heat map corresponding to each of the initial character key points.
  • the keypoint detection model is used to detect human keypoints in human images and generate a heat map.
  • A heat map can use color changes to reflect data information in a two-dimensional matrix or table, and can intuitively represent a certain attribute of data values (such as magnitude or density) through color depth.
  • the initial character keypoints may be human body keypoints of the character in the first image frame.
  • the corresponding heatmap is used to describe the probability that the initial character keypoint is located at each position in the first image frame.
  • the keypoint detection model includes a U-shaped network (U-Net).
  • U-Net can include encoder and decoder.
  • the encoder can include four sub-modules, each sub-module includes two convolutional layers.
  • Each sub-module is respectively connected with a down-sampling layer, and the down-sampling layer can be realized by a max-pooling network, that is, the output result of each sub-module is input to the down-sampling layer for down-sampling.
  • the data goes through the downsampling layers sequentially, and the resolution decreases sequentially.
  • the decoder may include four sub-modules, each connected to an upsampling layer.
  • U-Net uses skip connections to connect the upsampled output of a sub-module in the decoder with the output of a sub-module of the same resolution in the encoder as the input of the next sub-module in the decoder.
  • U-Net combines shallow feature maps with deep feature maps, which can combine local position features (where) and global content features (what), so that key point detection can be performed more accurately and the accuracy of key point detection is improved.
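  • The following PyTorch sketch is an illustrative assumption rather than code from the patent; it shows the U-Net layout described above, with four encoder sub-modules of two convolutional layers each, max-pooling downsampling, four decoder sub-modules with upsampling, and skip connections that concatenate encoder and decoder features of the same resolution (the channel widths and the number of output heat maps are placeholders).
```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One U-Net sub-module: two 3x3 convolutions with ReLU activations.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=10, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:                                  # four encoder sub-modules
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)                    # downsampling layer
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)
        self.ups = nn.ModuleList()
        self.decoders = nn.ModuleList()
        prev = chs[-1] * 2
        for c in reversed(chs):                        # four decoder sub-modules
            self.ups.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.decoders.append(conv_block(c * 2, c)) # skip connection doubles channels
            prev = c
        self.head = nn.Conv2d(prev, out_ch, 1)         # e.g. one heat map per key point

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                            # keep shallow features for skips
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([skip, x], dim=1))       # concatenate same-resolution features
        return self.head(x)

heatmaps = UNet(in_ch=3, out_ch=10)(torch.randn(1, 3, 256, 256))  # (1, 10, 256, 256)
```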
  • The probability of the initial character key point at each position in the first image frame can be determined according to the initial character key point, and a corresponding heat map can be generated according to the probability and the position of the key point. Since the shape of the heat map corresponding to each key point is different, the heat map corresponding to each key point can be uniformly transformed into a specified shape, and the affine transformation matrix used to transform the heat map corresponding to each key point can be used as the local affine transformation matrix corresponding to the initial character key point.
  • The generating of an initial local affine transformation matrix according to each of the initial character key points and the corresponding heat map includes: acquiring the coordinates of each of the initial character key points and the matching confidence; generating, according to the coordinates of each of the initial character key points and the matching confidence, a heat map region matching each of the initial character key points; for the heat map region matched by each initial character key point, converting the heat map region into a set regular shape, obtaining the local affine transformation matrix corresponding to the set regular shape, and determining it as the local affine transformation matrix corresponding to the initial character key point; and determining the initial local affine transformation matrix according to the local affine transformation matrix corresponding to each of the initial character key points.
  • U-Net or another regression algorithm such as the CPM (Convolutional Pose Machines) algorithm can be used to calculate the predicted coordinates of the initial character key point in the first image frame and the probability of the initial character key point at each position in the first image frame, and the confidence of the predicted coordinates is determined according to the predicted coordinates of the initial character key point in the first image frame and the probability of the initial character key point at the positions around the predicted coordinates.
  • the position with the highest probability of the initial character key point in the first image frame is determined as the predicted coordinate of the initial character key point in the first image frame.
  • a heat map centered on the initial character key point is generated.
  • the heatmap is used to represent the influence of the center point (ie, the predicted coordinate position with the highest probability) on the surrounding by color.
  • the coordinates of each key point and the confidence of the coordinates can be obtained, specifically (x1, y1, m1, n1).
  • (x1, y1) are the coordinates
  • m1 is the confidence of x1
  • n1 is the confidence of y1.
  • the value range of the confidence level can be [0,1].
  • A predetermined odd-numbered matrix (for example, a 3*3 matrix or a 5*5 matrix) may be generated in advance. For example, the predicted coordinate position with the highest probability is taken as the center of the matrix, bilinear interpolation is performed in the x-axis direction and the y-axis direction respectively according to the confidence corresponding to the predicted coordinate position, and the pixel color values of the interpolated coordinate points are used as elements in the matrix, thereby generating the odd-numbered matrix corresponding to the heat map.
  • The pixel color value of an interpolated coordinate point has a corresponding relationship with the distance between that coordinate point and the center point; for example, the closer the coordinate point is to the center point, the higher the red component of the color value.
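  • A minimal sketch of such a key point heat map patch, assuming (as a simplification of the interpolation scheme above) that the value of each element decays with its distance from the center and is scaled by the coordinate confidences; the Gaussian decay and the numeric inputs are illustrative choices, not taken from the patent.
```python
import numpy as np

def keypoint_heatmap(x, y, conf_x, conf_y, size=5):
    """Build an odd-sized (size x size) heat map patch centered on a predicted
    key point (x, y); values decay with distance from the center and are scaled
    by the coordinate confidences.
    """
    assert size % 2 == 1, "the patch is an odd-numbered matrix, e.g. 3x3 or 5x5"
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]        # offsets from the center
    sigma = 1.0
    patch = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))  # closer to center -> larger value
    patch *= 0.5 * (conf_x + conf_y)                         # scale by confidence in [0, 1]
    center = (int(round(y)), int(round(x)))                  # where the patch sits in the full map
    return patch, center

patch, center = keypoint_heatmap(x=120.3, y=84.7, conf_x=0.9, conf_y=0.8, size=5)
```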
  • Odd-numbered matrices are usually not directly usable for the affine transformation of the character pixel matrix. Therefore, by performing affine transformation on the odd-numbered matrix corresponding to the heat map, a matrix with a regular shape can be generated as the local affine transformation matrix corresponding to the initial character key point.
  • the set regular shape may be set as required. Exemplarily, the set regular shape may be a 2*3 matrix. In addition, there are other situations, which are not specifically limited in this embodiment of the present disclosure.
  • The transformation method from the odd-numbered matrix to the set regular-shaped matrix can be determined by specifying the mapping between the above-mentioned odd-numbered matrix and the set regular-shaped matrix.
  • an affine transformation matrix can be used, multiplied by the specified odd-numbered matrix, and the product is the set regular-shaped matrix.
  • the odd-numbered matrix corresponding to the heat map is multiplied by the affine transformation matrix, and the obtained product result is the local affine transformation matrix with regular shape corresponding to the initial character key point.
  • the initial local affine transformation matrix includes local affine transformation matrices corresponding to each of a plurality of initial character key points.
  • In the embodiments of the present disclosure, a heat map corresponding to each initial character key point is generated, and a local affine transformation matrix corresponding to each initial character key point is determined according to each heat map. This makes it possible to evaluate the prediction accuracy of the initial character key points relatively accurately and to effectively guide the image-driven model to learn the coordinates of the initial character key points relatively accurately, thereby effectively improving the recognition accuracy of the character key points by the trained image-driven model and, in turn, the accuracy of the character-driven image generated by using the image-driven model.
  • U-Net can also predict 4 scalar weighting values for each initial character key point and weight the regional confidence of the heat map corresponding to each initial character key point according to the scalar weighting values, finally obtaining, for each initial character key point, a local affine transformation matrix of a set regular shape, for example a 2*3 matrix.
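  • One possible reading of the 4 scalar values per key point (an assumption for illustration, consistent with first-order motion models but not spelled out in the text) is that they form the 2*2 linear part of the local affine transformation matrix while the key point coordinates supply the translation column, yielding the 2*3 set regular shape:
```python
import numpy as np

def local_affine_from_keypoint(kp_xy, four_scalars):
    """Assemble a 2x3 local affine matrix from a key point and 4 predicted scalars.

    Interpretation (an assumption for illustration): the 4 scalars form the 2x2
    linear part of the transform and the key point coordinates form the
    translation column, giving the 2x3 "set regular shape" matrix.
    """
    linear = np.asarray(four_scalars, dtype=float).reshape(2, 2)
    translation = np.asarray(kp_xy, dtype=float).reshape(2, 1)
    return np.hstack([linear, translation])        # shape (2, 3)

A_local = local_affine_from_keypoint(kp_xy=(64.0, 80.0),
                                     four_scalars=(1.05, 0.02, -0.01, 0.98))
```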
  • S205 Input the second image frame into the key point detection model, and obtain a plurality of target posture key points output by the key point detection model and a heat map corresponding to each of the target posture key points.
  • the target pose key points may be human body key points of the person in the second image frame.
  • the corresponding heat map is used to describe the probability that the target pose key points are located at each position in the second image frame.
  • the generation method of the target local affine transformation matrix is the same as the above-mentioned generation method of the initial local affine transformation matrix, and details are not repeated here.
  • S207 Multiply the initial local affine transformation matrix by the target local affine transformation matrix to obtain a local affine transformation matrix from the initial posture feature to the target posture feature.
  • the local affine transformation matrix is the result of multiplying the initial local affine transformation matrix by the target local affine transformation matrix.
  • Since a matrix can characterize image features, the initial local affine transformation matrix is used to describe or characterize the initial pose feature of the first image frame, and the target local affine transformation matrix is used to describe or characterize the target pose feature of the second image frame.
  • the multiplied local affine transformation matrix is used to describe or characterize the amount of change from the initial pose feature to the target pose feature. Therefore, according to the local affine transformation matrix, the human pixel matrix in the first image frame can be transformed to form a target human pixel matrix matching the human posture in the second image frame.
  • A local affine transformation matrix may be associated with a local region of the human body, e.g., a left arm region, a right arm region, a left leg region, or a right leg region.
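  • Two 2*3 matrices cannot be multiplied directly, so one reasonable way to realize the multiplication of the initial and target local affine transformation matrices (an assumption about the intended operation) is to lift both to 3*3 homogeneous form, multiply, and truncate back to 2*3, as in the following sketch:
```python
import numpy as np

def to_homogeneous(A_2x3):
    """Extend a 2x3 affine matrix to a 3x3 homogeneous matrix."""
    return np.vstack([A_2x3, [0.0, 0.0, 1.0]])

def compose_affines(A_initial, A_target):
    """Compose (i.e. 'multiply') two 2x3 local affine matrices.

    The 2x3 matrices are lifted to 3x3 homogeneous form, multiplied, and the
    result is truncated back to 2x3.
    """
    H = to_homogeneous(A_target) @ to_homogeneous(A_initial)
    return H[:2, :]                                  # back to the 2x3 form

A_init   = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]])   # hypothetical values
A_target = np.array([[0.9, 0.1, 2.0], [-0.1, 0.9, 4.0]])
A_local  = compose_affines(A_init, A_target)
```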
  • S208 Generate pixel motion data and pixel occlusion data according to the local affine transformation matrix and the first image frame.
  • The training of an image-driven model based on a deep learning model according to the first image frame, the pixel motion data, and the pixel occlusion data includes: calculating a loss function of the deep learning model according to loss function configuration information, where the loss function configuration information is used to add an isomorphic constraint function on the basis of the initial loss function of the deep learning model; if it is determined that the loss function satisfies the stability condition, determining the deep learning model obtained by the current training as the image-driven model; otherwise, returning to step S202 to perform the training of the image-driven model based on the deep learning model again.
  • the loss function configuration information is used to add an isomorphic constraint function based on the initial loss function of the image-driven model.
  • The isomorphic constraint function may include the Euclidean distance norm, which may also be called a regularization term or an L2 norm, and which refers to the square root of the sum of the squares of the elements.
  • Adding the Euclidean distance norm is equivalent to adding a constraint to the initial loss function: weight vectors with large values are heavily penalized in favor of more dispersed weight vectors, so as to achieve a more uniform distribution of weights and avoid weight concentration.
  • the image-driven model is made closer to the low-dimensional model. The lower the dimension, the smaller the amount of data used for training. Therefore, adding the Euclidean distance norm as a constraint to the initial loss function can reduce the amount of data used for image-driven model training, thereby reducing the complexity of image-driven model training.
  • the stability condition is used to judge whether the loss function tends to be stable and convergent.
  • The stability condition is used to judge whether the rate of change of the loss function in adjacent training rounds is smaller than a set rate-of-change threshold.
  • the size of the change rate threshold may be limited according to the actual situation.
  • The rate of change of the loss function in adjacent training rounds may be obtained as follows: the difference between the value of the loss function obtained by the current training and the value of the loss function obtained by the previous training is calculated, and the ratio of the difference to the value of the loss function obtained by the current training is calculated. If the ratio is smaller than the set rate-of-change threshold, it is determined that the rate of change of the loss function is small even after retraining, indicating that the loss function tends to be stable, that is, the loss function converges. At this point, it is determined that the training of the deep learning model is completed, and the deep learning model obtained by the current training is used as the image-driven model.
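  • A minimal sketch of the stability check described above (the threshold value is an arbitrary example):
```python
def loss_is_stable(prev_loss, curr_loss, rate_threshold=1e-3):
    """Stability condition: the relative change of the loss between adjacent
    training rounds is smaller than a set rate-of-change threshold."""
    if prev_loss is None:                      # first round: nothing to compare against
        return False
    change_rate = abs(curr_loss - prev_loss) / max(abs(curr_loss), 1e-12)
    return change_rate < rate_threshold

print(loss_is_stable(0.10000, 0.09999))        # True: the loss has effectively converged
print(loss_is_stable(0.20000, 0.10000))        # False: the loss is still changing quickly
```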
  • The updated loss function LOSS_new, obtained by adding the Euclidean distance norm to the initial loss function of the image-driven model, can be shown in formula (1):

    LOSS_new = LOSS + L_eqv    (1)

  • where LOSS is the initial loss function of the image-driven model, and L_eqv is the isomorphic constraint function.
  • the isomorphic constraint function is determined according to the difference between the coordinates of the initial character key points after spatial transformation and the coordinates of the expected key points.
  • The isomorphic constraint function L_eqv can be shown in formula (2):

    L_eqv = Σ_{k=1}^{K} ‖ g(x′_k, y′_k) − (x_k, y_k) ‖_F    (2)

  • where K is the number of initial character key points; (x′_k, y′_k) are the coordinates of the k-th initial character key point; (x_k, y_k) are the coordinates of the k-th desired key point; g(·) is the function used to transform the coordinates of the initial character key points, so that g(x′_k, y′_k) is the coordinates of the initial character key point (x′_k, y′_k) after coordinate transformation through g(·); and ‖·‖_F is the F-norm.
  • the desired key point may be configured as a key point approaching the target pose key point, that is, the desired key point may be a relay in the process of transforming the initial character key point into the target pose key point.
  • the desired key point may be obtained by the desired transformation of the initial character key point, and the desired transformation may be a transformation obtained by performing amplitude limiting on the local affine transformation matrix.
  • g(*) can be understood as a randomly created Thin Plate Spline (TPS), which can use random translation, rotation and scaling to determine the global affine component of TPS, and spatially perturb a set of control points to determine the local TPS components.
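  • A simplified sketch of the isomorphic (equivariance) constraint idea: key points detected on a randomly transformed image should coincide with the transformed key points of the original image. Here a random affine transform stands in for the random thin plate spline, the detector is an assumed callable returning normalized key point coordinates, and the loss form is a plain Euclidean distance rather than the exact formula (2).
```python
import torch
import torch.nn.functional as F

def equivariance_loss(detector, image, theta):
    """Equivariance (isomorphic) constraint sketch.

    detector(image) -> (B, K, 2) key point coordinates in normalized [-1, 1] space.
    theta           -> (B, 2, 3) random affine matrices standing in for a random
                       thin plate spline transform.
    """
    # Warp the image: grid_sample looks up source pixels at theta * output coords,
    # so a key point found at p in the warped image lies at theta * p in the original.
    grid = F.affine_grid(theta, list(image.shape), align_corners=False)
    warped = F.grid_sample(image, grid, align_corners=False)

    kp_orig = detector(image)                              # key points of the original image
    kp_warp = detector(warped)                             # key points of the warped image

    ones = torch.ones(kp_warp.shape[0], kp_warp.shape[1], 1, device=kp_warp.device)
    kp_warp_h = torch.cat([kp_warp, ones], dim=-1)         # (B, K, 3) homogeneous coordinates
    kp_back = kp_warp_h @ theta.transpose(1, 2)            # map warped key points back

    # The mapped-back key points should match the original key points.
    return (kp_back - kp_orig).norm(dim=-1).mean()
```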
  • the inventors of the present disclosure have found that, compared with the supervised method, the use of the self-supervised method to train the key point detection model will inevitably lead to instability and even inaccuracy of the key points.
  • the consistency of the semantic information of the image can be guaranteed by adding an isomorphic constraint function on the basis of the initial loss function. For example, in the image after the action transformation, the arms and legs of the characters will not be dislocated.
  • The distribution of the weights of the vectors can be made more uniform and weight concentration on a few vectors can be avoided, which can not only reduce the amount of data and the computational complexity required to train the image-driven model, but also enable the image-driven model to automatically learn how to accurately establish the correspondence between the key points of the character image and the key points of the expected character-driven image, thereby effectively improving the accuracy and fidelity of the human structures in the character-driven images generated by the trained image-driven model.
  • The local affine transformation matrix is generated by directly acquiring the human body key points and the corresponding heat maps in the first image frame and the second image frame and determining the pose features according to the human body key points and the corresponding heat maps. In this way, the pixel motion data and pixel occlusion data required to make the person in the first image frame simulate the human posture specified in the second image frame can be obtained, and the pixel motion data and pixel occlusion data can be combined with the first image frame to generate training samples for training image-driven models based on deep learning models. The demand for manually labeled samples can thus be effectively reduced, and the labor cost required for model training can be significantly reduced.
  • FIG. 3A is a flowchart of an image-driven model training method according to Embodiment 3 of the present disclosure. This embodiment is embodied on the basis of the above-mentioned embodiment. The method of this embodiment specifically includes:
  • S302 Acquire a first image frame and a second image frame in the driving video.
  • The first image frame and the second image frame are different video frames. The character image included in the first image frame may be called an initial character image and includes a plurality of initial character pixels; the human body posture of the character image included in the second image frame may be referred to as a designated human body posture or a target posture, and is associated with a plurality of desired target pixel positions.
  • S304 Input the local affine transformation matrix and the first image frame into a pre-trained dense motion estimation model, and obtain pixel motion data and pixel occlusion data output by the dense motion estimation model.
  • the dense motion estimation model includes a deep learning model.
  • The pixel motion data includes the motion direction of each initial human pixel in the first image frame pointing to the matching target pixel position in the second image frame, and the pixel occlusion data includes the occlusion order relationship between a plurality of initial character pixels in the first image frame when they are moved through affine transformation to the same target pixel position in the second image frame.
  • the dense motion estimation model is used to estimate the motion of each initial character pixel and the occlusion order of different initial character pixels after motion.
  • the dense motion estimation model may be a pretrained deep learning model.
  • the initial character pixels are the pixels of the character included in the first image frame.
  • the initial human pixels may include pixels representing human keypoints.
  • the target pixel position is the pixel position to which the initial character pixel is expected to be moved by simulating the human posture specified by the second image frame.
  • the target pixel location is not necessarily the location of the pixel included in the second image frame that matches the original character pixel.
  • the pixel motion data is used to determine the motion vector that transforms from the initial character pixel to the target pixel location.
  • the motion vector can be the direction and size from the initial character pixel point to the target pixel position, which can be specifically represented by an optical flow information graph.
  • The optical flow information map includes a plurality of regional pixel sets; each regional pixel set may use an arrow direction to represent the movement direction, and the arrow size may represent the vector magnitude. Assuming that the first image frame is shown in FIG. 3B and the second image frame is shown in FIG. 3C, and the character in the first image frame simulates the action of the character in the second image frame, the effect of the obtained optical flow information graph can be as shown in FIG. 3D, where each arrow represents the motion vector of a pixel region.
  • the pixel occlusion data is used to determine the occlusion sequence relationship between different target person pixels.
  • The target character pixel may represent a pixel formed after an initial character pixel is moved to the matching target pixel position. After each initial person pixel in the first image frame is affinely transformed to the matching target pixel position in the second image frame to form the corresponding target person pixel, there may be cases where target person pixels formed from multiple initial person pixels are located at the same pixel position. When multiple target person pixels are located at the same pixel position, only the top-level target person pixel is displayed, and the other target person pixels are not displayed as they are occluded.
  • the occlusion order relationship is used to describe the display order of multiple pixels.
  • the pixel occlusion data can be represented by a shadow map.
  • the correspondingly obtained shadow map may be as shown in FIG. 3E .
  • The darker the place, the lower the gray value (that is, the closer the gray value is to 0), and the higher the degree of occlusion of the area; the brighter the place, the higher the gray value (that is, the closer the gray value is to 255), which means that the area is less occluded.
  • The dense motion estimation model is pre-trained in the following manner: minimizing the photometric error between a video frame in the training video and the spatially transformed video frame is used as the training objective, and the deep learning model is iteratively trained to obtain the dense motion estimation model.
  • The spatially transformed video frame is generated by inputting the video frame in the training video into the spatial transformation model, and the local spatial features of the video frame in the training video are the same as the matching local spatial features in the spatially transformed video frame.
  • the video frame in the training video can be any video frame in the training video.
  • the spatially transformed video frame may be a video frame generated by spatially transforming the video frame in the training video by using a spatial transformation method.
  • the local spatial features of the video frames in the training video are the same as the matching local spatial features in the spatially transformed video frames, indicating that the video frames in the training video and the spatially transformed video frames satisfy the spatial invariance, and it also shows that the spatial transformation method satisfies the spatial invariance.
  • The spatial transformation method can be realized by the spatial transformer modules proposed by Max Jaderberg, Karen Simonyan, et al.
  • The photometric reconstruction error L_reconst can be expressed as:

    L_reconst = (1/N) · Σ_{(i, j)} ρ( I_1(i, j), I′_1(i, j) )

  • where N is the total number of pixels included in the video frame; (i, j) are the coordinates of a pixel; I_1(i, j) is the local spatial feature of the video frame in the training video; I′_1(i, j) is the matching local spatial feature in the spatially transformed video frame; and ρ(·) represents the photometric error between the local spatial feature of the video frame in the training video and the matching local spatial feature in the spatially transformed video frame, such as the difference in light intensity and the change in light direction.
  • the training objective of the dense motion estimation model is to minimize L reconst .
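  • A simplified sketch of this training objective, assuming the dense motion estimation model outputs a per-pixel flow field with which the original frame is warped and compared against the spatially transformed frame; taking ρ as a per-pixel L1 difference is an illustrative choice.
```python
import torch
import torch.nn.functional as F

def warp_by_flow(frame, flow):
    """Warp a frame with a dense flow field using bilinear sampling.

    frame: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (dx, dy).
    """
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)         # (1, 2, H, W)
    coords = base + flow                                             # absolute sampling positions
    # Normalize to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def photometric_loss(frame, transformed_frame, flow):
    """L_reconst sketch: mean photometric error rho between the spatially
    transformed frame and the original frame warped by the predicted flow;
    rho is taken here to be the per-pixel L1 difference (one simple choice).
    """
    reconstructed = warp_by_flow(frame, flow)
    return (reconstructed - transformed_frame).abs().mean()
```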
  • The dense motion estimation model can thus learn the motion features of the optical flow, so that the pixel motion data and pixel occlusion data required for the person in the first image frame to simulate the human pose specified in the second image frame can be extracted automatically and relatively accurately, thereby effectively improving the accuracy of the human body occlusion relationship of the characters in the character-driven image generated by the image-driven model obtained by training with the pixel motion data and pixel occlusion data, and thereby improving the authenticity of the character-driven image.
  • The training process of the image-driven model may be as follows: the key point detection model 301 is used to extract a plurality of initial character key points and the heat map corresponding to each initial character key point from the first image frame, and an initial local affine transformation matrix is generated according to the plurality of initial character key points and the heat map corresponding to each initial character key point.
  • The key point detection model 301 is also used to extract a plurality of target posture key points and the heat map corresponding to each target posture key point from the second image frame, and a target local affine transformation matrix is generated according to the plurality of target posture key points and the heat map corresponding to each target posture key point.
  • the initial local affine transformation matrix is multiplied by the target local affine transformation matrix to obtain the local affine transformation matrix, which is input into the dense motion estimation model 302 to obtain pixel motion data and pixel occlusion data.
  • Then, according to the first image frame, the pixel motion data and the pixel occlusion data, the image-driven model 303 based on the deep learning model is trained, and the character-driven image output by the image-driven model 303 is obtained.
  • the trained image-driven model 303 can be used to generate character-driven images.
  • The embodiment of the present disclosure automatically extracts pixel motion data and pixel occlusion data from the local affine transformation matrix and the first image frame through a pre-trained dense motion estimation model, which can improve the accuracy of the extracted character pixel motion features, thereby improving the accuracy of the human body occlusion relationship of the character in the character-driven image generated by the image-driven model trained with the pixel motion data and the pixel occlusion data, and improving the authenticity of the character-driven image.
  • FIG. 4A is a flowchart of an image generation method according to Embodiment 4 of the present disclosure.
  • This embodiment is applicable to make a person in a person image simulate facial expressions and/or body movements included in a specified video.
  • the method may be performed by the image generating apparatus provided by the embodiment of the present disclosure, and the apparatus may be implemented by means of software and/or hardware, and may generally be integrated in a computer device.
  • the device includes a trained image-driven model, and the training method thereof may refer to the method in the above-mentioned embodiment.
  • the method of this embodiment specifically includes the following steps.
  • the image of the person may include a real image of the person.
  • a person image includes a real image of a person's face and/or a person's body.
  • The character image may include at least one character, and a target character to be driven to simulate a specified expression and/or action may be selected from the character image according to the actual situation.
  • one of multiple people in the person image can be randomly selected, or the person with the largest proportion of the area in the person image, or the person whose face is not covered in the person image can be selected as the target person.
  • the present disclosure does not limit how to select the target person in the person image.
  • the specified video includes multiple video frames with consecutive time series, and each video frame can be regarded as an image.
  • the target video frame can be any video frame in the specified video.
  • the video frame in the specified video may be selected as the target video frame in sequence according to the timing of video playback, or a video frame may be randomly selected as the target video frame from the specified video.
  • the present disclosure does not limit the selection of the target video frame.
  • the target video frame is used to obtain the target pose information.
  • the target video frame includes target posture information, and the target posture information is used to instruct the target person in the person image to make a specified human body posture (facial posture and/or body posture), that is, the human body posture in the target video frame is transferred to the image of the person.
  • the target pose information may include human facial feature data and/or human body feature data.
  • The human facial feature data can be used to characterize the facial posture in the target video frame, so that the face of the target person in the person image simulates the facial posture in the target video frame, driving the facial posture of the target person to match the facial posture in the target video frame; for example, the face of the target person can be driven to make the same expression as in the target video.
  • the facial feature data of a person can represent at least one of the following: the orientation of the person's face, the outline of the face, and the positions of various organs.
  • The character body feature data can be used to characterize the body posture of the person in the target video frame, so that the body of the target person in the character image simulates the body posture of the person in the target video frame, driving the body posture of the target person to match the body posture in the target video frame; for example, the body of the target person can be driven to perform the same actions as in the target video.
  • the character body feature data may include at least one of the following: the position and direction of the character's head, the position and direction of the character's torso, and the position and direction of the character's limbs.
  • Obtaining the target video frame in the specified video may include: obtaining the specified video; obtaining the first video frame in the specified video as the target video frame; and, after generating the character-driven image corresponding to the target video frame, selecting the next video frame after the target video frame in time sequence as the new target video frame and generating the character-driven image corresponding to that target video frame again, until the last video frame in the specified video is obtained as the target video frame and the character-driven image corresponding to the last video frame is generated.
  • In this way, multiple character-driven images can be generated according to the specified video and arranged in sequence, so that a character-driven video consistent with the expressions and/or actions of the specified character in the specified video can be generated, finally realizing the driving of the target character in the character image.
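  • A minimal sketch of this frame-by-frame driving loop, assuming an OpenCV video source and a callable image_driven_model wrapping the pre-trained model (the callable name is a placeholder):
```python
import cv2

def drive_person_with_video(person_image, video_path, image_driven_model):
    """Generate a character-driven video by driving one person image with every
    frame of a specified video, in playback order.

    image_driven_model(person_image, target_frame) -> driven_frame is an assumed
    callable wrapping the pre-trained image-driven model.
    """
    cap = cv2.VideoCapture(video_path)
    driven_frames = []
    while True:
        ok, target_frame = cap.read()          # next target video frame in time order
        if not ok:                             # stop after the last frame of the video
            break
        driven_frames.append(image_driven_model(person_image, target_frame))
    cap.release()
    return driven_frames                       # frames in sequence form the driven video
```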
  • S430 Input the character image and the target video frame into a pre-trained image-driven model, and obtain a character-driven image output by the image-driven model.
  • The image-driven model is trained and generated by the training method of the image-driven model described above.
  • the image-driven model is used to generate a person-driven image in which the target person in the person image simulates the human body posture in the target video frame, and actually drives the target person in the person image to make a human body posture that matches the target video frame.
  • the person driving image may include a target person in the person image, and the human body posture of the target person in the person driving image matches the target posture information included in the target video frame.
  • the image areas other than the target person in the person driving image are the same as those in the person image.
  • the background in the person-driven image is the same as that of the person image
  • the foreground person in the person-driven image is the same as the target person in the person image
  • The actions of the foreground person in the person-driven image are the same as the actions of the person specified in the target video.
  • the character image is used to provide the target character to be driven.
  • the target video frame is used to specify the human pose.
  • the image-driven model is used to synthesize the target person and the specified human body posture, and generate a person image that can show the specified human body posture as a person-driven image.
  • the image-driven model is a pre-trained deep learning model.
  • the image-driven model may be the trained image-driven model 303 shown in FIG. 3F .
  • In one example, the person image and the target video frame in the specified video can be input into the trained key point detection model 301 shown in FIG. 3F, so as to extract the initial pose feature from the person image, extract the target pose feature from the target video frame, and generate a local affine transformation matrix from the initial pose feature to the target pose feature. Then, the local affine transformation matrix and the person image can be input into the trained dense motion estimation model 302 shown in FIG. 3F, so as to generate pixel motion data and pixel occlusion data according to the local affine transformation matrix and the person image. Finally, the person image, the pixel motion data and the pixel occlusion data can be input into the pre-trained image-driven model 303 shown in FIG. 3F to obtain the character-driven image output by the image-driven model.
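  • As an illustrative composition of the three stages shown in FIG. 3F (the three model callables are placeholders, not the actual models of the disclosure):
```python
import numpy as np

def compose(A_initial, A_target):
    # Multiply two 2x3 local affine matrices via their 3x3 homogeneous forms
    # (same idea as the composition sketch shown earlier).
    to_h = lambda A: np.vstack([A, [0.0, 0.0, 1.0]])
    return (to_h(A_target) @ to_h(A_initial))[:2, :]

def generate_driven_image(person_image, target_frame,
                          keypoint_detector, dense_motion_estimator, generator):
    """Illustrative composition of the three stages; the three model callables
    are assumptions standing in for models 301, 302 and 303 of FIG. 3F."""
    # 1. Key point detection model (301): pose features as local affine matrices.
    initial_affine = keypoint_detector(person_image)        # initial pose feature
    target_affine = keypoint_detector(target_frame)         # target pose feature
    local_affine = compose(initial_affine, target_affine)

    # 2. Dense motion estimation model (302): pixel motion and occlusion data.
    pixel_motion, pixel_occlusion = dense_motion_estimator(local_affine, person_image)

    # 3. Image-driven model (303): synthesize the character-driven image.
    return generator(person_image, pixel_motion, pixel_occlusion)
```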
  • The image-driven model is generated by training with the image-driven model training method according to any one of the embodiments of the present disclosure.
  • driving the character image according to the target video frame actually means adjusting the initial character pixels in the character image to target pixels matching the human posture specified in the target video frame.
  • the target pixel is not the real pixel in the target video frame, but the expected transformed pixel of the initial character pixel determined according to the human posture specified in the target video frame.
  • the initial human pixels include pixels representing human keypoints.
  • the adjustment data for the initial character pixel can be determined according to the initial character pixel and the matched target pixel, which may specifically include the motion of the initial character pixel and the occlusion between target pixels matched by multiple initial character pixels.
  • The corresponding pixel motion data and pixel occlusion data can be determined according to the person image and the target video frame, and, based on the pixel motion data and pixel occlusion data, the image-driven model training method according to any one of the embodiments of the present disclosure can be used to train the image-driven model.
  • As shown in FIG. 4B, the initial character images are the two images in the first column on the left, and the three images in the first row are sequentially used as the target video frames in the specified video; the resulting character-driven images are shown in the second to fourth columns of the second row and the second to fourth columns of the third row of FIG. 4B, respectively.
  • The image-driven model is trained by using the first image frame and the character pixel motion data and character pixel occlusion data associated with the driving information as training samples, so that the image-driven model can automatically learn occlusion features, thereby improving the accuracy of the human body occlusion relationship of the character in the character-driven image generated by the trained image-driven model and thereby improving the authenticity of the character-driven image.
  • FIG. 5 is a schematic diagram of an image-driven model training apparatus according to Embodiment 5 of the present disclosure.
  • the fifth embodiment is a corresponding apparatus for implementing the image-driven model training method provided by the above-mentioned embodiments of the present disclosure.
  • the apparatus may be implemented in software and/or hardware, and may generally be integrated in computer equipment or the like.
  • the image-driven model training device includes:
  • An image acquisition module 510, configured to acquire a first image frame and a second image frame, where the first image frame and the second image frame may be different video frames in the same video, the person image included in the first image frame may be referred to as an initial person image, and the human posture of the person image included in the second image frame may be referred to as a designated human posture;
  • a feature extraction module 520, configured to extract an initial pose feature from the first image frame, extract a target pose feature from the second image frame, and generate a local affine transformation matrix from the initial pose feature to the target pose feature;
  • a data generation module 530 configured to generate pixel motion data and pixel occlusion data according to the local affine transformation matrix and the first image frame;
  • the model training module 540 is configured to train an image-driven model based on a deep learning model according to the first image frame, the pixel motion data and the pixel occlusion data.
  • The image-driven model can be trained by using the first image frame and the pixel motion data and pixel occlusion data associated with the driving information as training samples, so that the image-driven model can automatically learn occlusion features, thereby effectively improving the accuracy of the human body occlusion relationship of the character in the character-driven image output by the trained image-driven model and effectively improving the authenticity of the character-driven image.
  • the feature extraction module 520 includes a local affine transformation matrix calculation unit, configured to: input the first image frame into the keypoint detection model, and obtain a plurality of initial character keys output by the keypoint detection model. point and the heat map corresponding to each of the initial character key points; according to each of the initial character key points and the corresponding heat map, generate an initial local affine transformation matrix as the initial pose feature; input the second image frame to the In the key point detection model, obtain a plurality of target posture key points output by the key point detector and a heat map corresponding to each of the target posture key points; according to each of the target posture key points and the corresponding heat map, generate The target local affine transformation matrix is used as the target pose feature; the initial local affine transformation matrix is multiplied by the target local affine transformation matrix to obtain the local affine transformation from the initial pose feature to the target pose feature matrix.
  • The local affine transformation matrix calculation unit may be configured to: obtain the coordinates of each initial character keypoint or each target pose keypoint, together with the matching confidence; generate, according to those coordinates and confidences, a heat map region matched with each initial character keypoint or each target pose keypoint; for the heat map region matched with each initial character keypoint or each target pose keypoint, convert the heat map region into a heat map region of a set regular shape, obtain the local affine transformation matrix corresponding to the regular-shaped heat map region, and determine it as the local affine transformation matrix corresponding to that initial character keypoint or target pose keypoint; and determine the local affine transformation matrices corresponding to the initial character keypoints or to the target pose keypoints as the initial local affine transformation matrix or the target local affine transformation matrix, respectively.
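  • As a non-limiting illustration only (not taken from the disclosed embodiments), the following sketch shows one way such a unit could turn a predicted keypoint, its coordinates and its confidences into a small heat map region and a regular 2x3 local affine matrix; the Gaussian construction of the region and the moment-based fit are assumptions introduced for this sketch, since the disclosure only fixes the inputs (coordinates and confidences) and the regular output shape. In this reading, a lower confidence spreads the region, so the fitted affine reflects the uncertainty of the keypoint estimate.

```python
# Illustrative conversion of one keypoint (coordinates plus confidences) into a
# small heat-map region and a regular 2x3 local affine matrix. The Gaussian
# region and the moment-based fit are assumptions made for this sketch.
import numpy as np


def keypoint_heatmap(conf_x, conf_y, size=5):
    """Odd-sized heat-map region centred on the keypoint; lower confidence
    spreads the mass further from the centre."""
    half = size // 2
    xs, ys = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    sx, sy = 1.0 / max(conf_x, 1e-3), 1.0 / max(conf_y, 1e-3)
    return np.exp(-0.5 * ((xs / sx) ** 2 + (ys / sy) ** 2))


def heatmap_to_affine(heat, x, y):
    """Fit a 2x3 affine from the weighted moments of the heat-map region; the
    linear part comes from the region's covariance, the translation from the
    keypoint position relative to the region's centre of mass."""
    h, w = heat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    wsum = heat.sum()
    mx, my = (heat * xs).sum() / wsum, (heat * ys).sum() / wsum
    cov = np.cov(np.stack([xs.ravel(), ys.ravel()]), aweights=heat.ravel())
    lin = np.linalg.cholesky(cov + 1e-6 * np.eye(2))
    return np.hstack([lin, np.array([[x - mx], [y - my]])])  # shape (2, 3)
```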
  • The keypoint detection model includes a U-shaped network (U-Net).
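  • A reduced-depth sketch of such a U-shaped keypoint detector is given below for illustration; it uses two encoder/decoder levels instead of the four described, and the channel widths, the number of keypoints and the soft-argmax readout of coordinates are assumptions, not the disclosed configuration.

```python
# Reduced-depth U-Net sketch (two levels instead of four); channel widths, the
# number of keypoints K and the soft-argmax readout are illustrative choices.
# Input height and width are assumed even so the skip concatenation lines up.
import torch
import torch.nn as nn
import torch.nn.functional as F


def double_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))


class TinyUNet(nn.Module):
    def __init__(self, num_keypoints=10):
        super().__init__()
        self.enc1 = double_conv(3, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = double_conv(64 + 32, 32)          # skip connection from enc1
        self.head = nn.Conv2d(32, num_keypoints, 1)   # one heat map per keypoint

    def forward(self, x):
        s1 = self.enc1(x)                  # full-resolution features
        s2 = self.enc2(self.pool(s1))      # downsampled features
        d1 = self.dec1(torch.cat([self.up(s2), s1], dim=1))
        heat = self.head(d1)
        # Soft-argmax turns each heat map into differentiable keypoint coordinates.
        b, k, h, w = heat.shape
        prob = F.softmax(heat.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.linspace(0, 1, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(0, 1, w, device=x.device).view(1, 1, 1, w)
        coords = torch.stack([(prob * xs).sum(dim=(2, 3)),
                              (prob * ys).sum(dim=(2, 3))], dim=-1)  # (B, K, 2)
        return prob, coords
```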
  • The data generation module 530 may be configured to: input the local affine transformation matrix and the first image frame into a pre-trained dense motion estimation model, and obtain the pixel motion data and pixel occlusion data output by the dense motion estimation model.
  • The dense motion estimation model includes a deep learning model. The pixel motion data includes the motion direction of each initial character pixel in the first image frame pointing toward the matching target pixel position in the second image frame. The pixel occlusion data includes the occlusion order relationship among the target pixels formed when a plurality of initial character pixels in the first image frame are moved, by the affine transformation, to the matching target pixel positions in the second image frame.
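  • For illustration, the two outputs could be consumed as in the following hedged sketch, which warps the first image frame (or its feature map) along the dense motion field and then down-weights occluded positions; the tensor layout follows common optical-flow-style code and is an assumption rather than the disclosure's exact data format.

```python
# Hedged sketch of consuming the two outputs: warp the first frame (or its
# feature map) along the dense motion field, then down-weight occluded
# positions so the generator has to inpaint them.
import torch.nn.functional as F


def warp_with_occlusion(source, motion_grid, occlusion):
    """
    source:      (B, C, H, W) first image frame or feature map
    motion_grid: (B, H, W, 2) sampling locations in [-1, 1], i.e. where each
                 output pixel is taken from in the source (assumed layout)
    occlusion:   (B, 1, H, W) values in [0, 1]; low values mean "occluded"
    """
    warped = F.grid_sample(source, motion_grid, mode="bilinear",
                           padding_mode="border", align_corners=False)
    return warped * occlusion
```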
  • The image-driven model training apparatus may further include a dense motion estimation model training module, configured to train the dense motion estimation model as follows: taking minimization of the photometric error between a video frame in a training video and the corresponding spatially transformed video frame as the training objective, a deep learning model is trained to generate the dense motion estimation model.
  • The spatially transformed video frame is generated by inputting the video frame of the training video into a spatial transformation model, and the local spatial features of the video frame in the training video are the same as the matching local spatial features in the spatially transformed video frame.
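  • A minimal sketch of such a photometric training objective is given below, assuming a Charbonnier penalty as one possible choice of the error measure ρ; the disclosure itself only requires some photometric error between the training video frame and its spatially transformed counterpart.

```python
# Hedged sketch of the photometric objective: a robust per-pixel error
# (Charbonnier, as one choice of rho) averaged over the N pixels of the
# training frame i1 and its spatially transformed counterpart i1_prime.
import torch


def photometric_loss(i1, i1_prime, eps=1e-3):
    # i1, i1_prime: (B, C, H, W) tensors with matching local spatial features.
    return torch.sqrt((i1 - i1_prime) ** 2 + eps ** 2).mean()
```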
  • The model training module 540 may be configured to: calculate the loss function of the deep learning model according to loss function configuration information, where the loss function configuration information is used to add an equivariance constraint function as a constraint condition on the basis of the initial loss function, and the equivariance constraint function is determined by the difference between the spatially transformed coordinates of the initial character keypoints and the coordinates of the expected keypoints; and, if it is determined that the loss function satisfies a stability condition, determine the deep learning model obtained by the current training as the image-driven model.
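  • The constraint can be pictured with the following hedged sketch, in which g is a known random spatial transform (for example a thin-plate spline): keypoints detected on the transformed image should coincide with the transform applied to the keypoints of the original image; the equal weighting and the per-keypoint Euclidean norm are assumptions of this sketch.

```python
# Hedged sketch of adding the equivariance constraint to the base loss.
import torch


def equivariance_loss(kp_original, kp_on_transformed, transform):
    # kp_original, kp_on_transformed: (B, K, 2) keypoint coordinates detected on
    # the original and on the spatially transformed image; transform maps
    # original coordinates to where the known warp g sends them.
    expected = transform(kp_original)
    return torch.norm(expected - kp_on_transformed, dim=-1).mean()


def total_loss(base_loss, kp_original, kp_on_transformed, transform, weight=1.0):
    # The weight of the constraint term is an assumption of this sketch.
    return base_loss + weight * equivariance_loss(
        kp_original, kp_on_transformed, transform)
```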
  • The above-mentioned image-driven model training apparatus can execute any of the image-driven model training methods provided in the embodiments of the present disclosure, and has functional modules and beneficial effects corresponding to the executed image-driven model training method.
  • FIG. 6 is a schematic diagram of an image generating apparatus according to Embodiment 6 of the present disclosure.
  • Embodiment 6 is a corresponding apparatus for implementing the image generation method provided by the foregoing embodiments of the present disclosure.
  • the apparatus may be implemented in software and/or hardware, and may generally be integrated in computer equipment or the like.
  • the apparatus of this embodiment may include:
  • a person image acquisition module 610 configured to acquire a person image
  • a target video frame acquisition module 620 configured to acquire a target video frame in a specified video
  • a character-driven image generation module 630, configured to input the person image and the target video frame into a pre-trained image-driven model and obtain the character-driven image output by the image-driven model, where the image-driven model is trained and generated by the image-driven model training method described in any one of the above embodiments.
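  • A hedged end-to-end usage sketch for this generation stage is given below; the model call signature, the use of imageio for video decoding and the tensor conversion helper are placeholders introduced for illustration and are not an API defined by the disclosure.

```python
# Hedged usage sketch for the generation stage; model signature, imageio-based
# video decoding and the tensor helper are illustrative placeholders.
import numpy as np
import torch
import imageio  # assumed available for reading the specified video


def to_tensor(img):
    # HWC uint8 image -> 1xCxHxW float tensor in [0, 1].
    arr = np.asarray(img)
    return torch.from_numpy(arr).permute(2, 0, 1).float().unsqueeze(0) / 255.0


def drive_person(person_image, driving_video_path, image_driven_model,
                 device="cpu"):
    person = to_tensor(person_image).to(device)
    driven_frames = []
    with torch.no_grad():
        for frame in imageio.get_reader(driving_video_path):  # target frames
            target = to_tensor(frame).to(device)
            driven_frames.append(image_driven_model(person, target).cpu())
    return driven_frames
```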
  • By using the first image frame and the pixel motion data and pixel occlusion data associated with the driving information as training samples, the image-driven model can automatically learn occlusion features, which effectively ensures the accuracy of the human-body occlusion relationships of the character in the character-driven images output by the trained image-driven model, and improves the authenticity of the character-driven images.
  • the above-mentioned image generation apparatus can execute the image generation method provided by any of the embodiments of the present disclosure, and has functional modules and beneficial effects corresponding to the executed image generation method.
  • FIG. 7 is a schematic structural diagram of a computer device according to Embodiment 7 of the present disclosure.
  • FIG. 7 shows a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present disclosure.
  • the computer device 12 shown in FIG. 7 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • Bus 18 represents one or more of several types of bus structures, including a memory bus, a peripheral bus, or a bus using any of a variety of bus structures. By way of example, such bus structures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including both volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • The storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard disk drive"). Although not shown in FIG. 7, a disk drive may be provided for reading and writing removable non-volatile magnetic disks (e.g., "floppy disks"), as well as an optical disc drive for reading and writing removable non-volatile optical discs (e.g., CD-ROM or DVD-ROM). In these cases, each drive may be connected to the bus 18 through one or more data media interfaces.
  • System memory 28 may store at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present disclosure.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored in system memory 28, for example.
  • Program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include an implementation of a network environment.
  • Program modules 42 generally perform the functions and/or methods of the embodiments described in this disclosure.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card or modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22.
  • Moreover, the computer device 12 can also communicate with one or more networks (e.g., a local area network (LAN) or a wide area network (WAN)) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18.
  • the processing unit 16 executes various functional applications and data processing by running the program modules 42 stored in the system memory 28, such as implementing an image-driven model training and/or image generation method provided by any embodiment of the present disclosure.
  • The eighth embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the image-driven model training method provided by any of the disclosed embodiments of the present application, or implements the image generation method provided by any of the disclosed embodiments.
  • the computer storage medium of the embodiments of the present disclosure may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, RAM, Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), flash memory, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An image-driven model training method, an image generation method, and corresponding apparatus, device and medium. The method includes: acquiring a first image frame and a second image frame (S110); extracting an initial pose feature from the first image frame, extracting a target pose feature from the second image frame, and generating a local affine transformation matrix pointing from the initial pose feature to the target pose feature (S120); generating pixel motion data and pixel occlusion data according to the local affine transformation matrix and the first image frame (S130); and training an image-driven model based on a deep learning model according to the first image frame, the pixel motion data and the pixel occlusion data (S140).

Description

图像驱动模型训练、图像生成 技术领域
本公开实施例涉及人工智能领域,尤其涉及一种图像驱动模型训练方法、图像生成方法和相应的装置、设备及介质。
背景技术
近年来,人们对于合成图像的真实度要求越来越高,这要求图像处理技术可以生成更为真实和自然的图像。相关技术中,可以采用单张目标人脸和一个驱动视频(Driving Video),就可以让目标人脸模拟驱动视频中的人的表情或动作。在一个例子中,可以采用姿态估计的算法提取驱动视频的关键点信息,通过生成对抗网络模型(Generative Adversarial Network,GAN)实现模型训练,以使得训练得到的模型可以用于达到目标人脸模拟驱动视频中的人的表情和动作的效果。
然而,由于驱动视频中人物的表情和动作可能存在多种情况,相关技术中,在对模型训练过程中并未充分考虑人物的表情和动作的各种不同情况,所以生成图像的准确性有待提高。
发明内容
本公开实施例提供一种图像驱动模型训练方法、图像生成方法和相应的装置、设备及介质,可以提高生成图像中人物的人体遮挡关系准确性,提高生成图像的真实性。
第一方面,本公开实施例提供了一种图像驱动模型训练方法,包括:获取第一图像帧以及第二图像帧;从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵;根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据;根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
第二方面,本公开实施例提供了一种图像生成方法,包括:获取人物图像;获取指定视频中的目标视频帧;将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像。其中,所述图像驱动模型通过如本公开实施例中任一项所述的图像驱动模型训练方法训练生成。
第三方面,本公开实施例还提供了一种图像驱动模型训练装置,包括:图像获取模块,用于获取第一图像帧以及第二图像帧;特征提取模块,用于从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵;数据生成模块,用于根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据;模型训练模块,用于根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
第四方面,本公开实施例还提供了一种图像生成装置,包括:人物图像获取模块,用于获取人物图像;目标视频帧获取模块,用于获取指定视频中的目标视频帧;人物驱动图像生成模块,用于将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像。其中,所述图像驱动模型通过如本公开实施例中任一项所述的图像驱动模型训练方法训练生成。
第五方面,本公开实施例还提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如本公开实施例中任一所述的图像驱动模型训练方法或图像生成方法。
第六方面,本公开实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本公开实施例中任一所述的图像驱动模型训练方法或图像生成方法。
本公开实施例通过采用第一图像帧以及驱动信息关联的像素运动数据和像素遮挡数据作为训练样本,对图像驱动模型进行训练,可以使图像驱动模型自动学习到遮挡特征,从而可有效提高利用训练得到的该图像驱动模型输出的人物驱动图像中人物的人体遮挡关系的准确性,进而提高该人物驱动图像的真实性。
附图说明
图1是根据本公开实施例一的一种图像驱动模型训练方法的流程图;
图2A是根据本公开实施例二的一种图像驱动模型训练方法的流程图;
图2B是根据本公开实施例的一种局部仿射变换矩阵的示意图;
图3A是根据本公开实施例三的一种图像驱动模型训练方法的流程图;
图3B是根据本公开实施例的第一图像帧的示意图;
图3C是根据本公开实施例的第二图像帧的示意图;
图3D是根据本公开实施例的光流信息图的示意图;
图3E是根据本公开实施例的阴影图的示意图;
图3F是根据本公开实施例三的一种训练图像驱动模型的应用场景的示意图;
图4A是根据本公开实施例四的一种图像生成方法的流程图;
图4B是根据本公开实施例的一种人物驱动图像的示意图;
图5是根据本公开实施例五的一种图像驱动模型训练装置的结构示意图;
图6是根据本公开实施例六的一种图像生成装置的结构示意图;
图7是本公开实施例七中的一种计算机设备的结构示意图。
具体实施方式
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释本公开,而非对本公开的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本公开相关的部分而非全部结构。
实施例一
图1为根据本公开实施例一的一种图像驱动模型训练方法的流程图,本实施例可适用于训练生成图像驱动模型,该图像驱动模型用于使人物图像中的人物模拟指定视频中包括的面部表情和/或身体动作,也就是驱动人物执行与指定视频相匹配的动作和/或做出与指定视频相匹配的表情。该方法可以由本公开实施例提供的图像驱动模型的训练装置来执行。该装置可采用软件和/或硬件的方式实现,并一般可集成在计算机设备中。如图1所示,本实施例的方法具体包括以下步骤。
S110,获取第一图像帧和第二图像帧。
在一些例子中,可以从驱动视频获取第一图像帧和第二图像帧。驱动视频包括时序上连续的多个视频帧,驱动视频中的视频帧为反映人物运动的图像。或者,可以获取两幅静态图像,每幅静态图像为反映人物运动的图像。或者,可以获取一幅静态图像作为第一图像帧,并从驱动视频获取第二图像帧。其中,第一图像帧中的人物与第二图像帧中的人物可以不同。为了更容易地训练图像驱动模型,第一图像帧中的人物与第二图像帧中的人物也可以相同。
在一些例子中,可以将第一图像帧作为初始人物图像,将第二图像帧作为初始人物图像需要模拟的目标人物图像。换言之,第一图像帧包括指定人物的人物图像,第二图像帧包括指定人体姿态的人物图像。例如,可以将第二图像帧中包括的人物的人体姿态作为初始人物图像中的人物将要模拟的目标人体姿态。其中,第二图像帧可以是驱动视频中的任意一个视频帧。
通常,为了让图像驱动模型可以学习到如何从人物图像生成能够模拟驱动视频中的人体姿态的人物驱动图像,需要使人物图像和生成的人物驱动图像不同,因此需要第一图像帧和第二图像帧不完全相同。例如,如果第一图像帧与第二图像帧来自同一视频,第一图像帧与第二图像帧对应的时间戳至少相隔设定时长,例如1分钟。又例如,第一图像帧与第二图像帧的相似度值小于设定阈值,也就是使第一图像帧与第二图像帧存在一定差异。
S120,从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵。
初始姿态特征用于表征第一图像帧中一个人物的特征,可以包括面部特征数据和/或身体特征数据。目标姿态特征用于表征第二图像帧中一个人物的特征,可以包括面部特征数据和/或身体特征数据。
仿射变换矩阵用于将一个像素矩阵进行空间变换,形成另一个像素矩阵。在本公开实施例中,仿射变换矩阵可以用于将包括人物像素的矩阵进行空间变换,形成另外一个人物像素的矩阵。其中,空间变换包括下述至少一项:线性变化、旋转变换和平移变换等。局部仿射变换矩阵可以针对人物局部区域的像素矩阵进行仿射变换。其中,人物局部区域表示某一人物的局部区域,例如左臂区域、右腿区域或头部区域等,甚至还可以是某一人物的多个人体局部区域的组合。
从初始姿态特征指向目标姿态特征的局部仿射变换矩阵,用于将第一图像帧中的人物像素矩阵(以下也可称为初始人物像素矩阵)通过仿射变换形成与第二图像帧匹配的目标人物像素矩阵。具体的,仿射变换矩阵可通过根据初始人物像素矩阵和匹配的目标人物像素矩阵确定,其中,人物像素可以是表示人体关键点的像素。
S130,根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据。
像素运动数据可以表示像素(例如人物关联的像素)移动到指定像素位置(例如指定人体姿态关联的像素位置)的运动。像素遮挡数据可以表示在像素移动到指定像素位置的过程中,移动到同一像素位置的多个像素之间的遮挡关系。
根据局部仿射变换矩阵和第一图像帧,可以确定第一图像帧中人物像素的运动方向(变换矢量),作为驱动信息关联的像素运动数据;以及确定运动到同一像素位置的像素的前后遮挡顺序,作为驱动信息关联的像素遮挡数据。
S140,根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深 度学习模型的图像驱动模型。
在本公开实施例中,图像驱动模型用于驱动人物做出指定的人体姿态,可以理解为将人物关联的像素移动到指定像素位置处以形成该人物的指定人体姿态。在将人物像素移动到指定人体姿态匹配的像素位置的过程中,需要确定移动方向和移动距离。相应的,像素运动数据可以包括移动方向和/或移动距离等。而且,指定的人体姿态中可能存在肢体相互遮挡的情况,并使得多个初始人物像素移动到同一指定像素位置,从而针对该同一指定像素位置,需要获取所述多个初始人物像素之间的遮挡关系,并将未被遮挡的初始人物像素在最终形成的人物驱动图像中进行展示。相应的,像素遮挡数据可以包括关键点的遮挡关系。
图像驱动模型通过学习第一图像帧、像素运动数据和像素遮挡数据,可以按照第二图像帧中目标姿态特征将第一图像帧中的人物准确调整成第二图像帧指定的人体姿态,生成相应的人物驱动图像。
将生成的像素运动数据、像素遮挡数据和第一图像帧作为训练样本,对基于深度学习的图像驱动模型进行训练,以使图像驱动模型从目标姿态特征中学习到像素运动数据和像素遮挡数据,以及从根据像素运动数据、像素遮挡数据和第一图像帧生成模拟第二图像帧指定的人体姿态的人物驱动图像的过程中,自动学习如何从人物图像生成模拟驱动视频指定的人体姿态的人物驱动图像。所训练的图像驱动模型为端到端模型,可以避免对图像预处理的操作,大大简化了模型训练过程,同时降低因多环节的图像处理的误差引入,从而提高利用经训练的图像驱动模型所生成的人物驱动图像的准确率。
本公开实施例通过采用第一图像帧以及驱动信息关联的像素运动数据和像素遮挡数据作为训练样本,对图像驱动模型进行训练,可以使图像驱动模型自动学习到遮挡特征,从而有效保证利用训练得到的该图像驱动模型输出的人物驱动图像中人物的人体遮挡关系的准确性,进而提高该人物驱动图像的真实性。
实施例二
图2A为根据本公开实施例二的一种图像驱动模型训练方法的流程图,本实施例以上述实施例为基础进行具体化。本实施例的方法具体包括:
S201,获取驱动视频。
本公开实施例中未详尽的描述可以参考前述实施例。
S202,获取所述驱动视频中的第一图像帧以及第二图像帧。其中,所述第一图像帧和所述第二图像帧为不同的视频帧,所述第一图像帧包括的人物图像可称为初始人物图像,所述第二图像帧包括的人物图像的人体姿态可称为指定人体姿态。
S203,将所述第一图像帧输入到关键点检测模型中,获取所述关键点检测模型输出的多个初始人物关键点和各所述初始人物关键点对应的热力图。
关键点检测模型用于在人物图像中检测人体关键点,并生成热力图(Heat map)。Heat map可以用颜色变化来反映二维矩阵或表格中的数据信息,它可以直观地将数据值的某个属性(例如大小或密度等)以定义的颜色深浅表示出来。初始人物关键点可以是第一图像帧中人物的人体关键点。对应的热力图用于描述初始人物关键点位于第一图像帧中各位置的概率。
可选的,所述关键点检测模型包括U型网络(U-Net)。其中,U-Net可以包括编码器和解码器。编码器可以包括四个子模块,每个子模块包括两个卷积层。每个子模块分别与一个下采样层相连,下采样层可通过最大池化网络实现,也即,每个子模块的输出结果输入到下采样层进行下采样。数据依次经过下采样层,分辨率依次下降。解码器可 以包括四个子模块,每个子模块分别与一个上采样层相连。数据依次经过上采样层,分辨率依次上升,直到与输入U-Net的图像的分辨率一致。从而在图像从输入U-Net到输出U-Net的整个过程中,图像的分辨率的大小变化形成U型效果。U-Net还使用了跳跃连接,将解码器中某个子模块输出的上采样结果与编码器中具有相同分辨率的子模块的输出进行连接,作为解码器中下一个子模块的输入。在关键点检测模型中,U-Net将浅层特征图与深层特征图结合,这样可以结合局部条件(Where)以及全局内容(What)的特征生成更精准的图像,从而可以根据更精准的图像进行关键点检测,提高关键点检测的准确率。
S204,根据各所述初始人物关键点和对应的热力图,生成初始局部仿射变换矩阵,作为初始姿态特征。
具体的,根据初始人物关键点可以确定初始人物关键点在第一图像帧中每个位置的概率,并根据该概率和关键点的位置可以生成对应的热力图。由于每个关键点对应的热力图的形状不同,可以将各关键点对应的热力图统一变换成指定形状,并将对各关键点对应的热力图进行变换的仿射变换矩阵作为关键点的初始局部仿射变换矩阵。
可选的,所述根据各所述初始人物关键点和对应的热力图,生成初始局部仿射变换矩阵,包括:获取各所述初始人物关键点的坐标,以及匹配的置信度;根据各所述初始人物关键点的坐标以及匹配的置信度,生成分别与各所述初始人物关键点匹配的热力图区域;针对每个所述初始人物关键点匹配的热力图区域,将所述热力图区域转换为设定规则形状,并将转换为所述设定规则形状时对应的局部仿射变换矩阵,确定为所述初始人物关键点对应的局部仿射变换矩阵;根据各所述初始人物关键点对应的局部仿射变换矩阵,确定初始局部仿射变换矩阵。
在一些实施例中,可以通过U-Net或者例如CPM(Convolutional Pose Machines,卷积姿态机)算法的其他回归算法,计算初始人物关键点在第一图像帧中的预测坐标以及该初始人物关键点在第一图像帧中每个位置的概率,并且根据初始人物关键点在第一图像帧中的预测坐标以及该初始人物关键点在该预测坐标周围的位置的概率,确定该预测坐标的置信度。通常,将初始人物关键点在第一图像帧中概率最大的位置确定作为该初始人物关键点在第一图像帧中的预测坐标。
根据初始人物关键点在第一图像帧中的预测坐标以及该初始人物关键点在第一图像帧中每个位置的概率,生成以初始人物关键点为中心的热力图。热力图用于通过颜色表示中心点(即概率最大的预测坐标位置)对周围的影响力。通过U-Net可以获取每个关键点的坐标以及该坐标的置信度,具体是(x1,y1,m1,n1)。其中,(x1,y1)是坐标,m1为x1的置信度,n1为y1的置信度。置信度的取值范围可为[0,1]。
具体的,为了生成热力图,可预先生成一个设定的奇数矩阵(例如,3*3矩阵或5*5矩阵)。例如,以概率最大的预测坐标位置为矩阵中心,根据该预测坐标位置对应的置信度,在x轴方向和y轴方向分别采用双线性插值方法进行插值,并配置插入的坐标点的像素色彩值作为矩阵中元素,从而生成热力图对应的奇数矩阵。所插入的坐标点的像素色彩值与该坐标点和中心点之间的距离存在对应关系,例如,远离中心点的坐标点的像素色彩值的红色值越低,靠近中心点的坐标点的像素色彩值的红色值越高。
奇数矩阵通常无法直接用于人物像素矩阵的仿射变换,由此,可以通过对热力图对应的奇数矩阵进行仿射变换,生成设定规则形状的矩阵,作为初始人物关键点对应的局部仿射变换矩阵。设定规则形状可以根据需要进行设定,示例性的,设定规则形状可为2*3矩阵,此外还有其他情形,对此,本公开实施例不作具体限制。
在指定上述奇数矩阵和设定规则形状矩阵之后,可以通过指定上述奇数矩阵和设定 规则形状矩阵之间的映射方式,确定由奇数矩阵指向设定规则形状矩阵的变换方法。例如,可采用一个仿射变换矩阵,与指定的奇数矩阵相乘,乘积为设定规则形状矩阵。相应的,将热力图对应的奇数矩阵与该仿射变换矩阵相乘,所得到的乘积结果即为初始人物关键点对应的设定规则形状的局部仿射变换矩阵。
初始局部仿射变换矩阵包括多个初始人物关键点各自对应的局部仿射变换矩阵。
通过基于各初始人物关键点在第一图像帧中的预测坐标和该预测坐标的置信度,生成各初始人物关键点对应的热力图,并根据各热力图确定各初始人物关键点对应的局部仿射变换矩阵、以及进一步确定初始局部仿射变换矩阵,可以相对准确地评估初始人物关键点的预测准确性,有效指示图像驱动模型相对准确地学习初始人物关键点的坐标,从而可以有效提高训练得到的该图像驱动模型的人物关键点的识别准确率,进而提高利用该图像驱动模型生成的人物驱动图像的准确率。
此外,还可以通过U-Net为每个初始人物关键点预测4个标量加权数值,并根据所述标量加权数值对每个初始人物关键点对应的热力图的区域置信度进行加权,最终获得每个初始人物关键点对应的例如2x3设定规则形状矩阵的局部仿射变换矩阵。
S205,将所述第二图像帧输入到所述关键点检测模型中,获取所述关键点检测器输出的多个目标姿态关键点和各所述目标姿态关键点对应的热力图。
目标姿态关键点可以是第二图像帧中人物的人体关键点。对应的热力图用于描述目标姿态关键点位于第二图像帧中各位置的概率。
S206,根据各所述目标姿态关键点和对应的热力图,生成目标局部仿射变换矩阵,作为目标姿态特征。
目标局部仿射变换矩阵的生成方法同上述初始局部仿射变换矩阵的生成方法,在此不再赘述。
S207,将所述初始局部仿射变换矩阵与所述目标局部仿射变换矩阵相乘,获取从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵。
局部仿射变换矩阵为初始局部仿射变换矩阵与目标局部仿射变换矩阵相乘的结果。实际上矩阵可以表征图像特征,初始局部仿射变换矩阵用于描述或表征第一图像帧的初始姿态特征,目标局部仿射变换矩阵用于描述或表征第二图像帧的目标姿态特征,两者相乘得到的局部仿射变换矩阵用于描述或表征从初始姿态特征到目标姿态特征的变化量。从而,根据局部仿射变换矩阵,可以将第一图像帧中的人物像素矩阵变换形成与第二图像帧中的人体姿态匹配的目标人物像素矩阵。
示例性的,局部仿射变换矩阵的示意图如图2B所示,一个矩形代表一个局部仿射变换矩阵。一个局部放射变换矩阵可以与人体中局部区域,例如,左臂区域、右臂区域、左腿区域或右腿区域相关联。
S208,根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据。
S209,根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
可选的,所述根据所述第一图像帧、像素运动数据和像素遮挡数据,训练基于深度学习模型的图像驱动模型,包括:根据损失函数配置信息计算所述深度学习模型的损失函数,所述损失函数配置信息用于在深度学习模型的初始损失函数的基础上添加同变性约束函数,所述同变性约束函数通过对初始人物关键点进行空间变换后的坐标与期望关 键点的坐标之间的差值确定;如果确定所述损失函数满足稳定条件,则将当前训练得到的深度学习模型确定为所述图像驱动模型,否则返回步骤202再次执行对基于深度学习模型的图像驱动模型的训练。
损失函数配置信息用于在图像驱动模型的初始损失函数的基础上,添加同变性约束函数。其中,该同变性约束函数可以包括欧氏距离范数,又可称为正则化项或者L2范数,是指各元素的平方和再开方的结果。添加欧氏距离范数相当于对初始损失函数添加约束条件,实际是对于大数值的权重向量进行严厉惩罚,以倾向于更加分散的权重向量,从而实现使权重的分配更均匀,并避免权重集中在少数向量上,使得图像驱动模型更接近低维模型。维度越低,训练使用的数据量越小。因此,对初始损失函数添加欧氏距离范数作为约束条件,可以降低图像驱动模型训练使用的数据量,从而可以降低图像驱动模型训练的复杂度。
稳定条件用于判断损失函数是否趋于稳定、趋于收敛。例如,稳定条件用于判断相邻训练轮次中损失函数的变化率是否小于设定的变化率阈值。其中,该变化率阈值的大小可以根据实际情况限定。相邻训练轮次中损失函数的变化率可以是:计算当前训练得到的损失函数的值与前一次训练得到的损失函数的值之间的差值,并计算该差值与当前训练得到的损失函数的值的比值。如果该比值小于设定的变化率阈值,则确定即使再训练损失函数的变化率也很小,表明损失函数趋于稳定,或称损失函数收敛。此时,确定深度学习模型训练完成,将当前训练得到的深度学习模型作为图像驱动模型。
具体的,通过在图像驱动模型的初始损失函数基础上添加欧氏距离范数而更新后的损失函数LOSS_new可以如公式(1)所示:
LOSS_new = LOSS + L_eqv       (1)
其中，LOSS为图像驱动模型的初始损失函数，L_eqv为同变性约束函数。
同变性约束函数根据对初始人物关键点进行空间变换后的坐标与期望关键点的坐标之间的差值确定，同变性约束函数L_eqv可以如公式(2)所示：
L_eqv = ∑_{k=1}^{K} ‖(x_k, y_k) − g(x′_k, y′_k)‖_F       (2)
其中，K为初始人物关键点的数量，(x′_k, y′_k)为第k个初始人物关键点的坐标，(x_k, y_k)为第k个期望关键点的坐标，(x_k, y_k)实际表示初始人物关键点(x′_k, y′_k)经期望变换后形成的期望关键点。g(*)为用于对初始人物关键点进行坐标变换的函数，g(x′_k, y′_k)为初始人物关键点(x′_k, y′_k)通过g(*)进行坐标变换后的坐标。g(x′_k, y′_k)越接近(x_k, y_k)，表明初始人物关键点越接近期望关键点，也即g(*)越接近期望变换。‖·‖_F为隐藏空间的参数项矩阵的F范数。其中，期望关键点可以配置为趋近于目标姿态关键点的关键点，也即，期望关键点可以是由初始人物关键点变换成为目标姿态关键点的过程中的中继。例如，期望关键点可以是初始人物关键点经过期望变换后得到的，期望变换可以是对局部仿射变换矩阵进行幅度限定后的变换。通过设置一个或多个期望关键点，可以使得将初始人物调整到目标姿态的动作更平滑。
具体的,g(*)可以理解为一个随机创建的薄板样条(Thin Plate Spline,TPS),可以使用随机平移、旋转和缩放来确定TPS的全局仿射分量,并通过空间扰动一组控制点来确定局部TPS分量。
本公开的发明人发现,采用自监督的方式训练关键点检测模型,相对于监督方式来说,不可避免的会导致关键点的不稳定乃至不准确。有鉴于此,可通过在初始损失函数的基础上添加同变性约束函数来保证图像语义信息的一致性,例如,在经过动作变换后的图像中,人物的胳膊和腿不会发生错位等。
通过在图像驱动模型的初始损失函数的基础上添加同变性约束函数作为约束条件,可以使向量的权重的分配更均匀,避免权重集中在少数向量上,从而不仅可以降低图像驱动模型训练时使用的数据量和计算的复杂度,还使得图像驱动模型可以自动学习如何准确建立人物图像的关键点与期望生成的人物驱动图像的关键点之间的对应关系,从而可有效提高利用训练得到的图像驱动模型所生成的人物驱动图像中人体结构的准确率和真实性。
在本公开实施例中,通过直接获取第一图像帧和第二图像帧中的人体关键点和对应的热力图,并根据人体关键点和对应的热力图确定姿态特征生成局部仿射变换矩阵,可以获取使第一图像帧中人物模拟第二图像帧中指定的人体姿态所需的像素运动数据和像素遮挡数据,并可以将所述像素运动数据和像素遮挡数据结合第一图像帧生成训练样本,用于对基于深度学习模型的图像驱动模型进行训练。这样,可以有效减少人工标注样本的需求量,显著降低模型训练所需的人工成本。
实施例三
图3A为根据本公开实施例三的一种图像驱动模型训练方法的流程图,本实施例以上述实施例为基础进行具体化。本实施例的方法具体包括:
S301,获取驱动视频。
本公开实施例中未详尽的描述可以参考前述实施例。
S302,获取所述驱动视频中的第一图像帧以及第二图像帧。其中,所述第一图像帧和所述第二图像帧为不同的视频帧,所述第一图像帧包括的人物图像可称为初始人物图像,包括第一图像帧中的多个初始人物像素;所述第二图像帧的包括人物图像的人体姿态可称为指定的人体姿态或目标姿态,与期望的多个目标像素位置关联。
S303,从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵。
S304,将所述局部仿射变换矩阵和所述第一图像帧输入到预先训练得到的密集运动估计模型中,获取所述密集运动估计模型输出的像素运动数据和像素遮挡数据。其中,所述密集运动估计模型包括深度学习模型。所述像素运动数据包括所述第一图像帧中的各初始人物像素指向所述第二图像帧中匹配的目标像素位置的运动方向,所述像素遮挡数据包括所述第一图像帧中多个初始人物像素在通过仿射变换移动到所述第二图像帧中匹配的同一个目标像素位置时相互之间的遮挡顺序关系。
密集运动估计模型用于估计各初始人物像素的运动情况和不同的初始人物像素在运动之后的遮挡顺序。密集运动估计模型可为预先训练的深度学习模型。
初始人物像素为第一图像帧包括的人物的像素。初始人物像素可以包括表示人体关键点的像素。目标像素位置为模拟第二图像帧指定的人体姿态期望初始人物像素移动到的像素位置。目标像素位置不一定是第二图像帧包括的初始人物像素匹配的像素的位置。
像素运动数据用于确定从初始人物像素变换到目标像素位置的运动矢量。通常运动矢量可以是从初始人物像素点指向目标像素位置的方向和大小,具体可以采用光流信息图表示。其中,光流信息图包括多个区域像素集合,每个区域像素集合可以采用箭头方向表示运动方向,箭头大小可表示矢量大小。假设第一图像帧如图3B所示,第二图像帧如图3C所示,第一图像帧中的人物模拟第二图像帧中的人物的动作,相应获取的光流信息图效果可以如图3D所示,每个箭头代表一个像素区域的运动矢量。
像素遮挡数据用于确定不同目标人物像素之间的遮挡顺序关系。目标人物像素可以表示初始人物像素移动到匹配的目标像素位置后形成的像素。第一图像帧中各初始人物 像素通过仿射变换到第二图像帧中匹配的目标像素位置形成相应的目标人物像素后,可能存在多个初始人物像素点分别匹配的多个目标人物像素位于同一个像素位置的情况。当多个目标人物像素位于同一个像素位置时,只展示顶层的目标人物像素,其他目标人物像素作为被遮挡的像素不进行显示。遮挡顺序关系用于描述多个像素的显示顺序,只有置于顶层,即未被遮挡的像素可以显示。其中,可以阴影图表示像素遮挡数据。例如,将图3B所示第一图像帧中的人物模拟图3C所示第二图像帧中的人物的动作,相应获取的阴影图可如图3E所示。图3E中,越暗的地方表示灰度值越低(也就是灰度值接近0),代表该区域被遮挡的程度越高;越亮的地方表示灰度值越高(也就是灰度值接近255),代表该区域被遮挡的程度越低。
可选的,预先通过如下方式训练所述密集运动估计模型:将训练视频中的视频帧与空间转换视频帧的光测误差的最小值作为训练目标,对深度学习模型进行迭代训练,以得到所述密集运动估计模型。其中,所述空间转换视频帧通过将所述训练视频中的视频帧输入到空间转换模型生成,所述训练视频中的视频帧的局部空间特征与所述空间转换视频帧中匹配的局部空间特征相同。
训练视频中的视频帧可以是训练视频中的任意视频帧。空间转换视频帧可以是通过采用空间变换方法对训练视频中的视频帧进行空间变换,所生成的视频帧。训练视频中的视频帧的局部空间特征与空间转换视频帧中匹配的局部空间特征相同,表明训练视频中的视频帧和空间转换视频帧满足空间不变性,也表明空间变换方法满足空间不变性。示例性的,空间变换方法可以是牛津大学的Max Jaderberg,Karen Simonyan等人提出的空间转换模块(spatial transformer modules)实现。
其中，密集运动估计模型的训练目标如公式(3)所示：
L_reconst = (1/N) ∑_{(i,j)} ρ(I_1(i,j), I′_1(i,j))       (3)
其中，N为视频帧包括的像素的总数量，(i,j)为像素的坐标，I_1(i,j)为训练视频中的视频帧的局部空间特征，I′_1(i,j)为空间转换视频帧中匹配的局部空间特征，ρ(*)用于表示训练视频中的视频帧的局部空间特征与空间转换视频帧中匹配的局部空间特征之间的光测误差，如光强差值和光的改变方向。密集运动估计模型的训练目标为最小化L_reconst。
通过训练密集运动估计模型,可以使密集运动估计模型学习到光流的运动特征,从而可以相对准确地自动提取出使第一图像帧中人物模拟第二图像帧中指定的人体姿态所需的像素运动数据和像素遮挡数据,从而有效提高利用所述像素运动数据和像素遮挡数据进行训练得到的图像驱动模型所生成的人物驱动图像中人物的人体遮挡关系的准确性,进而提高该人物驱动图像的真实性。
S305,根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
在一个具体的例子中,如图3F所示,图像驱动模型的训练过程可以是:采用关键点检测模型301从第一图像帧中提取多个初始人物关键点和各初始人物关键点对应的热力图,并根据多个初始人物关键点和各初始人物关键点对应的热力图生成初始局部仿射变换矩阵。可并行地,采用关键点检测模型301从第二图像帧中提取多个目标姿态关键点和各目标姿态关键点对应的热力图,并根据多个目标姿态关键点和各目标姿态关键点对应的热力图,生成目标局部仿射变换矩阵。将初始局部仿射变换矩阵与目标局部仿射变换矩阵相乘,得到局部仿射变换矩阵,输入到密集运动估计模型302中,可以获取像素运动数据和像素遮挡数据。将第一图像帧、像素运动数据和像素遮挡数据作为图像运动样本,对基于深度学习模型的图像驱动模型303进行训练,获取图像驱动模型303输出的人物驱动图像。训练完成的图像驱动模型303可用于生成人物驱动图像。
本公开实施例通过预先训练的密集运动估计模型,从局部仿射变换矩阵和第一图像帧中自动提取出像素运动数据和像素遮挡数据,可以提高所提取的人物像素运动特征的准确率,从而提高利用像素运动数据和像素遮挡数据进行训练得到的图像驱动模型所生成的人物驱动图像中的人物的人体遮挡关系的准确性,并提高该人物驱动图像的真实性。
实施例四
图4A为根据本公开实施例四的一种图像生成方法的流程图,本实施例可适用于使人物图像中的人物模拟指定视频包括的面部表情和/或身体动作。该方法可以由本公开实施例提供的图像生成装置来执行,该装置可采用软件和/或硬件的方式实现,并一般可集成在计算机设备中。该装置包括经过训练的图像驱动模型,其训练方法可参考上述实施例中的方法。如图4A所示,本实施例的方法具体包括以下步骤。
S410,获取人物图像。
人物图像可包括人物的真实图像。例如,人物图像包括人脸和/或人物身体的真实图像。人物图像中可包括至少一个人物,可以根据实际情况指示一个人物选择要被驱动来模拟指定的表情和/或动作的目标人物。例如,可随机选择人物图像中的多个人物之一,或选择在人物图像中所占面积比例最大的人物,或选择人物图像中脸部未被遮挡的人物,作为目标人物。对如何选择人物图像中的目标人物,本公开不作限制。
S420,获取指定视频中的目标视频帧。
指定视频包括时序连续的多个视频帧,每个视频帧都可以看作是一个图像。目标视频帧可以是指定视频中的任意一个视频帧。可以按照视频播放的时序,依次选择指定视频中的视频帧作为目标视频帧,或者也可以从指定视频中随机选取一个视频帧作为目标视频帧,本公开对目标视频帧的选取不作限制。目标视频帧用于获取目标姿态信息。
目标视频帧包括目标姿态信息,该目标姿态信息用于指示人物图像中的目标人物做出指定的人体姿态(面部姿态和/或身体姿态),即将目标视频帧中的人体姿态迁移到人物图像的目标人物中,以使人物图像中的目标人物模拟目标视频帧中的人体姿态,包括驱动目标人物执行目标视频帧指定的身体动作和/或做出面部表情等。目标姿态信息可以包括人物面部特征数据和/或人物身体特征数据。人物面部特征数据可以用于表征目标视频帧中的面部姿态,从而使得人物图像中的目标人物的人脸模拟目标视频帧中的面部姿态,驱动目标人物的面部姿态与目标视频帧中的该面部姿态匹配,例如,可以驱动目标人物的面部做出与目标视频中相同的表情。人物面部特征数据可以表征下述至少一项:人物脸部的方向、脸部轮廓和各器官的位置等。人物身体特征数据可以用于表征目标视频帧中人物的身体姿态,从而使得人物图像中的目标人物的身体模拟目标视频帧中人物的身体姿态,驱动目标人物的身体姿态与目标视频帧中的该身体姿态匹配,例如,驱动目标人物的身体做出与目标视频中相同的动作。人物身体特征数据可以包括下述至少一项:人物头部位置和方向、人物躯干位置和方向、以及人物四肢位置和方向等。
可选的,获取指定视频中的目标视频帧可以包括:获取指定视频;获取所述指定视频中的首个视频帧作为目标视频帧;在生成所述目标视频帧对应的人物驱动图像之后,选择时序上在该目标视频帧后的下一视频帧作为新的目标视频帧,并再次生成所述目标视频帧对应的人物驱动图像,直至获取所述指定视频中最后一个视频帧作为目标视频帧,并生成所述最后一个视频帧对应的人物驱动图像。以此类推,可以根据指定视频生成多个人物驱动图像,并按照时序进行排列,可以生成与指定视频中特定人物的表情和/或动作一致的人物驱动视频,最终实现驱动人物图像中的目标人物做出与指定视频相匹配的各个人体姿态,执行匹配的动作,和/或做出相匹配的面部表情等。此外,由于相邻视频帧的差异很小,还可以选择每间隔设定时长获取一个视频帧作为目标视频帧,例 如,设定时长为0.5s。
S430,将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像,所述图像驱动模型通过如本公开实施例中任一项所述的图像驱动模型的训练方法训练生成。
图像驱动模型用于生成人物图像中目标人物模拟目标视频帧中人体姿态的人物驱动图像,实际是驱动人物图像中目标人物做出与目标视频帧匹配的人体姿态。其中,人物驱动图像可以是包括人物图像中的目标人物,且人物驱动图像中所述目标人物的人体姿态与目标视频帧包括的目标姿态信息匹配。此外,人物驱动图像中除所述目标人物以外的图像区域均与人物图像中的相同。也就是说,人物驱动图像中的背景和人物图像的相同,人物驱动图像中作为前景的人物和人物图像中的目标人物相同,但是人物驱动图像中作为前景的人物的动作和目标视频中指定的人物的动作相同。
实际上,人物图像用于提供待驱动的目标人物。目标视频帧用于指定人体姿态。图像驱动模型用于将目标人物与指定的人体姿态进行合成,生成能够展现出该指定的人体姿态的人物图像,作为人物驱动图像。
图像驱动模型为预先训练的深度学习模型,示例性的,图像驱动模型可以为图3F所示的训练完成的图像驱动模型303。
在一些可选实施例中,在获取人物图像和指定视频中的目标视频帧后,可以先将它们输入如图3F所示的训练完成的关键点检测模型301,从而从所述人物图像帧中提取初始姿态特征,从所述目标视频帧中提取目标姿态特征,并生成从所述初始姿态特征指向目标姿态特征的局部仿射变换矩阵。然后,可以将所述局部仿射变换矩阵和所述人物图像输入如图3F所示的训练完成的密集运动估计模型302,从而根据所述局部仿射变换矩阵和所述人物图像,生成像素运动数据和像素遮挡数据。最后,可以将所述人物图像、所述像素运动数据和所述像素遮挡数据输入到如图3F所示的预先训练得到的图像驱动模型303中,获取所述图像驱动模型输出的人物驱动图像。其中,所述图像驱动模型通过如本公开实施例中任一项所述的图像驱动模型训练方法训练生成。
图像驱动模型通过如本公开实施例中任一项所述的图像驱动模型训练方法训练生成。具体的,将人物图像按照目标视频帧进行驱动,实际是将人物图像中的初始人物像素调整成与目标视频帧中指定的人体姿态匹配的目标像素。其中,目标像素不是目标视频帧中的真实像素,而是根据目标视频帧中指定的人体姿态确定的初始人物像素期望变换后的像素。初始人物像素包括表示人体关键点的像素。根据初始人物像素和匹配的目标像素可以确定对初始人物像素的调整数据,具体可以包括初始人物像素的运动情况和多个初始人物像素匹配的目标像素之间的遮挡情况。换言之,可以根据人物图像和目标视频帧确定相应的像素运动数据和像素遮挡数据,并基于所述像素运动数据和像素遮挡数据采用如本公开实施例中任一项所述的图像驱动模型训练方法来训练得到图像驱动模型。
在一个例子中,如图4B所示,初始人物图像包括左边第一列的两张图像;依次将上面第一行的三张图像作为指定视频中的目标视频帧,所形成的人物驱动图像可分别如图4B的第二行的第二列到第四列,和第三行的第二列到第四列所示。
本公开实施例通过采用第一图像帧以及驱动信息关联的人物像素运动数据和人物像素遮挡数据作为训练样本,对图像驱动模型进行训练,可以使图像驱动模型自动学习到遮挡特征,从而可以提高利用训练得到的该图像驱动模型生成的人物驱动图像中人物的人体遮挡关系准确性,进而提高该人物驱动图像的真实性。
实施例五
图5为根据本公开实施例五的一种图像驱动模型训练装置的示意图。实施例五是实现本公开上述实施例提供的图像驱动模型训练方法的相应装置,该装置可采用软件和/或硬件的方式实现,并一般可集成在计算机设备中等。如图5所示,该图像驱动模型训练装置包括:
图像获取模块510,用于获取第一图像帧以及第二图像帧,所述第一图像帧和所述第二图像帧可为同一视频中不同的视频帧,所述第一图像帧包括的人物图像可称为初始人物图像,所述第二图像帧包括的人物图像的人体姿态可称为指定的人体姿态;
特征提取模块520,用于从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵;
数据生成模块530,用于根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据;
模型训练模块540,用于根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
本公开实施例通过采用第一图像帧以及驱动信息关联的像素运动数据和像素遮挡数据作为训练样本,对图像驱动模型进行训练,可以使图像驱动模型自动学习到遮挡特征,从而有效保证利用训练得到的该图像驱动模型输出的人物驱动图像中人物的人体遮挡关系准确性,并有效提高该人物驱动图像的真实性。
进一步的,所述特征提取模块520包括局部仿射变换矩阵计算单元,用于:将所述第一图像帧输入到关键点检测模型中,获取所述关键点检测模型输出的多个初始人物关键点和各所述初始人物关键点对应的热力图;根据各所述初始人物关键点和对应的热力图,生成初始局部仿射变换矩阵作为初始姿态特征;将所述第二图像帧输入到所述关键点检测模型中,获取所述关键点检测器输出的多个目标姿态关键点和各所述目标姿态关键点对应的热力图;根据各所述目标姿态关键点和对应的热力图,生成目标局部仿射变换矩阵作为目标姿态特征;将所述初始局部仿射变换矩阵与所述目标局部仿射变换矩阵相乘,获取从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵。
进一步的,所述局部仿射变换矩阵计算单元可用于:获取各所述初始人物关键点或各所述目标姿态关键点的坐标,以及匹配的置信度;根据各所述初始人物关键点或各所述目标姿态关键点的坐标以及匹配的置信度,生成分别与各所述初始人物关键点或各所述目标姿态关键点匹配的热力图区域;针对每个所述初始人物关键点或各所述目标姿态关键点匹配的热力图区域,将所述热力图区域转换为设定规则形状的热力图区域,并获取所述设定规则形状的热力图区域对应的局部仿射变换矩阵,确定为所述初始人物关键点或所述目标姿态关键点对应的局部仿射变换矩阵;将各所述初始人物关键点或各所述目标姿态关键点对应的局部仿射变换矩阵,确定为初始局部仿射变换矩阵或所述目标局部仿射变换矩阵。
进一步的,所述关键点检测模型包括U型网络。
进一步的,所述数据生成模块530可用于:将所述局部仿射变换矩阵和所述第一图像帧输入到预先训练的密集运动估计模型中,获取所述密集运动估计模型输出的像素运动数据和像素遮挡数据。其中,所述密集运动估计模型包括深度学习模型,所述像素运动数据包括所述第一图像帧中初始人物像素指向所述第二图像帧中匹配的目标像素的运动方向,所述像素遮挡数据包括所述第一图像帧中多个初始人物像素通过仿射变换到所述第二图像帧中匹配的目标像素位置所形成的目标像素之间的遮挡顺序关系。
进一步的,所述图像驱动模型训练装置还可包括密集运动估计模型训练模块,用于如下训练所述密集运动估计模型:将训练视频中的视频帧与空间转换视频帧的光测误差的最小值作为训练目标,对深度学习模型进行训练以生成密集运动估计模型。其中,所述空间转换视频帧通过将所述训练视频中的视频帧输入到空间转换模型生成,所述训练视频中的视频帧的局部空间特征与所述空间转换视频帧中匹配的局部空间特征相同。
进一步的,所述模型训练模块540可用于:根据损失函数配置信息计算所述深度学习模型的损失函数,所述损失函数配置信息用于在初始损失函数的基础上添加同变性约束函数作为约束条件,所述同变性约束函数通过对初始人物关键点进行空间变换后的坐标与期望关键点的坐标之间的差值确定;如果确定所述损失函数满足稳定条件,则将当前训练得到的深度学习模型确定为所述图像驱动模型。
上述图像生成装置可执行本公开实施例任一所提供的图像驱动模型训练方法,具备执行的图像驱动模型训练方法相应的功能模块和有益效果。
实施例六
图6为根据本公开实施例六的一种图像生成装置的示意图。实施例六是实现本公开上述实施例提供的图像生成方法的相应装置,该装置可采用软件和/或硬件的方式实现,并一般可集成在计算机设备中等。
相应的,本实施例的装置可以包括:
人物图像获取模块610,用于获取人物图像;
目标视频帧获取模块620,用于获取指定视频中的目标视频帧;
人物驱动图像生成模块630,用于将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像,所述图像驱动模型通过如前述任一实施例所述的图像驱动模型训练方法训练生成。
本公开实施例通过采用第一图像帧以及驱动信息关联的像素运动数据和像素遮挡数据作为训练样本,对图像驱动模型进行训练,可以使图像驱动模型自动学习到遮挡特征,从而有效保证利用训练得到的该图像驱动模型输出的人物驱动图像中的人物的人体遮挡关系准确性,进而提高该人物驱动图像的真实性。
上述图像生成装置可执行本公开实施例任一所提供的图像生成方法,具备执行的图像生成方法相应的功能模块和有益效果。
实施例七
图7为根据本公开实施例七提供的一种计算机设备的结构示意图。图7示出了适于用来实现本公开实施方式的示例性计算机设备12的框图。图7显示的计算机设备12仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图7所示,计算机设备12以通用计算设备的形式表现。计算机设备12的组件可以包括但不限于:一个或者多个处理器或者处理单元16,系统存储器28,连接不同系统组件(包括系统存储器28和处理单元16)的总线18。
总线18表示几类总线结构中的一种或多种,包括存储器总线、外围总线、或者使用多种总线结构中的任意总线结构。举例来说,这些总线结构包括但不限于工业标准体系结构(Industry Standard Architecture,ISA)总线、微通道体系结构(Micro Channel Architecture,MCA)总线、增强型ISA总线、视频电子标准协会(Video Electronics Standards Association,VESA)局域总线以及外围组件互连(Peripheral Component Interconnect,PCI)总线。
计算机设备12典型地包括多种计算机系统可读介质。这些介质可以是任何能够被计算机设备12访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。
系统存储器28可以包括易失性存储器形式的计算机系统可读介质,例如随机存取存储器(RAM)30和/或高速缓存存储器32。计算机设备12可以进一步包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例,存储系统34可以用于读写不可移动的、非易失性磁介质(图7未显示,通常称为“硬盘运动器”)。尽管图7中未示出,可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘运动器,以及对可移动非易失性光盘(例如紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM),数字视盘(Digital Video Disc-Read Only Memory,DVD-ROM))或者其它光介质读写的光盘运动器。在这些情况下,每个运动器可以通过一个或者多个数据介质接口与总线18相连。系统存储器28可以存储至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块,这些程序模块被配置以执行本公开各实施例的功能。
具有一组(至少一个)程序模块42的程序/实用工具40,可以存储在例如系统存储器28中。程序模块42包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据。这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块42通常执行本公开所描述的实施例中的功能和/或方法。
计算机设备12也可以与一个或多个外部设备14(例如键盘、指向设备、显示器24等)通信,还可与一个或者多个使得用户能与该计算机设备12交互的设备通信,和/或与使得该计算机设备12能与一个或多个其它计算设备进行通信的任何设备(例如网卡,调制解调器等等)通信。这种通信可以通过输入/输出(Input/Output,I/O)接口22进行。并且,计算机设备12还可以通过网络适配器20与一个或者多个网络(例如局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN)通信。如图所示,网络适配器20通过总线18与计算机设备12的其它模块通信。应当明白,尽管图7中未示出,可以结合计算机设备12使用其它硬件和/或软件模块,包括但不限于微代码、设备运动器、冗余处理单元、外部磁盘运动阵列、(Redundant Arrays of Inexpensive Disks,RAID)系统、磁带运动器以及数据备份存储系统等。
处理单元16通过运行存储在系统存储器28中的程序模块42,从而执行各种功能应用以及数据处理,例如实现本公开任意实施例所提供的一种图像驱动模型训练和/或图像生成方法。
实施例八
本公开实施例八提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本申请所有公开实施例提供的图像驱动模型训练方法,或者实现如本申请所有公开实施例提供的图像生成方法。
本公开实施例的计算机存储介质,可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机磁盘、硬盘、RAM、只读存储器(Read Only Memory,ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、闪存、便携式CD-ROM、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、电线、光缆、无线电频率(Radio Frequency,RF)等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开操作的计算机程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络(包括LAN或WAN)连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
上述仅为本公开的较佳实施例及所运用技术原理。本领域技术人员会理解,本公开不限于这里所述的特定实施例,对本领域技术人员来说能够进行各种明显的变化、重新调整和替代而不会脱离本公开的保护范围。因此,虽然通过以上实施例对本公开进行了较为详细的说明,但是本公开不仅仅限于以上实施例,在不脱离本公开构思的情况下,还可以包括更多其他等效实施例,而本公开的范围由所附的权利要求范围决定。

Claims (14)

  1. 一种图像驱动模型训练方法,包括:
    获取第一图像帧以及第二图像帧;
    从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵;
    根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据;
    根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
  2. 根据权利要求1所述的方法,其特征在于,所述从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵,包括:
    将所述第一图像帧输入到关键点检测模型中,获取所述关键点检测模型输出的多个初始人物关键点和各所述初始人物关键点对应的热力图;
    根据各所述初始人物关键点和各所述初始人物关键点对应的热力图,生成初始局部仿射变换矩阵,作为初始姿态特征;
    将所述第二图像帧输入到所述关键点检测模型中,获取所述关键点检测器输出的多个目标姿态关键点和各所述目标姿态关键点对应的热力图;
    根据各所述目标姿态关键点和各所述目标姿态关键点对应的热力图,生成目标局部仿射变换矩阵,作为目标姿态特征;
    将所述初始局部仿射变换矩阵与所述目标局部仿射变换矩阵相乘,获取从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵。
  3. 根据权利要求2所述的方法,其特征在于,所述根据各所述初始人物关键点和各所述初始人物关键点对应的热力图,生成初始局部仿射变换矩阵,包括:
    针对各所述初始人物关键点,
    获取所述初始人物关键点的坐标,以及匹配的置信度;
    根据所述初始人物关键点的坐标以及匹配的置信度,生成与所述初始人物关键点匹配的热力图区域;
    将所述热力图区域转换为设定规则形状的热力图区域,并
    获取所述设定规则形状的热力图区域对应的局部仿射变换矩阵,作为所述初始人物关键点对应的局部仿射变换矩阵;
    基于各所述初始人物关键点对应的局部仿射变换矩阵,确定所述初始局部仿射变换矩阵。
  4. 根据权利要求2或3所述的方法,其特征在于,所述根据各所述目标姿态关键点和各所述目标姿态关键点对应的热力图,生成目标局部仿射变换矩阵,包括:
    针对各所述目标姿态关键点,
    获取所述目标姿态关键点的坐标,以及匹配的置信度;
    根据所述目标姿态关键点的坐标以及匹配的置信度,生成与所述目标姿态关键点匹配的热力图区域;
    将所述热力图区域转换为设定规则形状的热力图区域,并
    获取所述设定规则形状的热力图区域对应的局部仿射变换矩阵,作为所述目标姿态关键点对应的局部仿射变换矩阵;
    基于各所述目标姿态关键点对应的局部仿射变换矩阵,确定所述目标局部仿射变换矩阵。
  5. 根据权利要求2所述的方法,其特征在于,所述关键点检测模型包括U型网络。
  6. 根据权利要求1所述的方法,其特征在于,所述根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据,包括:
    将所述局部仿射变换矩阵和所述第一图像帧输入到预先训练的密集运动估计模型中,获取所述密集运动估计模型输出的像素运动数据和像素遮挡数据;
    其中,所述像素运动数据包括所述第一图像帧中各初始人物像素指向所述第二图像帧中匹配的目标像素位置的运动方向,
    所述像素遮挡数据包括所述第一图像帧中各所述初始人物像素通过仿射变换到所述第二图像帧中匹配的目标像素位置所形成的目标像素之间的遮挡顺序关系。
  7. 根据权利要求6所述的方法,其特征在于,通过如下方式训练所述密集运动估计模型:
    将训练视频中的视频帧与空间转换视频帧的光测误差的最小值作为训练目标,对基于深度学习模型的所述密集运动估计模型迭代训练;
    其中,所述空间转换视频帧通过将所述训练视频中的视频帧输入到空间转换模型生成,所述训练视频中的视频帧的局部空间特征与所述空间转换视频帧中匹配的局部空间特征相同。
  8. 根据权利要求2所述的方法,其特征在于,所述根据所述第一图像帧、像素运动数据和像素遮挡数据,训练基于深度学习模型的图像驱动模型,包括:
    根据损失函数配置信息计算所述深度学习模型的损失函数,所述损失函数配置信息用于在所述深度学习模型的初始损失函数的基础上添加同变性约束函数,所述同变性约束函数通过对初始人物关键点进行空间变换后的坐标与期望关键点的坐标之间的差值确定;
    如果确定所述损失函数满足稳定条件,则将当前训练得到的深度学习模型确定为所述图像驱动模型。
  9. 根据权利要求8所述的方法,其特征在于,所述同变性约束函数包括欧氏距离范数。
  10. 一种图像生成方法,包括:
    获取人物图像;
    获取指定视频中的目标视频帧;
    将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像,所述图像驱动模型通过如权利要求1至9任一项所述的方法训练生成。
  11. 一种图像驱动模型训练装置,包括:
    图像获取模块,用于获取第一图像帧以及第二图像帧;
    特征提取模块,用于从所述第一图像帧提取初始姿态特征,以及从所述第二图像帧提取目标姿态特征,并生成从所述初始姿态特征指向所述目标姿态特征的局部仿射变换矩阵;
    数据生成模块,用于根据所述局部仿射变换矩阵和所述第一图像帧,生成像素运动数据和像素遮挡数据;
    模型训练模块,用于根据所述第一图像帧、所述像素运动数据和所述像素遮挡数据,训练基于深度学习模型的图像驱动模型。
  12. 一种图像生成装置,包括:
    人物图像获取模块,用于获取人物图像;
    目标视频帧获取模块,用于获取指定视频中的目标视频帧;
    人物驱动图像生成模块,用于将所述人物图像和所述目标视频帧输入到预先训练的图像驱动模型中,获取所述图像驱动模型输出的人物驱动图像,所述图像驱动模型通过如权利要求1至9任一项所述的方法训练生成。
  13. 一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其特征在于,所述处理器执行所述程序时实现如权利要求1至9中任一 所述的图像驱动模型训练方法,或实现如权利要求10所述的图像生成方法。
  14. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现如权利要求1至9中任一所述的图像驱动模型训练方法,或实现如权利要求10所述的图像生成方法。
PCT/CN2021/103042 2020-06-29 2021-06-29 图像驱动模型训练、图像生成 WO2022002032A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010610862.1A CN111797753B (zh) 2020-06-29 2020-06-29 图像驱动模型的训练、图像生成方法、装置、设备及介质
CN202010610862.1 2020-06-29

Publications (1)

Publication Number Publication Date
WO2022002032A1 true WO2022002032A1 (zh) 2022-01-06

Family

ID=72809861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103042 WO2022002032A1 (zh) 2020-06-29 2021-06-29 图像驱动模型训练、图像生成

Country Status (2)

Country Link
CN (1) CN111797753B (zh)
WO (1) WO2022002032A1 (zh)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114571463A (zh) * 2022-03-28 2022-06-03 达闼机器人股份有限公司 动作检测方法、装置、可读存储介质及电子设备
CN114663594A (zh) * 2022-03-25 2022-06-24 中国电信股份有限公司 图像特征点检测方法、装置、介质及设备
CN114663511A (zh) * 2022-03-28 2022-06-24 京东科技信息技术有限公司 一种图像生成方法、装置、设备及存储介质
CN114842508A (zh) * 2022-05-20 2022-08-02 合肥工业大学 一种基于深度图匹配的可见光-红外行人重识别方法
CN115375802A (zh) * 2022-06-17 2022-11-22 北京百度网讯科技有限公司 动态图像的生成方法、装置、存储介质及电子设备
CN116071825A (zh) * 2023-01-31 2023-05-05 天翼爱音乐文化科技有限公司 一种动作行为识别方法、系统、电子设备及存储介质
CN116612495A (zh) * 2023-05-05 2023-08-18 阿里巴巴(中国)有限公司 图像处理方法及装置
WO2023160448A1 (zh) * 2022-02-24 2023-08-31 北京字跳网络技术有限公司 图像处理方法、装置、设备及存储介质
CN117315792A (zh) * 2023-11-28 2023-12-29 湘潭荣耀智能科技有限公司 一种基于卧姿人体测量的实时调控系统
CN117541478A (zh) * 2022-02-28 2024-02-09 荣耀终端有限公司 图像处理方法及其相关设备
CN117593449A (zh) * 2023-11-07 2024-02-23 书行科技(北京)有限公司 人-物交互运动视频的构建方法、装置、设备及存储介质
CN117788982A (zh) * 2024-02-26 2024-03-29 中国铁路设计集团有限公司 基于铁路工程地形图成果的大规模深度学习数据集制作方法

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797753B (zh) * 2020-06-29 2024-02-27 北京灵汐科技有限公司 图像驱动模型的训练、图像生成方法、装置、设备及介质
CN112348743B (zh) * 2020-11-06 2023-01-31 天津大学 一种融合判别式网络和生成式网络的图像超分辨率方法
CN112183506A (zh) * 2020-11-30 2021-01-05 成都市谛视科技有限公司 一种人体姿态生成方法及其系统
CN113762017B (zh) * 2021-01-13 2024-04-16 北京京东振世信息技术有限公司 一种动作识别方法、装置、设备及存储介质
CN113284041B (zh) * 2021-05-14 2023-04-18 北京市商汤科技开发有限公司 一种图像处理方法、装置、设备及计算机存储介质
TWI787841B (zh) * 2021-05-27 2022-12-21 中強光電股份有限公司 影像識別方法
CN113507627B (zh) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 视频生成方法、装置、电子设备及存储介质
CN115708120A (zh) * 2021-08-10 2023-02-21 腾讯科技(深圳)有限公司 脸部图像处理方法、装置、设备以及存储介质
CN113870313B (zh) * 2021-10-18 2023-11-14 南京硅基智能科技有限公司 一种动作迁移方法
CN113870314B (zh) * 2021-10-18 2023-09-19 南京硅基智能科技有限公司 一种动作迁移模型的训练方法及动作迁移方法
CN114783039B (zh) * 2022-06-22 2022-09-16 南京信息工程大学 一种3d人体模型驱动的运动迁移方法
CN114998814B (zh) * 2022-08-04 2022-11-15 广州此声网络科技有限公司 目标视频生成方法、装置、计算机设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507216A (zh) * 2017-08-17 2017-12-22 北京觅己科技有限公司 图像中局部区域的替换方法、装置及存储介质
CN107564080A (zh) * 2017-08-17 2018-01-09 北京觅己科技有限公司 一种人脸图像的替换系统
CN109558832A (zh) * 2018-11-27 2019-04-02 广州市百果园信息技术有限公司 一种人体姿态检测方法、装置、设备及存储介质
CN111797753A (zh) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 图像驱动模型的训练、图像生成方法、装置、设备及介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7508990B2 (en) * 2004-07-30 2009-03-24 Euclid Discoveries, Llc Apparatus and method for processing video data
CN109978754A (zh) * 2017-12-28 2019-07-05 广东欧珀移动通信有限公司 图像处理方法、装置、存储介质及电子设备
CN109492608B (zh) * 2018-11-27 2019-11-05 腾讯科技(深圳)有限公司 图像分割方法、装置、计算机设备及存储介质
CN109492624A (zh) * 2018-12-29 2019-03-19 北京灵汐科技有限公司 一种人脸识别方法、特征提取模型的训练方法及其装置
CN110827342B (zh) * 2019-10-21 2023-06-02 中国科学院自动化研究所 三维人体模型重建方法及存储设备、控制设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507216A (zh) * 2017-08-17 2017-12-22 北京觅己科技有限公司 图像中局部区域的替换方法、装置及存储介质
CN107564080A (zh) * 2017-08-17 2018-01-09 北京觅己科技有限公司 一种人脸图像的替换系统
CN109558832A (zh) * 2018-11-27 2019-04-02 广州市百果园信息技术有限公司 一种人体姿态检测方法、装置、设备及存储介质
CN111797753A (zh) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 图像驱动模型的训练、图像生成方法、装置、设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALIAKSANDR SIAROHIN; STÉPHANE LATHUILIÈRE; SERGEY TULYAKOV; ELISA RICCI; NICU SEBE: "First Order Motion Model for Image Animation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 February 2020 (2020-02-29), XP081610932 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023160448A1 (zh) * 2022-02-24 2023-08-31 北京字跳网络技术有限公司 图像处理方法、装置、设备及存储介质
CN117541478A (zh) * 2022-02-28 2024-02-09 荣耀终端有限公司 图像处理方法及其相关设备
CN114663594A (zh) * 2022-03-25 2022-06-24 中国电信股份有限公司 图像特征点检测方法、装置、介质及设备
CN114663511A (zh) * 2022-03-28 2022-06-24 京东科技信息技术有限公司 一种图像生成方法、装置、设备及存储介质
CN114571463A (zh) * 2022-03-28 2022-06-03 达闼机器人股份有限公司 动作检测方法、装置、可读存储介质及电子设备
CN114571463B (zh) * 2022-03-28 2023-10-20 达闼机器人股份有限公司 动作检测方法、装置、可读存储介质及电子设备
CN114842508B (zh) * 2022-05-20 2024-03-01 合肥工业大学 一种基于深度图匹配的可见光-红外行人重识别方法
CN114842508A (zh) * 2022-05-20 2022-08-02 合肥工业大学 一种基于深度图匹配的可见光-红外行人重识别方法
CN115375802A (zh) * 2022-06-17 2022-11-22 北京百度网讯科技有限公司 动态图像的生成方法、装置、存储介质及电子设备
CN116071825A (zh) * 2023-01-31 2023-05-05 天翼爱音乐文化科技有限公司 一种动作行为识别方法、系统、电子设备及存储介质
CN116071825B (zh) * 2023-01-31 2024-04-19 天翼爱音乐文化科技有限公司 一种动作行为识别方法、系统、电子设备及存储介质
CN116612495A (zh) * 2023-05-05 2023-08-18 阿里巴巴(中国)有限公司 图像处理方法及装置
CN116612495B (zh) * 2023-05-05 2024-04-30 阿里巴巴(中国)有限公司 图像处理方法及装置
CN117593449A (zh) * 2023-11-07 2024-02-23 书行科技(北京)有限公司 人-物交互运动视频的构建方法、装置、设备及存储介质
CN117315792A (zh) * 2023-11-28 2023-12-29 湘潭荣耀智能科技有限公司 一种基于卧姿人体测量的实时调控系统
CN117315792B (zh) * 2023-11-28 2024-03-05 湘潭荣耀智能科技有限公司 一种基于卧姿人体测量的实时调控系统
CN117788982A (zh) * 2024-02-26 2024-03-29 中国铁路设计集团有限公司 基于铁路工程地形图成果的大规模深度学习数据集制作方法

Also Published As

Publication number Publication date
CN111797753A (zh) 2020-10-20
CN111797753B (zh) 2024-02-27

Similar Documents

Publication Publication Date Title
WO2022002032A1 (zh) 图像驱动模型训练、图像生成
US10546408B2 (en) Retargeting skeleton motion sequences through cycle consistency adversarial training of a motion synthesis neural network with a forward kinematics layer
WO2021254499A1 (zh) 编辑模型生成、人脸图像编辑方法、装置、设备及介质
WO2020006961A1 (zh) 用于提取图像的方法和装置
US10565792B2 (en) Approximating mesh deformations for character rigs
US20220044352A1 (en) Cross-domain image translation
JP2016218999A (ja) ターゲット環境の画像内に表現されたオブジェクトを検出するように分類器をトレーニングする方法およびシステム
WO2020048484A1 (zh) 超分辨图像重建方法、装置、终端和存储介质
CN114144790A (zh) 具有三维骨架正则化和表示性身体姿势的个性化语音到视频
CN112085835B (zh) 三维卡通人脸生成方法、装置、电子设备及存储介质
US20220358675A1 (en) Method for training model, method for processing video, device and storage medium
WO2024032464A1 (zh) 三维人脸重建方法及其装置、设备、介质、产品
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
US20230037339A1 (en) Contact-aware retargeting of motion
WO2021212411A1 (en) Kinematic interaction system with improved pose tracking
AU2022241513B2 (en) Transformer-based shape models
US20240062495A1 (en) Deformable neural radiance field for editing facial pose and facial expression in neural 3d scenes
US20240013477A1 (en) Point-based neural radiance field for three dimensional scene representation
US11715248B2 (en) Deep relightable appearance models for animatable face avatars
US20220301348A1 (en) Face reconstruction using a mesh convolution network
US20240013357A1 (en) Recognition system, recognition method, program, learning method, trained model, distillation model and training data set generation method
CN115775300A (zh) 人体模型的重建方法、人体重建模型的训练方法及装置
Wang et al. Generative model with coordinate metric learning for object recognition based on 3D models
US20230326137A1 (en) Garment rendering techniques
CN117576248B (zh) 基于姿态引导的图像生成方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21834160

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.05.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21834160

Country of ref document: EP

Kind code of ref document: A1