WO2023155533A1 - Image driving method and apparatus, device and medium - Google Patents

Image driving method and apparatus, device and medium

Info

Publication number
WO2023155533A1
WO2023155533A1 (PCT/CN2022/134869)
Authority
WO
WIPO (PCT)
Prior art keywords
image
driving
target
body part
motion information
Prior art date
Application number
PCT/CN2022/134869
Other languages
French (fr)
Chinese (zh)
Inventor
唐斯伟
朱昊
吴文岩
范蕤
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023155533A1 publication Critical patent/WO2023155533A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Definitions

  • the embodiments of the present disclosure relate to the technical field of computer vision, and in particular to an image driving method and apparatus, a device, and a medium.
  • facial image driving means that, given a facial video, the facial movements in that video can be transferred to a facial image specified by the user.
  • when the face of a specific user is to be driven, a video of that specific user needs to be obtained for driving; this driving method is cumbersome and has low processing efficiency.
  • an image driving method, comprising: acquiring a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; determining motion information of a plurality of pixels in the target image based on the correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and adjusting the plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  • an image driving device, comprising: an acquisition module configured to acquire a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; a pixel motion module configured to determine motion information of a plurality of pixels in the target image based on the correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and an image adjustment module configured to adjust the plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  • in a third aspect, an electronic device includes a memory and a processor, the memory being used to store computer instructions executable on the processor, and the processor being used to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • a computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • the image driving method provided by the embodiments of the present disclosure adjusts the pixels in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so as to directly deform the target image, which makes the body part in the target image present the same action as the body part in the driving reference image.
  • in addition, a single driving reference image including the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • Fig. 1A is a flowchart of an image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 1B shows a key point mode according to at least one embodiment of the present disclosure.
  • Fig. 1C shows another key point mode according to at least one embodiment of the present disclosure.
  • Fig. 2A is a flowchart of another image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 2B is a schematic diagram of a target image shown in at least one embodiment of the present disclosure.
  • Fig. 2C is a schematic diagram of a driving reference image shown in at least one embodiment of the present disclosure.
  • Fig. 2D is a schematic diagram of key points of a body part of a target object shown in at least one embodiment of the present disclosure.
  • Fig. 2E is a schematic diagram of key points of a body part of a reference object shown in at least one embodiment of the present disclosure.
  • Fig. 2F shows a facial key point mode according to at least one embodiment of the present disclosure.
  • Fig. 2G shows a driving effect image according to at least one embodiment of the present disclosure.
  • Fig. 3 is a schematic structural diagram of an image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 4 is a schematic structural diagram of a motion module in an image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 5 is a flowchart of another image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 6 is a schematic structural diagram of another image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of another image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 8 is a block diagram of an image driving device shown in at least one embodiment of the present disclosure.
  • Fig. 9 is a block diagram of another image driving device shown in at least one embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to at least one embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure, and the method may include steps 102 to 106 .
  • in step 102, a target image and a driving reference image are acquired.
  • the target image includes the body parts of the target object
  • the driving reference image includes the body parts of the reference object presenting the reference action.
  • the method of this embodiment aims at transferring the reference action of the body part of the reference object to the body part of the target object, so that the body part of the target object can present the reference action.
  • the target object is a driven object. This embodiment does not limit the scope of the target object.
  • the target object may include a real person, animation character, cartoon image or doll in the target image.
  • the body part of the target object in the target image can be any body part such as head, face, limbs (such as hands, legs, torso, etc.), or a combination of at least two body parts.
  • the reference object may include a real person, animation character, cartoon image, or doll in the driving reference image.
  • the reference action presented by the reference object in the driving reference image may include any action; for example, when the body part is a face, the reference action presented by the reference object may be various facial expressions, such as frowning, turning the head, opening the mouth, etc.; when the body parts are limbs, the reference actions presented by the reference object may be various gesture actions, such as walking, greeting, raising hands, and so on.
  • for example, when the mouth action of a real person is used to drive an animation character, the image containing the mouth of the real person is the driving reference image, and the image containing the mouth of the animation character is the target image.
  • This embodiment does not limit the acquisition manners of the target image and the driving reference image.
  • a single target image designated by the user or uploaded by the user may be acquired.
  • for the driving reference image, a single driving reference image or multiple driving reference images specified or uploaded by the user may be acquired; alternatively, a driving video specified or uploaded by the user may first be acquired, and multiple driving reference images may then be obtained from the driving video.
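  • As an illustration of obtaining multiple driving reference images from a driving video, the following sketch picks evenly spaced frame indices; `sample_driving_frames` is a hypothetical helper introduced here for illustration, and actual frame decoding depends on the video library used:

```python
def sample_driving_frames(num_video_frames, num_reference_images):
    """Pick evenly spaced frame indices from a driving video so that
    multiple driving reference images can be taken from it (a minimal
    index-selection sketch; decoding the frames is library-dependent)."""
    if num_reference_images >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / num_reference_images
    return [int(i * step) for i in range(num_reference_images)]

# Take 5 driving reference images from a 100-frame driving video.
print(sample_driving_frames(100, 5))  # -> [0, 20, 40, 60, 80]
```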
  • in step 104, based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, motion information of a plurality of pixels in the target image is determined.
  • the motion information is used to adjust the motion of the body part of the target object to the reference motion.
  • the key points of a body part are used to describe the semantic information and position information of the body part, where the semantic information of the body part includes the category information of the body part; a key point of a body part can be any point in the region of the body part or in the region around the body part.
  • when the body part is a face, the key points of the body part are facial key points.
  • the facial key points may include points on the skin of the face or on the facial features, points on the contour line of the face and the contour lines of the facial features, points on the hair or neck, and may also include points on worn objects such as glasses, earrings, and hats.
  • when the body part is a limb, the key points of the body part are limb key points.
  • the limb key points can include points on the skeleton such as hand joints, elbow joints, the top of the head, ankle joints, and the tailbone, and can also include points on worn objects such as clothes and backpacks.
  • when the body part presents different actions, the positions of the key points of the body part can be different.
  • when the positions of the key points of the body part form different position combinations, this can indicate that the body part presents different actions.
  • This embodiment does not limit the number and position distribution of the key points of the body parts used, and key points with different numbers and position distributions can form different key point patterns.
  • for example, the 83 facial key points shown in Fig. 1B are respectively located on the contour line of the face and the contour lines of the eyes, eyebrows, nose and mouth; this position distribution of the 83 facial key points can be referred to as mode 1.
  • the 49 facial key points shown in FIG. 1C are respectively located on the contour lines of facial features.
  • the 49 facial key points distributed in different positions shown in FIG. 1C can be called mode 2.
  • the modes used to describe key points of body parts in the target object and the reference object are the same.
  • for example, when the facial key points of the target object include the 83 points shown in Fig. 1B, the facial key points of the reference object also include the 83 points shown in Fig. 1B.
  • the facial key points of the target object include 49 points on the facial features contour line shown in Figure 1C
  • the facial key points of the reference object also include 49 points on the facial features contour line shown in Figure 1C.
  • each key point of the body part in the target object and each key point of the body part in the reference object may be respectively extracted first.
  • the number and definition of the key points of the body parts in the target object and the key points of the body parts in the reference object are the same, and the key points of the body parts in the target object are in one-to-one correspondence with the key points of the body parts in the reference object.
  • the position combination of the key points of the body part of the reference object reflects the reference action presented by the body part of the reference object in the driving reference image; by adjusting the position combination of the key points of the body part of the target object to be consistent with that of the reference object, the body part of the target object can present the same reference action.
  • in some embodiments, the motion information of the pixels that need to be moved in the target image can be calculated according to the difference between the position of each key point of the body part of the target object and the position of the corresponding key point of the body part of the reference object.
  • for example, the position of each key point of the body part of the reference object in the driving reference image and the position of each key point of the body part of the target object in the target image can be input into a pre-trained neural network, and the motion information of the pixels in the target image can be obtained as output.
  • the motion information of each pixel may be the displacement of the pixel, and the displacement may be represented by components along the two dimensions of the horizontal axis and the vertical axis.
  • the motion information of each pixel may include optical flow information.
  • the motion information of each pixel can be represented by a two-dimensional vector, and the motion information of multiple pixels in the target image can be represented by a matrix.
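  • As a minimal sketch of this representation (the 4×4 image size and the displacement values are made up for illustration), the motion information of all pixels can be stored as an H×W×2 array of two-dimensional vectors:

```python
import numpy as np

H, W = 4, 4  # tiny image, purely for illustration
# Per-pixel motion information: a 2-D vector (dx, dy) for every pixel,
# stored as an H x W x 2 array -- a dense matrix of displacements.
motion = np.zeros((H, W, 2), dtype=np.float32)
motion[1:3, 1:3] = [0.5, -1.0]  # pixels in a local region move right and up

# The motion vector of the pixel at row 2, column 2:
print(motion[2, 2])  # -> [ 0.5 -1. ]
```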
  • in step 106, a plurality of pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
  • the body part of the target object in the driving effect image presents the reference motion.
  • when the motion information includes the displacements of pixels, multiple pixels in the target image are moved according to their respective displacements, so as to realize the deformation processing of the target image and obtain a driving effect image in which the body part of the target object presents the reference action.
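  • The deformation step can be sketched as a backward warp, where each output pixel samples the source pixel indicated by its displacement; this is a minimal nearest-neighbour illustration under assumed conventions, not the exact deformation used by the method:

```python
import numpy as np

def warp_image(image, motion):
    """Deform an image by moving each pixel according to a per-pixel
    displacement field of (dx, dy) vectors (nearest-neighbour sketch)."""
    H, W = image.shape[:2]
    out = np.zeros_like(image)
    # Backward mapping: for every output pixel, sample the source pixel
    # that the motion field says it moved from.
    for y in range(H):
        for x in range(W):
            dx, dy = motion[y, x]
            sy = int(round(y - dy))
            sx = int(round(x - dx))
            if 0 <= sy < H and 0 <= sx < W:
                out[y, x] = image[sy, sx]
    return out

# Shift a single bright pixel one pixel to the right.
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255
flow = np.zeros((5, 5, 2), dtype=np.float32)
flow[:, :, 0] = 1.0  # dx = 1 everywhere: every pixel moves right
warped = warp_image(img, flow)
print(np.argwhere(warped == 255))  # -> [[2 3]]
```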
  • the image driving method provided by the embodiments of the present disclosure adjusts the pixels in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so as to directly deform the target image; this makes the body part in the target image present the same action as the body part in the driving reference image.
  • in addition, a single driving reference image of the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • FIG. 2A is a flowchart of an image driving method according to at least one embodiment of the present disclosure
  • FIG. 3 illustrates a schematic structural diagram of an image driving model for implementing the image driving method of this embodiment.
  • the image driving model in Fig. 3 includes a body part key point detection module 31, a motion module 32 and a coarse deformation module 33.
  • the processing flow of the image driving method is described below with reference to FIG. 2A and FIG. 3 .
  • the processing flow of the image driving method includes step 202 to step 208 .
  • in step 202, a target image and a driving reference image are acquired.
  • the obtained target image may be as shown in FIG. 2B, and the target object is a cartoon character;
  • the obtained driving reference image may be as shown in Fig. 2C, where the reference object is a real person; it can be seen from the driving reference image that the reference actions presented by the reference object are: tilting the head to the left, looking up to the left, and opening the mouth wide.
  • the target image and the driving reference image can be input to the body part key point detection module 31 in the image driving model.
  • when the body part is a face, the body part key point detection module 31 is a face key point detection module; when the body part is a limb, the body part key point detection module 31 is a limb key point detection module; when the body part is a hand, the body part key point detection module 31 is a hand key point detection module.
  • in step 204, the first positions respectively corresponding to the key points of the body part of the target object in the target image are identified, and the second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image are identified.
  • identifying the first positions respectively corresponding to the key points of the body part of the target object in the target image may include extracting key points of the body part from the target image to obtain the first positions respectively corresponding to the key points of the body part of the target object; identifying the second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image may include extracting key points of the body part from the driving reference image to obtain the second positions respectively corresponding to the key points of the body part of the reference object.
  • the same extraction method can be used to extract key points of body parts from the target image and the driving reference image. This embodiment does not limit the specific manner of extracting key points of body parts.
  • a two-dimensional coordinate can be used to represent the position of the key point, which is recorded as the first position; for each key point of the body part of the reference object, a two-dimensional coordinate can be used The dimensional coordinates represent the position of the key point, which is recorded as the second position.
  • the number of key points of the body part of the target object is the same as the number of key points of the body part of the reference object, and there is a one-to-one correspondence. Since the body part of the target object exhibits different actions than the body part of the reference object, any first position may be different from the second position corresponding to the first position.
  • for example, when the key point m1 of the target object is located at the right corner of the mouth, the corresponding key point n1 of the reference object is also located at the right corner of the mouth.
  • the first position corresponding to m1 may be the coordinate (x1, y1) of m1
  • the second position corresponding to n1 may be the coordinate (x2, y2) of n1.
  • a body part key point detection network (Landmark Detector) can be used for body part key point extraction.
  • for example, a body part key point detection network is pre-trained; the target image and the driving reference image can then be sequentially input to the body part key point detection network, which outputs the first positions respectively corresponding to the detected key points of the body part of the target object and the second positions respectively corresponding to the detected key points of the body part of the reference object.
  • the extracted body part key points may follow a user-defined body part key point mode, or a body part key point mode automatically learned by the body part key point detection network during pre-training; when training the body part key point detection network, the number of body part key points that the network needs to learn can be set, and the number can be a specific value or a certain range.
  • the body part key point detection network can learn the key points of different patterns of body parts. When the trained body part key point detection network is applied, the key points of the corresponding body parts are extracted according to the body part key point patterns learned by it.
  • FIG. 2F shows a face key point pattern automatically learned by the facial key point detection network after training.
  • the neural network model used by the body part key point detection network can include models such as DAN (Deep Alignment Network), DCNN (Deep Convolutional Neural Network) or TCNN (Tweaked Convolutional Neural Network).
  • the body part key point detection network may include a self-supervised learning model, an unsupervised learning model or a supervised learning model.
  • the body part key point detection module 31 in the image driving model can be a body part key point detection network, which performs body part key point extraction on the target object in the target image and the reference object in the driving reference image respectively, outputs the first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object, and inputs the first positions and the second positions to the motion module 32.
  • in step 206, based on the correspondence between each of the first positions and each of the second positions, motion information of a plurality of pixels in the target image is determined.
  • the motion displayed by the body part of the target object can be adjusted to the reference motion by using the motion information.
  • the second positions corresponding to the key points of the body part of the reference object constitute a position combination, referred to here as the second position combination.
  • the first positions corresponding to the key points of the body part of the target object also constitute a position combination, referred to here as the first position combination.
  • the motion information comprises the displacements by which multiple pixels in the target image respectively move when the first position combination is adjusted to the second position combination.
  • the motion optical flow information may be calculated first according to the corresponding relationship between each first position and each second position.
  • the motion optical flow information may include optical flow information when pixels in the local area where the body part is located in the target image move respectively, and the calculation here may use a motion optical flow estimation method.
  • the Lucas-Kanade algorithm based on Taylor expansion can be used to calculate the motion optical flow information
  • the neural network based on deep learning such as FlowNet and FlowNet2.0 can also be used to calculate the motion optical flow information, and the specific implementation is not limited to this.
  • the four key points of the body part of the target object are the right mouth corner m1 , the left mouth corner m2 , the right eye m3 and the left eye m4 .
  • the four key points of the body part of the reference object are right mouth corner n1 , left mouth corner n2 , right eye n3 and left eye n4 .
  • m1 corresponds to n1
  • m2 corresponds to n2
  • m3 corresponds to n3
  • m4 corresponds to n4.
  • the first position used is the coordinate information of m1 , m2 , m3 and m4
  • the second position used is the coordinate information of n1 , n2 , n3 and n4 .
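  • Using the example key points m1..m4 and n1..n4 above, the following sketch computes sparse displacements from the corresponding first and second positions and densifies them by inverse-distance weighting. This is an illustrative stand-in for the motion estimation described in the text (not the Lucas-Kanade or FlowNet methods it mentions), and all coordinate values are made up:

```python
import numpy as np

# Hypothetical corresponding key points as (x, y) coordinates:
# m1..m4 from the target image, n1..n4 from the driving reference image.
first_positions = np.array([[30, 60], [50, 60], [30, 30], [50, 30]], float)   # m1..m4
second_positions = np.array([[32, 64], [48, 64], [30, 28], [50, 28]], float)  # n1..n4

# Sparse motion at each key point: the displacement (dx, dy) needed to
# move each target key point onto its corresponding reference position.
sparse_flow = second_positions - first_positions

def dense_flow(H, W, points, flows, eps=1e-6):
    """Densify sparse key-point displacements into a per-pixel flow field
    by inverse-distance weighting."""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)           # (H, W, 2) of (x, y)
    d = np.linalg.norm(grid[:, :, None, :] - points, axis=-1)  # (H, W, K)
    w = 1.0 / (d + eps)                                        # closer key points weigh more
    w /= w.sum(axis=-1, keepdims=True)
    return w @ flows                                           # (H, W, 2)

flow = dense_flow(90, 90, first_positions, sparse_flow)
```

At a key point itself the densified flow reproduces that key point's sparse displacement almost exactly, and it blends smoothly between key points elsewhere.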
  • the final motion information of multiple pixels in the target image can be calculated by combining the motion optical flow information with the target image.
  • a predictive neural network can be pre-trained, the input of the predictive neural network is the target image and motion optical flow information, and the output of the predictive neural network is the motion information of multiple pixels in the target image.
  • the motion module 32 includes a motion information calculation unit 320, wherein an algorithm 3201 is used to calculate motion optical flow information according to the correspondence between each first position and each second position , the motion optical flow information is the optical flow information when the pixels in the local area where the body parts are located in the target image move respectively.
  • the algorithm 3201 may include a motion optical flow estimation method, for example, the Lucas-Kanade algorithm based on Taylor expansion, and algorithms based on deep learning neural networks such as FlowNet and FlowNet2.0.
  • the prediction neural network 3202 is used to output motion information of multiple pixels in the target image according to the input motion optical flow information and the target image.
  • in step 208, a plurality of pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
  • the body part of the target object in the driving effect image exhibits the reference motion.
  • the motion information includes the displacement of each pixel, and the pixels in the target image are moved according to the magnitude and direction indicated by their respective displacements to obtain the adjusted target image, that is, the coarsely deformed image.
  • the final target image is used as the driving effect image, which contains the target object whose body part presents the reference action.
  • the doll in Figure 2G exhibits motions consistent with the reference motions driving the body parts in the reference image in Figure 2C: head tilted to the left, eyes looking up and to the left, and mouth wide open.
  • the motion information and the target image are input into the rough deformation module 33 , and the driving effect image after adjusting multiple pixels in the target image is output.
  • in some embodiments, the adjusted target image may be further optimized after the above adjustment, so as to improve the display effect of the image, such as removing noise, repairing missing content, adjusting brightness, and enhancing color.
  • the image driving method provided by the embodiments of the present disclosure extracts body part key points from the target image and the driving reference image respectively to obtain the position combination of the key points of the body part of the target object and the position combination of the key points of the body part of the reference object; since the position combination of body part key points encodes the action of the body part, the motion information of multiple pixels in the target image can be obtained through the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, and the pixels of the target image can then be moved according to the displacement of each pixel in the motion information. This makes the action presented by the body part of the target object consistent with the reference action presented by the body part of the reference object, and finally yields a driving effect image in which the body part of the target object presents the reference action. This not only makes the operation of driving the target object simple and the processing efficiency high, but also improves the accuracy of the driving effect.
  • when training the model, the target sample image and the driving sample image used for training may be images of any object.
  • the image driving model trained in this embodiment is not limited to driving a specific target object, and any target object can be driven by the image driving model to obtain a corresponding driving effect image.
  • the method does not rely on the model learning features of the body parts of the sample object before driving, so there is no need to shoot a video of the sample object, and the method can be used for a wider range of target objects.
  • the target sample image includes the body part of the sample object exhibiting the first action
  • the driving sample image includes the body part of the sample object exhibiting the second action.
  • the images input to the body part key point detection module 31 are the target sample image and the driving sample image,
  • the image output by the coarse deformation module 33 is the training image.
  • the target object in the target sample image used in training and the reference object in the driving sample image are the same object
  • the target sample image includes the body part of the sample object that presents the first action
  • the driving sample image includes the body part of the sample object presenting the second action.
  • Both the training image generated by performing image driving on the target sample image and the driving sample image include the body part of the sample object, but the third action presented by the body part of the sample object in the training image is likely to differ from the second action presented by the body part of the sample object in the driving sample image.
  • For example, the head may not be turned to a sufficient angle, or the mouth shape may be inconsistent.
  • That is, the positions of pixels in the training image and the driving sample image differ.
  • The network loss is calculated from the difference between the training image and the driving sample image, and the network parameter values in the image driving model are adjusted according to the network loss.
  • The network parameter values in the image driving model may include those of the body part key point detection network in the body part key point detection module 31 and those of the prediction neural network 3202 in the motion module 32.
  • network parameter values in the image-driven model can be adjusted by backpropagation.
  • When an end condition is met, the network training ends; the end condition may include the iteration count reaching a certain number, or the loss value falling below a certain threshold.
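The end condition above (iteration limit or loss below a threshold) can be sketched as a training-loop skeleton. Here `step_fn` stands in for one forward/backward pass and parameter update of the image driving model; the function names, limits, and the toy loss are illustrative assumptions.

```python
def train(step_fn, max_iters=1000, loss_threshold=1e-3):
    """Run training steps until the loss (e.g. the difference between the
    training image and the driving sample image) drops below the threshold
    or the iteration limit is reached. Returns the iterations executed."""
    iters = 0
    for _ in range(max_iters):
        loss = step_fn()       # one forward/backward pass + parameter update
        iters += 1
        if loss < loss_threshold:
            break              # loss end condition met
    return iters

# Toy stand-in for an optimisation step: the loss halves every iteration.
state = {"loss": 1.0}
def toy_step():
    state["loss"] *= 0.5
    return state["loss"]

iters = train(toy_step)        # 0.5**10 < 1e-3, so this stops after 10 steps
```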
  • the target sample image and driving sample image used in the training process can be images of any sample object, so there is no need to shoot a video of the target object for training.
  • Driving with the trained model does not rely on the model having learned the body part characteristics of the target object; instead, the target image is deformed directly at the pixel level according to the correspondence between body part key points. The trained model is therefore versatile, and there is no need to collect training samples again to retrain the model for different target objects, which improves model training efficiency.
  • the difficulty of training is also reduced, and the sample collection method for model training is simplified.
  • a video of the driven target object synchronized with the reference motion of the body part of the reference object in the driving video can be obtained.
  • One frame of target image and multiple frames of driving reference images in the driving video can be acquired first; the target image includes the body part of the target object, the multiple frames of driving reference images include the body part of the same reference object, and the body part of the reference object presents different reference actions in different driving reference images.
  • the reference action presented by the body parts in the first driving reference image is to open the left eye and close the right eye;
  • the reference action presented by the body parts in the second frame driving reference image is to close the left eye and open the right eye;
  • the reference action presented by the body parts in the third frame driving reference image is to open the left and right eyes.
  • one frame of the driving reference image is acquired from the multiple frames of the driving reference image to perform driving processing on the target image, and the processing order of the multiple frames of the driving reference image is not limited here.
  • Each frame of the driving reference image may be sequentially processed according to the sequence of the driving reference image in the driving video, or multiple frames of driving reference images may be processed in parallel.
  • In step 204, body part key point extraction needs to be performed on the target image only once to obtain the first positions corresponding to each key point of the target object's body part in the target image.
  • In contrast, body part key points need to be extracted from each frame of the driving reference image to obtain the second positions corresponding to each key point of the reference object's body part in that frame.
  • the motion information of multiple pixels in the target image is respectively determined by using each frame of the driving reference image, so as to obtain a multi-frame driving effect image of the target object.
  • The number of the multi-frame driving effect images is the same as that of the multi-frame driving reference images, and the driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images. A target video is generated based on the multi-frame driving effect images, and the body part actions of the target object in the target video are consistent with the body part actions of the reference object in the driving video.
  • For example, three frames of driving effect images of the target object are obtained after processing the three frames of driving reference images in the driving video, and the three frames of driving effect images respectively present the same body part actions as the corresponding three frames of driving reference images.
  • the three frames of driving effect images are synthesized into the target video according to the order of their corresponding driving reference images.
  • The three frames of images in the target video thus show the target object opening the left eye and closing the right eye, closing the left eye and opening the right eye, and opening both eyes.
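The multi-frame flow above (target key points extracted once, each driving reference frame processed in turn, and the effect images kept in frame order) can be sketched as follows. Every function passed in is a hypothetical stand-in, since the disclosure does not define these interfaces.

```python
def drive_video(target_image, driving_frames, detect_keypoints,
                compute_motion, warp):
    """Produce one driving effect image per driving reference frame."""
    target_kps = detect_keypoints(target_image)   # extracted only once
    effect_frames = []
    for frame in driving_frames:                  # sequential frame order
        ref_kps = detect_keypoints(frame)         # per-frame second positions
        motion = compute_motion(target_kps, ref_kps)
        effect_frames.append(warp(target_image, motion))
    return effect_frames

# Trivial stand-ins just to show the call shape and the one-to-one frame count.
frames = ["frame1", "frame2", "frame3"]
video = drive_video("target", frames,
                    detect_keypoints=lambda img: ("kps", img),
                    compute_motion=lambda t, r: (t, r),
                    warp=lambda img, m: (img, m))
```

Frames could also be processed in parallel, as the text notes; the loop above is only one of the two orders the disclosure permits.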
  • FIG. 5 is a flowchart of an image driving method shown in at least one embodiment of the present disclosure
  • FIG. 6 illustrates a schematic structural diagram of an image driving model for implementing the image driving method of this embodiment.
  • the image-driven model is based on the image-driven model shown in FIG. 3
  • an image generation module 34 is added.
  • The image generation module 34 may be an image generation network, including an encoding network 341, a feature deformation unit 342 and a decoding network 343.
  • FIG. 6 is only an exemplary network structure, and is not limited thereto in specific implementation.
  • the processing flow of the image driving method is described below with reference to FIG. 5 and FIG. 6 , wherein the steps repeated with the above embodiment will not be repeated, and the processing flow of the image driving method includes step 502 to step 514 .
  • step 502 a target image and a driving reference image are acquired.
  • the target image and the driving reference image can be input to the body part key point detection module 31 in the image driving model.
  • In step 504, first positions corresponding to each key point of the body part of the target object are identified in the target image, and second positions corresponding to each key point of the body part of the reference object are identified in the driving reference image.
  • The body part key point detection module 31 is used to extract body part key points from the target object in the target image and from the reference object in the driving reference image respectively, and to output the first positions corresponding to the key points of the target object's body part and the second positions corresponding to the key points of the reference object's body part; the first positions and the second positions are input to the motion module 32.
  • step 506 based on the correspondence between each of the first positions and each of the second positions, motion information of a plurality of pixels in the target image is determined.
  • the first position, the second position and the target image are input into the motion module 32 , and the motion information of multiple pixels in the target image is output.
  • step 508 a plurality of pixel points in the target image are adjusted according to the motion information.
  • each pixel point of the target image is moved to deform the target image, so that the body part of the target object in the target image presents a reference action.
  • the adjusted target image, motion information and image generation network can be used to generate a driving effect image.
  • The image generation network may be a pre-trained neural network, which can perform detail processing on the adjusted target image with the help of the motion information, such as removing noise, repairing defective content, adjusting brightness and enhancing color.
  • step 510 feature extraction is performed on the adjusted target image by using the encoding network in the image generation network to obtain a feature map.
  • an encoding network can be used to perform feature extraction on the adjusted target image to obtain a feature map, or it can also be extracted in other ways.
  • the coarsely deformed image is input to the encoding network 341 in the image generation module 34 to obtain a feature map.
  • step 512 based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map.
  • the feature map can be adjusted in the same way as adjusting the target image, and the pixel points corresponding to the target image in the feature map are moved according to the displacement indicated by the motion information to obtain the adjusted feature map.
  • the feature map and motion information are input into the feature deformation unit 342 in the image generation module 34 , the pixels in the feature map are adjusted, and the adjusted feature map is output.
  • The motion information may be used to determine a mask corresponding to the target image, and the adjusted target image, the motion information, the mask and the image generation network are used to generate the driving effect image.
  • the mask is used to identify the movement degree of each pixel in the process of adjusting the multiple pixels in the target image according to the motion information.
  • the mask can indicate to the image generation network the image area in the adjusted target image that needs to be optimized and adjusted. For each pixel in the mask, pixels that need to be adjusted and pixels that do not need to be adjusted can be distinguished using different identifiers.
  • the motion information is input into the mask generation network to obtain a mask corresponding to the target image, and the mask may be an image of the same size as the target image.
  • the mask generation network may include a binary classifier, which divides each pixel into a pixel with a large degree of movement or a pixel with a small degree of movement according to the motion information.
  • When the confidence output by the binary classifier that a certain pixel belongs to the large-movement class reaches a preset threshold, the pixel is determined to be a pixel with a large degree of movement; otherwise, when the confidence does not reach the preset threshold, the pixel is determined to be a pixel with a small degree of movement.
  • In the mask, pixels with a large degree of movement may be identified as 1, and pixels with a small degree of movement may be identified as 0.
  • For example, if the pixels with the largest degree of movement among the pixels of the target image are those of the mouth area, the mask corresponding to the target image may be an image in which the pixels of the mouth area are marked 1 and the pixels of other areas are marked 0.
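The confidence-thresholding step above can be sketched in a few lines; the confidence values and the 0.5 threshold here are illustrative assumptions, as the disclosure only says the threshold is preset.

```python
import numpy as np

def confidence_to_mask(confidence, threshold=0.5):
    """Mark pixels whose large-movement confidence reaches the preset
    threshold as 1, and all other pixels as 0."""
    return (confidence >= threshold).astype(np.float32)

# Toy per-pixel confidences from the hypothetical binary classifier.
confidence = np.array([[0.9, 0.2],
                       [0.6, 0.4]])
mask = confidence_to_mask(confidence)
```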
  • the pixels in the feature map can be adjusted based on the motion information and the mask.
  • The pixels in the feature map corresponding to the region marked 1 in the mask can be adjusted so that the subsequent decoding process can complete and generate details for the parts of the target image with large deformation, while the areas with little deformation are retained.
  • Adjusted feature map = feature map × mask + (feature map deformed by motion information) × (1 − mask)   (1)
  • the feature map deformed by the motion information is a deformed feature map obtained by moving the pixel points in the feature map corresponding to the target image according to the displacement indicated by the motion information.
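Equation (1) can be transcribed directly. In this sketch, `feat` is the encoder's feature map, `warped` is the feature map deformed by the motion information, and the toy values are illustrative assumptions.

```python
import numpy as np

def blend_features(feat, warped, mask):
    # Eq. (1): adjusted = feature map * mask
    #          + (feature map deformed by motion information) * (1 - mask)
    return feat * mask + warped * (1.0 - mask)

feat = np.full((2, 2), 3.0)     # encoder feature map
warped = np.full((2, 2), 7.0)   # feature map deformed by motion information
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
adjusted = blend_features(feat, warped, mask)
```

Where the mask is 1, the encoder feature is kept; where it is 0, the deformed feature is used, matching the equation term by term.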
  • the motion module 32 includes a mask generation network 321 in addition to the motion information calculation unit 320
  • The motion information output by the motion information calculation unit 320 is input into the mask generation network 321 to obtain the mask corresponding to the target image.
  • Input the mask, feature map and motion information into the feature deformation unit 342 in the image generation module 34, and the feature deformation unit 342 adjusts the pixels in the feature map based on the motion information and the mask, and outputs the adjusted feature map.
  • step 514 the adjusted feature map is decoded by the decoding network in the image generation network to obtain the driving effect image.
  • A decoding network can be used to decode the adjusted feature map to obtain the driving effect image, or the decoding can also be performed in other ways.
  • the encoding network and decoding network in this embodiment can use a convolutional neural network.
  • the adjusted feature map is input to the decoding network 343 in the image generation module 34 , and the driving effect image of the body part of the target object presenting the reference action is output.
  • the presented image effect is closer to the real action of the target object.
  • The image driving method provided by the embodiments of the present disclosure can drive the target object using a driving reference image of a single reference object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object. While making the body part of the target object in the target image present the same action as the body part of the reference object in the driving reference image, detail processing can be performed on the parts of the target image with large deformation, so that the motion of the target object's body part is more realistic and natural.
  • the training method for the image-driven model with the structure shown in FIG. 3 in the above embodiment can still be used.
  • The adjusted network parameter values of the image driving model may include at least one of the following: the body part key point detection network in the body part key point detection module 31, the prediction neural network 3202 and the mask generation network 321 in the motion module 32, and the encoding network 341 and the decoding network 343 in the image generation module 34.
  • FIG. 8 is a block diagram of an image driving device according to at least one embodiment of the present disclosure, and the device includes: an image acquisition module 81 , a pixel movement module 82 and an image adjustment module 83 .
  • the image acquisition module 81 is configured to acquire a target image and a driving reference image, the target image includes body parts of the target object, and the driving reference image includes body parts of the reference object exhibiting the reference action.
  • a pixel motion module 82, configured to determine the motion information of multiple pixel points in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, where the motion information is used to adjust the action of the body part of the target object to the reference action.
  • the image adjustment module 83 is configured to adjust a plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body parts of the target object present the reference motion.
  • The image driving device provided by the embodiments of the present disclosure adjusts the pixel points in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, directly deforming the target image so that it presents the same body part action as the driving reference image.
  • The driving reference image of a single reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • the pixel motion module 82 is configured to: identify the first positions corresponding to each key point of the body part of the target object in the target image; identify the second positions corresponding to each key point of the body part of the reference object in the driving reference image; and determine the motion information of multiple pixel points in the target image based on the correspondence between the first positions and the second positions.
  • the motion information includes the displacement corresponding to each pixel among the plurality of pixels; the image adjustment module 83 is further configured to move the plurality of pixels in the target image according to their respective displacements.
  • the image adjustment module 83 is configured to: adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, the adjusted target image being the driving effect image; or, adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and generate the driving effect image using the adjusted target image, the motion information and an image generation network.
  • the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; Based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map; the decoding network is used to decode the adjusted feature map to obtain the driving effect image.
  • the device further includes a mask generation module 84, configured to use the motion information to determine a mask corresponding to the target image, where the mask is used to identify the degree of movement of each pixel in the process of adjusting multiple pixels in the target image according to the motion information; the image adjustment module 83 is further configured to generate the driving effect image using the adjusted target image, the motion information, the mask and the image generation network.
  • the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; adjust the pixels in the feature map based on the motion information and the mask to obtain an adjusted feature map; and use the decoding network to decode the adjusted feature map to obtain the driving effect image.
  • the functions of the apparatus are performed by an image driving model trained from target sample images and driving sample images, where a target sample image includes the body part of a sample object presenting the first action and a driving sample image includes the body part of the sample object presenting the second action; during training, the target sample image and the driving sample image are input into an initial image driving model, the initial image driving model outputs a training image presenting the body part of the sample object in a third action, the initial image driving model is adjusted through the difference between the training image and the driving sample image, and the image driving model is obtained after training.
  • the image acquisition module 81 is further configured to: acquire multiple frames of driving reference images in the driving video, wherein the multiple frames of driving reference images include body parts of the same reference object, and different driving reference images include The reference movements presented by the body parts of the reference subject are different; one frame of the driving reference image is obtained from the multiple frames of the driving reference image.
  • the image adjustment module 83 is further configured to: in response to obtaining multi-frame driving effect images of the target object based on the multi-frame driving reference images, generate a target video based on the multi-frame driving effect images, where the motion of the body part of the target object in the target video is consistent with the motion of the body part of the reference object in the driving video, the number of the multi-frame driving effect images is the same as that of the multi-frame driving reference images, and the multi-frame driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images.
  • the image driving method in at least one embodiment of the present disclosure may be performed by an electronic device, for example by a terminal device, a server or other processing device, where the terminal device may include a user device, a mobile device, a terminal, a cellular phone, a cordless phone, a personal digital assistant, a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the image driving method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • An embodiment of the present disclosure also provides an electronic device. As shown in FIG.
  • the device 12 is configured to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
  • An embodiment of the present disclosure further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, implements the image driving method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • Since the device embodiment basically corresponds to the method embodiment, for related parts, refer to the description of the method embodiment.
  • The device embodiments described above are only illustrative; the modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification, which can be understood and implemented by those skilled in the art without creative effort.

Abstract

Provided are an image driving method and apparatus, a device and a medium. The method comprises: acquiring a target image and a driving reference image, the target image comprising a body part of a target subject, and the driving reference image comprising a body part of a reference subject presenting a reference action; on the basis of a correspondence between key points of the body part of the target subject and key points of the body part of the reference subject, determining motion information of a plurality of pixel points in the target image, the motion information being used for adjusting an action of the body part of the target subject to be the reference action; and adjusting the plurality of pixel points in the target image according to the motion information so as to obtain a driving effect image, the body part of the target subject in the driving effect image presenting the reference action.

Description

An image driving method, apparatus, device and medium
Cross-Reference to Related Application
This application claims priority to Chinese Patent Application No. 202210147579.9, filed on February 17, 2022, the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of computer vision, and in particular to an image driving method, apparatus, device and medium.
Background
In popular computer vision applications such as virtual conferences and live photos, image driving technology is needed to drive the body parts of a target object to produce corresponding actions. For example, facial image driving means that, given a facial video, the facial movements in the video can be transferred onto a facial image specified by the user. However, driving a specific user's face requires obtaining a video of that user, which makes the driving operation cumbersome and the processing efficiency low.
Summary
In a first aspect, an image driving method is provided. The method includes: acquiring a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; determining motion information of multiple pixels in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and adjusting the multiple pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
In a second aspect, an image driving apparatus is provided. The apparatus includes: an image acquisition module, configured to acquire a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; a pixel motion module, configured to determine motion information of multiple pixels in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and an image adjustment module, configured to adjust the multiple pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
In a third aspect, an electronic device is provided. The device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the image driving method described in any embodiment of the present disclosure.
In a fifth aspect, a computer program product is provided. The product includes a computer program/instructions that, when executed by a processor, implement the image driving method described in any embodiment of the present disclosure.
In the image driving method provided by the embodiments of the present disclosure, the pixels in the target image are adjusted according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, directly deforming the target image so that it presents the same body part action as the driving reference image. Without uploading a video of the target object, a single driving reference image including the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of the present disclosure, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure;
FIG. 1B illustrates a key point mode according to at least one embodiment of the present disclosure;
FIG. 1C illustrates another key point mode according to at least one embodiment of the present disclosure;
FIG. 2A is a flowchart of another image driving method according to at least one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a target image according to at least one embodiment of the present disclosure;
FIG. 2C is a schematic diagram of a driving reference image according to at least one embodiment of the present disclosure;
FIG. 2D is a schematic diagram of key points of a body part of a target object according to at least one embodiment of the present disclosure;
FIG. 2E is a schematic diagram of key points of a body part of a reference object according to at least one embodiment of the present disclosure;
FIG. 2F illustrates a facial key point mode according to at least one embodiment of the present disclosure;
FIG. 2G illustrates a driving effect image according to at least one embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an image driving model according to at least one embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a motion module in an image driving model according to at least one embodiment of the present disclosure;
FIG. 5 is a flowchart of yet another image driving method according to at least one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of another image driving model according to at least one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of yet another image driving model according to at least one embodiment of the present disclosure;
FIG. 8 is a block diagram of an image driving apparatus according to at least one embodiment of the present disclosure;
FIG. 9 is a block diagram of another image driving apparatus according to at least one embodiment of the present disclosure;
FIG. 10 is a schematic diagram of the hardware structure of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. As used in this specification and the appended claims, the singular forms "a", "said", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure. The method may include steps 102 to 106.
In step 102, a target image and a driving reference image are acquired.
The target image includes a body part of a target object, and the driving reference image includes a body part of a reference object presenting a reference action. The method of this embodiment aims to transfer the reference action of the body part of the reference object to the body part of the target object, so that the body part of the target object presents the reference action.
The target object is the object to be driven. This embodiment does not limit the scope of the target object: the target object may be a real person, an animation character, a cartoon figure, or a doll in the target image. The body part of the target object in the target image may be any single body part such as the head, face, or limbs (e.g., hands, legs, torso), or a combination of at least two body parts.
This embodiment likewise does not limit the scope of the reference object, which may be a real person, an animation character, a cartoon figure, or a doll in the driving reference image. The reference action presented by the reference object in the driving reference image may be any action. For example, when the body part is the face, the reference action may be any facial expression, such as frowning, turning the head, or opening the mouth wide; when the body part is a limb, the reference action may be any posture or gesture, such as walking, greeting, or raising a hand.
For example, to transfer the mouth-opening action of a real person to an animation character, the real person may be called the reference object and the animation character the target object. Correspondingly, the image containing the real person's mouth is the driving reference image, and the image containing the animation character's mouth is the target image.
This embodiment does not limit the manner in which the target image and the driving reference image are acquired. The target image may be a single image specified or uploaded by a user. The driving reference image may be a single image or multiple images specified or uploaded by a user; alternatively, a driving video specified or uploaded by a user may be acquired first, and multiple driving reference images obtained from that video.
In step 104, motion information of multiple pixels in the target image is determined based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object.
The motion information is used to adjust the action of the body part of the target object to the reference action. The key points of a body part describe the semantic information and position information of the body part, where the semantic information includes the category of the body part. A key point may be located anywhere in the region of the body part or in the region around it.
For example, when the body part is the face, the key points of the body part are facial key points, which may include points on the facial skin or facial features, points on the contour lines of the face and of the facial features, points on the hair or neck, and points on worn objects such as glasses, earrings, and hats.
When the body part is a limb, the key points of the body part are limb key points, which may include skeletal points such as the hand joints, elbow joints, crown of the head, ankle joints, and tailbone, and may also include points on objects such as clothes and backpacks.
For the same body part, the positions of its key points may differ when the action of the body part differs; different combinations of the positions of the key points therefore represent different actions of the body part.
This embodiment does not limit the number or spatial distribution of the body-part key points used; key points of different numbers and distributions form different key point modes. Taking the face as an example, the 83 facial key points shown in FIG. 1B are located on the contour lines of the face, eyes, eyebrows, nose, and mouth; this distribution may be called mode 1. The 49 facial key points shown in FIG. 1C are located on the contour lines of the facial features; this distribution may be called mode 2.
In this embodiment, the same mode is used to describe the key points of the body part in the target object and in the reference object.
For example, when mode 1 is used, the facial key points of the target object include the 83 points shown in FIG. 1B, and the facial key points of the reference object likewise include the 83 points shown in FIG. 1B.
Similarly, when mode 2 is used, the facial key points of the target object include the 49 points on the facial-feature contour lines shown in FIG. 1C, and so do the facial key points of the reference object.
In this step, the key points of the body part of the target object and the key points of the body part of the reference object may first be extracted separately. The number and definitions of the key points of the body part are the same in both objects, and the key points of the body part in the target object correspond one-to-one with those in the reference object.
The combination of key-point positions of the body part of the reference object reflects the reference action presented by that body part in the driving reference image. By adjusting the combination of key-point positions of the body part of the target object to be consistent with that of the reference object, the body part of the target object can be made to present the same reference action.
In some embodiments, the motion information of the multiple pixels to be moved in the target image may be calculated from the differences between the positions of the key points of the body part in the target object and the positions of the corresponding key points in the reference object. For example, the key-point positions of the body part of the reference object in the driving reference image and the key-point positions of the body part of the target object in the target image may be input into a pre-trained neural network, which outputs the motion information of the multiple pixels in the target image.
The motion information of each pixel may be the displacement of that pixel, expressed as motion components along the horizontal and vertical axes. Exemplarily, the motion information of each pixel may include optical flow information. The motion information of a single pixel can be represented by a two-dimensional vector, and the motion information of multiple pixels in the target image can be represented by a matrix.
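The vector-and-matrix representation described above can be sketched concretely: the motion information of the whole image is a dense displacement field, one two-dimensional vector per pixel, stored together as an H×W×2 array. A minimal illustration (the image size and the region that moves are invented for the example, not taken from the patent):

```python
import numpy as np

# Motion information for a hypothetical 4x5 target image:
# flow[y, x] = (dx, dy), the displacement of the pixel at (x, y).
h, w = 4, 5
flow = np.zeros((h, w, 2), dtype=np.float32)

# Suppose the region in rows 2-3 (say, around the mouth) should
# shift one pixel right and one pixel down.
flow[2:4, :, 0] = 1.0  # dx component
flow[2:4, :, 1] = 1.0  # dy component

# The motion information of a single pixel is a 2D vector ...
print(flow[3, 2])      # [1. 1.]
# ... and the whole field is the matrix of per-pixel motion information.
print(flow.shape)      # (4, 5, 2)
```

Common optical-flow tools use the same per-pixel (dx, dy) layout, which is why a single array suffices to carry the motion information of every pixel.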
In step 106, multiple pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
In the driving effect image, the body part of the target object presents the reference action.
For example, when the motion information includes pixel displacements, the multiple pixels in the target image are moved according to their respective displacements, thereby deforming the target image and obtaining a driving effect image in which the body part of the target object presents the reference action.
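The deformation step above, moving every pixel by its own displacement, is commonly realized as backward warping: each output pixel samples the source image at the position its displacement points to. A rough sketch with nearest-neighbor sampling (a simplification; a practical deformation module would typically use bilinear sampling):

```python
import numpy as np

def warp(image, flow):
    """Backward-warp `image` by a dense displacement field.

    image: (H, W) array; flow: (H, W, 2) array where flow[y, x] is the
    (dx, dy) offset at which output pixel (x, y) samples the input.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

img = np.arange(9).reshape(3, 3)
# Shift the content one pixel to the left: each output pixel samples
# the source one column to its right (clipped at the border).
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0
print(warp(img, flow))  # first row becomes [1 2 2]
```

Clipping at the image border is one simple way to handle displacements that point outside the source image.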
In the image driving method provided by the embodiments of the present disclosure, pixels in the target image are adjusted according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so that the target image is directly deformed and presents the same body-part action as the driving reference image. A single driving reference image of the reference object is sufficient to drive the target object, without uploading any video of the target object. This simplifies the operation of driving the target object and effectively improves the processing efficiency of driving the target object.
FIG. 2A is a flowchart of another image driving method according to at least one embodiment of the present disclosure, and FIG. 3 illustrates the structure of an image driving model for implementing the image driving method of this embodiment. The image driving model in FIG. 3 includes a body-part key point detection module 31, a motion module 32, and a coarse deformation module 33.
It should be noted that the model shown in FIG. 3 is only an exemplary network structure, and specific implementations are not limited to it. The processing flow of the image driving method, which includes steps 202 to 208, is described below with reference to FIG. 2A and FIG. 3.
In step 202, a target image and a driving reference image are acquired.
Exemplarily, the acquired target image may be as shown in FIG. 2B, where the target object is a cartoon character; the acquired driving reference image may be as shown in FIG. 2C, where the reference object is a real person. The reference action presented by the reference object in the driving reference image is: tilting the head to the left, looking up and to the left, and opening the mouth wide.
As shown in FIG. 3, the target image and the driving reference image may be input into the body-part key point detection module 31 of the image driving model. When the body part is the face, module 31 is a facial key point detection module; when the body part is a limb, module 31 is a limb key point detection module; when the body part is a hand, module 31 is a hand key point detection module.
In step 204, first positions respectively corresponding to the key points of the body part of the target object in the target image are identified, and second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image are identified.
In this step, identifying the first positions may include performing body-part key point extraction on the target image to obtain the first positions respectively corresponding to the key points of the body part of the target object; identifying the second positions may include performing body-part key point extraction on the driving reference image to obtain the second positions respectively corresponding to the key points of the body part of the reference object. The same extraction method may be used for both the target image and the driving reference image. This embodiment does not limit the specific extraction method: the key points may be extracted by a neural network, or by other algorithms such as methods based on cascaded shape regression or component-based detection algorithms.
In this embodiment, the position of each key point of the body part of the target object may be represented by a two-dimensional coordinate, recorded as a first position; likewise, the position of each key point of the body part of the reference object may be represented by a two-dimensional coordinate, recorded as a second position.
The number of key points of the body part of the target object is the same as the number of key points of that body part of the reference object, with a one-to-one correspondence. Since the action presented by the body part of the target object differs from the action presented by the body part of the reference object, any first position may differ from its corresponding second position.
For example, taking the face as the body part, as shown in FIG. 2D, one key point m1 of the target object is located at the right corner of the mouth; as shown in FIG. 2E, the corresponding key point n1 of the reference object is likewise located at the right corner of the mouth. The first position corresponding to m1 may be the coordinate (x1, y1) of m1, and the second position corresponding to n1 may be the coordinate (x2, y2) of n1.
Exemplarily, a body-part key point detection network (landmark detector) may be used for key point extraction. A body-part key point detection network is pre-trained; the target image and the driving reference image may then be input into it in turn, and the network respectively outputs the detected first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object.
It should be noted that when a body-part key point detection network is used for extraction, the extracted key points may follow a user-defined key point mode, or a key point mode that the network learns automatically during pre-training. When training the body-part key point detection network, the number of key points it needs to learn can be set, either as a specific value or as a range. With different training methods, the network can learn different key point modes; once trained, the network extracts key points according to the mode it has learned.
As shown in FIG. 2F, FIG. 2F shows a facial key point mode automatically learned by a facial key point detection network through training.
The neural network model used by the body-part key point detection network may include DAN (Deep Alignment Network), DCNN (Deep Convolutional Neural Network), or TCNN (Tweaked Convolutional Neural Network), among others. The body-part key point detection network may be a self-supervised, unsupervised, or supervised learning model.
As shown in FIG. 3, the body-part key point detection module 31 of the image driving model may be such a key point detection network, used to perform body-part key point extraction on the target object in the target image and on the reference object in the driving reference image respectively. It outputs the first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object, and inputs the first positions and second positions into the motion module 32.
In step 206, motion information of multiple pixels in the target image is determined based on the correspondence between the first positions and the second positions.
The motion information can be used to adjust the action presented by the body part of the target object to the reference action. The second positions respectively corresponding to the key points of the body part of the reference object form a position combination, here called the second position combination; the first positions respectively corresponding to the key points of the body part of the target object likewise form a first position combination. The motion information comprises the displacements by which the multiple pixels in the target image respectively move when the first position combination is adjusted to the second position combination.
In this step, motion optical flow information may first be calculated from the correspondence between the first positions and the second positions. The motion optical flow information may include the optical flow of the pixels in the local region of the target image where the body part is located, and may be computed by motion optical flow estimation. For example, the Lucas-Kanade algorithm based on Taylor expansion may be used, as may deep-learning-based neural networks such as FlowNet and FlowNet 2.0; specific implementations are not limited to these.
Exemplarily, as shown in FIG. 2D, four key points of the body part of the target object are the right mouth corner m1, the left mouth corner m2, the right eye m3, and the left eye m4. As shown in FIG. 2E, four key points of the body part of the reference object are the right mouth corner n1, the left mouth corner n2, the right eye n3, and the left eye n4, where m1 corresponds to n1, m2 to n2, m3 to n3, and m4 to n4. When calculating the motion optical flow information, the first positions used are the coordinates of m1, m2, m3, and m4, and the second positions used are the coordinates of n1, n2, n3, and n4.
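Given the four corresponding pairs above, the displacement of each key point is simply the difference between its second position and its matching first position. A toy sketch (the coordinates of m1..m4 and n1..n4 are invented for illustration; the patent does not specify them):

```python
import numpy as np

# First positions: key points m1..m4 of the target object, as (x, y).
first = np.array([[60, 80], [40, 80], [58, 40], [42, 40]], dtype=float)
# Second positions: corresponding key points n1..n4 of the reference object.
second = np.array([[62, 86], [38, 84], [60, 38], [40, 36]], dtype=float)

# Displacement that carries each target key point onto its
# corresponding reference key point (m1 -> n1, ..., m4 -> n4).
kp_disp = second - first
print(kp_disp[0])  # displacement of the right mouth corner m1 -> n1: [2. 6.]
```

These sparse per-key-point displacements are the raw material from which the dense per-pixel motion information is then estimated.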
After the motion optical flow information is obtained, the final motion information of the multiple pixels in the target image can be calculated from the motion optical flow information together with the target image. For example, a prediction neural network may be pre-trained whose input is the target image and the motion optical flow information and whose output is the motion information of the multiple pixels in the target image.
As shown in FIG. 3, the first positions, the second positions, and the target image are input into the motion module 32, which outputs the motion information. In one example, as shown in FIG. 4, the motion module 32 includes a motion information calculation unit 320, in which algorithm 3201 calculates the motion optical flow information from the correspondence between the first positions and the second positions; the motion optical flow information is the optical flow of the pixels in the local region of the target image where the body part is located. Algorithm 3201 may be a motion optical flow estimation method, for example the Lucas-Kanade algorithm based on Taylor expansion, or a deep-learning-based neural network such as FlowNet or FlowNet 2.0. The prediction neural network 3202 outputs the motion information of the multiple pixels in the target image from the input motion optical flow information and the target image.
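The patent leaves the internals of algorithm 3201 and the prediction neural network 3202 unspecified. As a rough, self-contained stand-in for the motion module as a whole, the sketch below spreads sparse key-point displacements into a dense per-pixel field by inverse-distance weighting; an actual implementation would use a motion optical flow estimator and a trained prediction network instead:

```python
import numpy as np

def dense_motion(first_pts, second_pts, h, w, eps=1e-6):
    """Interpolate sparse key-point displacements to an (h, w, 2) field.

    first_pts / second_pts: (K, 2) arrays of (x, y) positions of
    corresponding key points. Inverse-distance weighting here is only a
    simple stand-in for the learned prediction network.
    """
    disp = second_pts - first_pts                       # (K, 2)
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(float)    # (h, w, 2)
    # Distance from every pixel to every key point: (h, w, K)
    d = np.linalg.norm(grid[:, :, None, :] - first_pts[None, None], axis=-1)
    wgt = 1.0 / (d + eps)
    wgt /= wgt.sum(axis=-1, keepdims=True)              # normalized weights
    return wgt @ disp                                   # (h, w, 2)

first = np.array([[1.0, 1.0], [3.0, 3.0]])
second = np.array([[2.0, 1.0], [3.0, 4.0]])
field = dense_motion(first, second, 5, 5)
print(field[1, 1])  # close to [1. 0.], the displacement of the first key point
```

At each key point the interpolated field reproduces that key point's own displacement, and values blend smoothly in between, mimicking the role of the dense motion information without any learned weights.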
In step 208, multiple pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
In the driving effect image, the body part of the target object presents the reference action.
In this step, the motion information contains the displacement of each pixel. The pixels in the target image are moved according to the displacement amount and direction indicated by their respective displacements, yielding the adjusted target image, i.e., a coarsely deformed image. This adjusted target image serves as the driving effect image containing the target object presenting the reference action on the body part. As shown in FIG. 2G, the doll in FIG. 2G presents an action consistent with the reference action of the body part in the driving reference image of FIG. 2C: head tilted to the left, eyes looking up and to the left, and mouth wide open.
As shown in FIG. 3, the motion information and the target image are input into the coarse deformation module 33, which outputs the driving effect image obtained by adjusting the multiple pixels in the target image.
在其他例子中,可以在对目标图像进行上述调整后进一步对调整后的目标图像进行优化,提升图像的展示效果。比如去除噪声、修复缺损内容、调节亮度以及色彩增强等。In other examples, the adjusted target image may be further optimized after the above-mentioned adjustment is performed on the target image, so as to improve the display effect of the image. Such as removing noise, repairing missing content, adjusting brightness and color enhancement, etc.
本公开实施例提供的图像驱动方法,通过分别对目标图像和驱动参考图像进行身体部位关键点提取,得到目标对象的身体部位的关键点的位置组合和参考对象的该身体部位的关键点的位置组合,身体部位的关键点的位置组合为身体部位的动作,通过目标对象中该身体部位的关键点和参考对象中该身体部位的关键点的对应关系,得到目标图像中多个像素点的运动信息,从而能够按照运动信息中每个像素点的位移对目标图像的像素点进行相应的移动,这使得目标对象的身体部位呈现的动作与参考对象的该身体部位的呈现参考动作一致,最终得到呈现该参考动作的目标对象的身体部位的驱动效果图像。这不仅能使驱动目标对象的操作简便且驱动目标对象的处理效果较高,还能够采用上述方法来提升驱动效果的准确率。The image driving method provided by the embodiment of the present disclosure obtains the position combination of the key points of the body parts of the target object and the position of the key points of the body parts of the reference object by extracting the key points of the body parts from the target image and the driving reference image respectively. Combination, the position of the key points of the body part is combined into the action of the body part, and the movement of multiple pixels in the target image is obtained through the corresponding relationship between the key points of the body part in the target object and the key points of the body part in the reference object Information, so that the pixels of the target image can be moved correspondingly according to the displacement of each pixel in the motion information, which makes the action presented by the body part of the target object consistent with the reference action presented by the body part of the reference object, and finally obtains A driving effect image of the body part of the target subject of the reference motion is presented. This not only makes the operation of driving the target object simple and the processing effect of the driving target object is high, but also improves the accuracy of the driving effect by using the above method.
In addition, with the image driving model structured as shown in FIG. 3, training does not require shooting a video of a specific target object: the image driving model can be generated by end-to-end training on target sample images and driving sample images, which may be images of any object.

In previous image driving techniques, training an image driving model mostly required a video of a sample object, so that the model could fully learn the body-part features of that sample object, and the trained model could only drive that particular sample object; replacing the sample object required retraining the model. In contrast, the image driving model trained in this embodiment is not limited to driving a specific target object: any target object can be driven by the image driving model to obtain a corresponding driving effect image.

Moreover, in previous image driving techniques, acquiring the sample-object video needed for model training required instructing the sample object, for example asking it to perform certain actions, which made data acquisition comparatively difficult. The need to instruct the sample object also limited the range of target objects to which image driving could be applied; for example, objects other than real people could not be driven. This embodiment does not rely on the model first learning body-part features of the sample object, so no video of the sample object needs to be shot, and the method can be applied to a much wider range of target objects.
Here, the target sample image includes a body part of a sample object exhibiting a first motion, and the driving sample image includes the same body part of the sample object exhibiting a second motion. When training the image driving model, the target sample image and the driving sample image are input into an initial image driving model, which outputs a training image of the sample object's body part exhibiting a third motion; the initial image driving model is adjusted during training according to the difference between the training image and the driving sample image, and the image driving model used above is obtained after training.

During training, the images input to the body-part key point detection module 31 are the target sample image and the driving sample image, and the image output by the coarse deformation module 33 is the training image. In this embodiment, the target object in the target sample image and the reference object in the driving sample image used for training are the same object; the target sample image includes the body part of the sample object exhibiting the first motion, and the driving sample image includes the body part of the sample object exhibiting the second motion.

In the training phase, the training image produced by driving the target sample image and the driving sample image both include the sample object's body part, and the third motion exhibited by the body part in the training image is likely to differ from the second motion exhibited in the driving sample image: for example, the head may not be tilted far enough, or the mouth shapes may not match. Such differences manifest as differing pixel positions between the training image and the driving sample image. A network loss is computed from the difference between the training image and the driving sample image, and the network parameter values throughout the image driving model are adjusted jointly according to this loss.

The network parameter values of the image driving model may include those of the body-part key point detection network in the key point detection module 31 and those of the prediction neural network 3202 in the motion module 32.

In some embodiments, the network parameter values of the image driving model can be adjusted by backpropagation. Training ends when an iteration end condition is met, for example when a certain number of iterations has been reached or when the loss value falls below a certain threshold.
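The training scheme above (compare the model output with the driving sample image, update parameters from the loss, stop on an iteration count or a loss threshold) can be sketched on a toy one-parameter model. The numeric gradient below stands in for backpropagation, and all names are illustrative, not the patent's API:

```python
import numpy as np

def train(initial_param, target, forward, lr=0.1, max_iters=1000, tol=1e-4):
    """Toy end-to-end training loop mirroring the described scheme: the
    model output (training image) is compared with the target (driving
    sample image), the loss drives parameter updates, and training stops
    after max_iters iterations or when the loss drops below tol."""
    p = float(initial_param)
    loss = float("inf")
    for _ in range(max_iters):
        out = forward(p)
        loss = np.mean((out - target) ** 2)
        if loss < tol:                       # loss-threshold end condition
            break
        eps = 1e-5                           # finite difference stands in for backprop
        grad = (np.mean((forward(p + eps) - target) ** 2) - loss) / eps
        p -= lr * grad
    return p, loss
```

Here `forward` plays the role of the whole image driving model, and the squared pixel difference plays the role of the network loss.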
The target sample image and driving sample image used in this training process may be images of any sample object, so no video of the target object needs to be shot for training. Because the trained model does not rely on learned body-part features of a specific target object, but instead deforms the target image directly at the pixel level according to the correspondence between body-part key points, it is general purpose: for a different target object there is no need to collect new training samples and retrain the model, which improves training efficiency. Furthermore, since no sample object needs to be instructed in order to capture a training video, training is easier and sample collection for model training is simpler.
Furthermore, in one embodiment, a video of the driven target object, synchronized with the reference motions of the reference object's body part in a driving video, can be obtained from a single frame of the target image and the driving video. In some embodiments, in step 202, one frame of the target image and multiple frames of driving reference images from the driving video may first be acquired; the target image includes the body part of the target object, the multiple driving reference images include the body part of the same reference object, and the body part of the reference object exhibits a different reference motion in each driving reference image.

For example, suppose the driving video includes three driving reference images: in the first frame, the reference motion of the body part is left eye open and right eye closed; in the second frame, left eye closed and right eye open; in the third frame, both eyes open.

One driving reference image at a time is then taken from the multiple driving reference images to drive the target image. The processing order of the driving reference images is not limited here: the frames may be processed one by one in the order in which they appear in the driving video, or multiple driving reference images may be processed in parallel.

In step 204, body-part key point extraction on the target image only needs to be performed once, yielding the first positions corresponding to the key points of the target object's body part. When processing the driving reference images, key point extraction must be performed on every frame, yielding the second positions corresponding to the key points of the reference object's body part in each driving reference image. After all driving reference images have been processed, the motion information of a plurality of pixels in the target image is determined separately for each driving reference image, producing multiple driving effect images of the target object.
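The one-time extraction for the target image versus the per-frame extraction for the driving video can be sketched as a small driver loop. The callables here are hypothetical stand-ins for the patent's modules:

```python
def drive_with_video(target_image, driving_frames, extract_kp, compute_motion, warp):
    """Drive one target image with every frame of a driving video.
    extract_kp, compute_motion and warp are stand-ins for the key point
    detection module, the motion module and the coarse deformation module."""
    kp_target = extract_kp(target_image)       # first positions: computed once
    effect_frames = []
    for frame in driving_frames:
        kp_ref = extract_kp(frame)             # second positions: per driving frame
        motion = compute_motion(kp_target, kp_ref, target_image)
        effect_frames.append(warp(target_image, motion))
    return effect_frames                       # same count and order as the driving video
```

Composing `effect_frames` in the order of their driving reference images then yields the target video described below.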
In one example, in response to obtaining multiple driving effect images of the target object, equal in number to the multiple driving reference images and each exhibiting the reference motion of the reference object's body part in the corresponding driving reference image, a target video is generated from the multiple driving effect images; in the target video, the motions of the target object's body part match those of the reference object's body part in the driving video.

For example, continuing the example above, processing the three driving reference images in the driving video yields three driving effect images of the target object, each exhibiting the same body-part reference motion as the corresponding driving reference image. The three driving effect images are composed into the target video in the order of their corresponding driving reference images, so the three frames of the target video show the target object with left eye open and right eye closed, then left eye closed and right eye open, then both eyes open.

FIG. 5 is a flowchart of an image driving method according to at least one embodiment of the present disclosure, and FIG. 6 is a schematic structural diagram of an image driving model for implementing this method. The model is based on the image driving model shown in FIG. 3, with a newly added image generation module 34; the image generation module 34 may be an image generation network comprising an encoding network 341, a feature deformation unit 342 and a decoding network 343.

It should be noted that the model shown in FIG. 6 is only an exemplary network structure, and specific implementations are not limited to it. The processing flow of the image driving method, comprising steps 502 to 514, is described below with reference to FIG. 5 and FIG. 6; steps repeated from the above embodiments are not described again.
In step 502, a target image and a driving reference image are acquired.

As shown in FIG. 6, the target image and the driving reference image can be input to the body-part key point detection module 31 of the image driving model.

In step 504, the first positions corresponding to the key points of the target object's body part in the target image are identified, and the second positions corresponding to the key points of that body part of the reference object in the driving reference image are identified.

As shown in FIG. 6, the key point detection module 31 extracts body-part key points from the target object in the target image and from the reference object in the driving reference image, outputs the first positions corresponding to the key points of the target object's body part and the second positions corresponding to the key points of the reference object's body part, and feeds the first and second positions into the motion module 32.

In step 506, the motion information of a plurality of pixels in the target image is determined based on the correspondence between each of the first positions and each of the second positions.

As shown in FIG. 6, the first positions, the second positions and the target image are input into the motion module 32, which outputs the motion information of a plurality of pixels in the target image.

In step 508, a plurality of pixels in the target image are adjusted according to the motion information.

As shown in FIG. 6, the motion information and the target image are input into the coarse deformation module 33 to obtain the adjusted target image.

During the adjustment, the pixels of the target image are moved so as to deform the image and make the target object's body part exhibit the reference motion. To make the expression or posture exhibited by the target object more natural and realistic, the deformed regions of the coarsely deformed image require further processing.

In one example, after step 508, a driving effect image can be generated using the adjusted target image, the motion information and an image generation network (the image generation module 34). The image generation network may be a pre-trained neural network that refines the details of the adjusted target image with the help of the motion information, for example by removing noise, repairing missing content, adjusting brightness, or enhancing color.
In step 510, feature extraction is performed on the adjusted target image using the encoding network of the image generation network to obtain a feature map.

For example, the encoding network can be used to extract features from the adjusted target image to obtain the feature map, or the features may be extracted in other ways.

As shown in FIG. 6, the coarsely deformed image is input into the encoding network 341 of the image generation module 34 to obtain the feature map.

In step 512, the pixels in the feature map are adjusted based on the motion information to obtain an adjusted feature map.

For example, the feature map can be adjusted in the same way as the target image: the pixels of the feature map corresponding to pixels of the target image are moved by the displacements indicated by the motion information, yielding the adjusted feature map.

As shown in FIG. 6, the feature map and the motion information are input into the feature deformation unit 342 of the image generation module 34, which adjusts the pixels of the feature map and outputs the adjusted feature map.

As another example, the pixels of the target image that deform the most during the adjustment can first be identified, and only the corresponding pixels of the feature map adjusted, so that the adjustment is targeted.
To achieve such targeted adjustment, in one embodiment the motion information can be used to determine a mask corresponding to the target image, and the driving effect image is then generated using the adjusted target image, the motion information, the mask and the image generation network. The mask identifies how far each pixel moves when the pixels of the target image are adjusted according to the motion information, and it indicates to the image generation network which regions of the adjusted target image require focused optimization. Within the mask, pixels that need adjustment and pixels that do not can be distinguished by different labels. In some embodiments, the motion information is input into a mask generation network to obtain the mask corresponding to the target image; the mask may be an image of the same size as the target image.

The mask generation network may include a binary classifier that, according to the motion information, classifies each pixel as a large-movement pixel or a small-movement pixel. For example, when the confidence output by the binary classifier that a pixel is a large-movement pixel reaches a preset threshold, the pixel is determined to be a large-movement pixel; otherwise, it is determined to be a small-movement pixel. Large-movement pixels may be labeled 1 in the mask, and small-movement pixels labeled 0.

For example, suppose the reference motion of the body part is opening the mouth. When the target object's motion is adjusted to the reference motion, the pixels of the target image that move the most are those of the mouth region, so the mask corresponding to the target image may label the pixels of the mouth region 1 and the pixels of other regions 0.
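A minimal non-learned sketch of such a 0/1 mask, using a simple motion-magnitude threshold in place of the trained binary classifier (the threshold value is an assumption for illustration):

```python
import numpy as np

def motion_mask(flow, threshold=1.0):
    """Binary mask labeling large-movement pixels 1 and small-movement
    pixels 0.  A magnitude threshold stands in here for the mask
    generation network's learned binary classifier."""
    magnitude = np.linalg.norm(flow, axis=-1)   # per-pixel displacement length
    return (magnitude >= threshold).astype(np.float32)
```

The resulting mask has the same height and width as the target image, with 1s concentrated where the motion information moves pixels the most (e.g. the mouth region in the example above).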
After the mask is obtained, the pixels in the feature map can be adjusted based on the motion information and the mask.

For example, the pixels of the feature map corresponding to the regions labeled 1 in the mask can be adjusted, so that the subsequent decoding can complete and generate details for the heavily deformed parts of the target image while preserving the lightly deformed regions.

In practice, the mask can be fused with the feature map by the following formula, so that the feature map is adjusted according to the motion information:

adjusted feature map = feature map × mask + motion-deformed feature map × (1 − mask)  (1)

where the motion-deformed feature map is obtained by moving the pixels of the feature map corresponding to the target image by the displacements indicated by the motion information.
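Formula (1) is a straightforward per-pixel blend and can be written directly in code. The sketch below implements the equation exactly as stated, taking the already-deformed feature map as an input:

```python
import numpy as np

def fuse_features(feat, warped_feat, mask):
    """Formula (1): adjusted = feat * mask + warped_feat * (1 - mask),
    where warped_feat is the feature map deformed by the motion
    information and mask is the 0/1 map from the mask generation network.
    Broadcasts a (h, w) mask over (h, w, c) features if needed."""
    m = mask[..., None] if feat.ndim == mask.ndim + 1 else mask
    return feat * m + warped_feat * (1.0 - m)
```

Where the mask is 1, the original feature value is kept; where it is 0, the motion-deformed feature value is used, so the two terms partition the feature map per pixel.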
FIG. 7 shows yet another image driving model, in which the motion module 32 includes, in addition to the motion information calculation unit 320, a mask generation network 321. The motion information output by the motion information calculation unit 320 is input into the mask generation network 321 to obtain the mask corresponding to the target image. The mask, the feature map and the motion information are input into the feature deformation unit 342 of the image generation module 34, which adjusts the pixels of the feature map based on the motion information and the mask and outputs the adjusted feature map.

In step 514, the adjusted feature map is decoded by the decoding network of the image generation network to obtain the driving effect image.

For example, the decoding network can be used to decode the adjusted feature map to obtain the driving effect image, or the decoding may be performed in other ways. The encoding network and the decoding network in this embodiment may be convolutional neural networks.

As shown in FIG. 6, the adjusted feature map is input into the decoding network 343 of the image generation module 34, which outputs the driving effect image of the target object's body part exhibiting the reference motion. In the driving effect image, not only does the motion of the target object's body part match the reference motion, but, because details have been completed and generated for the heavily deformed parts, the body part is rendered with more complete detail and a more natural expression, and the image is closer to a genuine motion of the target object.

The image driving method provided by the embodiments of the present disclosure can drive a target object using a single driving reference image of a reference object, which simplifies the operation of driving the target object and effectively improves processing efficiency. While making the target object's body part in the target image exhibit the same motion as the reference object's body part in the driving reference image, the method also refines the details of the heavily deformed parts of the target image, so that the motion of the target object's body part appears more natural and realistic.

In addition, the image driving model structured as shown in FIG. 6 or FIG. 7 can still be trained with the training method described above for the model of FIG. 3. During training, the adjusted network parameter values of the image driving model may include the parameter values of at least one of the following networks: the body-part key point detection network in the key point detection module 31; the prediction neural network 3202 and the mask generation network 321 in the motion module 32; and the encoding network 341 and decoding network 343 in the image generation module 34.
FIG. 8 is a block diagram of an image driving apparatus according to at least one embodiment of the present disclosure. The apparatus includes an image acquisition module 81, a pixel motion module 82 and an image adjustment module 83.

The image acquisition module 81 is configured to acquire a target image and a driving reference image, the target image including a body part of a target object and the driving reference image including a body part of a reference object exhibiting a reference motion.

The pixel motion module 82 is configured to determine, based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, motion information of a plurality of pixels in the target image, the motion information being used to adjust the motion of the target object's body part to the reference motion.

The image adjustment module 83 is configured to adjust a plurality of pixels in the target image according to the motion information to obtain a driving effect image in which the target object's body part exhibits the reference motion.

In the image driving apparatus provided by the embodiments of the present disclosure, the pixels of the target image are adjusted according to the correspondence between the key points of the target object's body part and those of the reference object's body part, directly deforming the target image so that it exhibits the same body-part motion as the driving reference image. A single driving reference image of a reference object thus suffices to drive the target object, without uploading a video of the target object, which simplifies the operation of driving the target object and effectively improves processing efficiency.
在一些实施例中,所述像素运动模块82,用于:对识别所述目标图像中所述目标对象的身体部位的各个关键点分别对应的第一位置;以及,识别所述驱动参考图像中所述参考对象的身体部位的各个关键点分别对应的第二位置;基于各个所述第一位置和各个所述第二位置之间的对应关系,确定所述目标图像中多个像素点的运动信息。In some embodiments, the pixel motion module 82 is configured to: identify the first positions corresponding to each key point of the body part of the target object in the target image; and identify the key points in the driving reference image Second positions corresponding to each key point of the body part of the reference object; based on the correspondence between each of the first positions and each of the second positions, determining the movement of multiple pixel points in the target image information.
在一些实施例中,所述运动信息包括所述多个像素点中各个像素点各自对应的位移;所述图像调整模块83,还用于:将所述目标图像中多个像素点按照各自对应的位移进行移动。In some embodiments, the motion information includes the corresponding displacements of each pixel in the plurality of pixels; the image adjustment module 83 is further configured to: adjust the plurality of pixels in the target image according to their respective displacements displacement to move.
在一些实施例中,所述图像调整模块83,用于:根据所述运动信息对所述目标图像中多个像素点进行调整,得到调整后的目标图像,所述调整后的目标图像为所述驱动效果图像;或者,根据所述运动信息对所述目标图像中多个像素点进行调整,得到调整后的目标图像,利用所述调整后的目标图像、所述运动信息以及图像生成网络,生成所述驱动效果图像。In some embodiments, the image adjustment module 83 is configured to: adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and the adjusted target image is the the driving effect image; or, adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and use the adjusted target image, the motion information, and an image generation network, The driving effect image is generated.
在一些实施例中,所述图像生成网络包括编码网络和解码网络;所述图像调整模块 83,还用于:利用所述编码网络对所述调整后的目标图像进行特征提取,得到特征图;基于所述运动信息,对所述特征图中的像素点进行调整,得到调整后的特征图;利用所述解码网络对所述调整后的特征图进行解码处理,得到所述驱动效果图像。In some embodiments, the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; Based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map; the decoding network is used to decode the adjusted feature map to obtain the driving effect image.
In some embodiments, as shown in FIG. 9, the apparatus further includes a mask generation module 84, configured to determine, using the motion information, a mask corresponding to the target image, the mask identifying the degree of movement of each pixel point when the plurality of pixel points in the target image are adjusted according to the motion information; the image adjustment module 83 is further configured to generate the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network.
In some embodiments, the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: perform feature extraction on the adjusted target image using the encoding network to obtain a feature map; adjust the pixel points in the feature map based on the motion information and the mask to obtain an adjusted feature map; and decode the adjusted feature map using the decoding network to obtain the driving effect image.
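One plausible reading of the mask, sketched below as an assumption: pixels the motion field moves a lot receive mask values near 1, static pixels near 0, and the mask then blends warped and original features. In the described apparatus the mask is produced by the mask generation module from the motion information, not necessarily by this formula:

```python
import numpy as np

def mask_from_flow(flow):
    """Soft mask from motion magnitude: values near 1 for strongly moving
    pixels, 0 for static ones (an illustrative assumption; the patent's
    mask comes from the mask generation module)."""
    mag = np.linalg.norm(flow, axis=-1)
    return mag / (mag.max() + 1e-8)

def blend_with_mask(warped_feat, original_feat, mask):
    """Adjust feature-map pixels using both motion and mask: strongly
    moving regions take the warped features, static regions keep the
    original ones."""
    return mask * warped_feat + (1.0 - mask) * original_feat
```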
In some embodiments, the apparatus is implemented by an image driving model trained from target sample images and driving sample images, where a target sample image includes a body part of a sample object presenting a first action, and a driving sample image includes the body part of the sample object presenting a second action. During training, the target sample image and the driving sample image are input into an initial image driving model, which outputs a training image of the body part of the sample object presenting a third action; the initial image driving model is adjusted based on the difference between the training image and the driving sample image, and the image driving model is obtained after training.
In some embodiments, the image acquisition module 81 is further configured to: acquire multiple frames of driving reference images from a driving video, the multiple frames including the body part of a same reference object, with the reference actions presented by that body part differing across the driving reference images; and acquire one frame of the driving reference image from the multiple frames.
In some embodiments, the image adjustment module 83 is further configured to: in response to multiple frames of driving effect images of the target object being obtained based on the multiple frames of driving reference images, generate a target video based on the multiple frames of driving effect images, where the actions of the body part of the target object in the target video are consistent with the actions of the body part of the reference object in the driving video. The number of driving effect image frames equals the number of driving reference image frames, and each driving effect image presents the reference action of the body part of the reference object in the corresponding driving reference image.
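Driving a whole video reduces to applying the single-image method once per driving reference frame, which also guarantees the output has exactly as many frames as the driving video. `drive_one` below stands for any single-image driving function (a placeholder name, not from the source):

```python
def drive_video(target_image, driving_frames, drive_one):
    """Produce one driving effect frame per driving reference frame, so the
    generated target video has the same frame count as the driving video."""
    return [drive_one(target_image, ref) for ref in driving_frames]
```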
For the implementation of the functions and effects of the modules in the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
The image driving method of at least one embodiment of the present disclosure may be executed by an electronic device, for example, a terminal device, a server, or another processing device, where the terminal device may include user equipment, a mobile device, a terminal, a cellular phone, a cordless phone, a personal digital assistant, a handheld device, a computing device, a vehicle-mounted device, a wearable device, and the like. In some possible implementations, the image driving method may be implemented by a processor invoking computer-readable instructions stored in a memory.
An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 10, the electronic device includes a memory 11 and a processor 12; the memory 11 is configured to store computer instructions executable on the processor, and the processor 12 is configured to implement the image driving method of any embodiment of the present disclosure when executing the computer instructions.
An embodiment of the present disclosure further provides a computer program product, including a computer program/instructions that, when executed by a processor, implement the image driving method of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image driving method of any embodiment of the present disclosure.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant parts. The apparatus embodiments described above are merely illustrative; modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical modules, i.e., they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this specification. Those of ordinary skill in the art can understand and implement this without creative effort.
Some embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of this specification will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This specification is intended to cover any variations, uses, or adaptations thereof that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be considered exemplary only, with the true scope and spirit of the specification indicated by the following claims.
It should be understood that this specification is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely some embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (14)

  1. An image driving method, comprising:
    acquiring a target image and a driving reference image, the target image comprising a body part of a target object, and the driving reference image comprising a body part of a reference object presenting a reference action;
    determining motion information of a plurality of pixel points in the target image based on a correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action;
    adjusting the plurality of pixel points in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  2. The method according to claim 1, wherein determining the motion information of the plurality of pixel points in the target image based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object comprises:
    identifying first positions respectively corresponding to the key points of the body part of the target object in the target image, and identifying second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image;
    determining the motion information of the plurality of pixel points in the target image based on a correspondence between the first positions and the second positions.
  3. The method according to claim 1 or 2, wherein the motion information comprises a displacement corresponding to each of the plurality of pixel points;
    adjusting the plurality of pixel points in the target image according to the motion information to obtain the driving effect image comprises:
    moving the plurality of pixel points in the target image according to their respective displacements to obtain an adjusted target image, the adjusted target image being the driving effect image.
  4. The method according to any one of claims 1 to 3, wherein adjusting the plurality of pixel points in the target image according to the motion information to obtain the driving effect image comprises:
    adjusting the plurality of pixel points in the target image according to the motion information to obtain an adjusted target image;
    generating the driving effect image using the adjusted target image, the motion information, and an image generation network.
  5. The method according to claim 4, wherein the image generation network comprises an encoding network and a decoding network, and generating the driving effect image using the adjusted target image, the motion information, and the image generation network comprises:
    performing feature extraction on the adjusted target image using the encoding network to obtain a feature map;
    adjusting pixel points in the feature map based on the motion information to obtain an adjusted feature map;
    decoding the adjusted feature map using the decoding network to obtain the driving effect image.
  6. The method according to claim 4, further comprising:
    determining, using the motion information, a mask corresponding to the target image, the mask identifying the degree of movement of each pixel point when the plurality of pixel points in the target image are adjusted according to the motion information;
    wherein generating the driving effect image using the adjusted target image, the motion information, and the image generation network comprises:
    generating the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network.
  7. The method according to claim 6, wherein the image generation network comprises an encoding network and a decoding network, and generating the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network comprises:
    performing feature extraction on the adjusted target image using the encoding network to obtain a feature map;
    adjusting pixel points in the feature map based on the motion information and the mask to obtain an adjusted feature map;
    decoding the adjusted feature map using the decoding network to obtain the driving effect image.
  8. The method according to any one of claims 1 to 7, wherein the method is executed by an image driving model trained from target sample images and driving sample images, wherein a target sample image comprises a body part of a sample object presenting a first action, and a driving sample image comprises the body part of the sample object presenting a second action; during training, the target sample image and the driving sample image are input into an initial image driving model, the initial image driving model outputs a training image of the body part of the sample object presenting a third action, the initial image driving model is adjusted based on a difference between the training image and the driving sample image, and the image driving model is obtained after training.
  9. The method according to any one of claims 1 to 8, wherein acquiring the driving reference image comprises:
    acquiring multiple frames of driving reference images from a driving video, the multiple frames of driving reference images comprising the body part of a same reference object, with the reference actions presented by the body part of the reference object differing across the driving reference images;
    acquiring one frame of the driving reference image from the multiple frames of driving reference images.
  10. The method according to claim 9, further comprising:
    in response to multiple frames of driving effect images of the target object being obtained based on the multiple frames of driving reference images, generating a target video based on the multiple frames of driving effect images, the actions of the body part of the target object in the target video being consistent with the actions of the body part of the reference object in the driving video, wherein the number of the multiple frames of driving effect images equals the number of the multiple frames of driving reference images, and the driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images.
  11. An image driving apparatus, comprising:
    an image acquisition module, configured to acquire a target image and a driving reference image, the target image comprising a body part of a target object, and the driving reference image comprising a body part of a reference object presenting a reference action;
    a pixel motion module, configured to determine motion information of a plurality of pixel points in the target image based on a correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action;
    an image adjustment module, configured to adjust the plurality of pixel points in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  12. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method according to any one of claims 1 to 10 when executing the computer instructions.
  13. A computer program product, comprising a computer program/instructions that, when executed by a processor, implement the method according to any one of claims 1 to 10.
  14. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 10.
PCT/CN2022/134869 2022-02-17 2022-11-29 Image driving method and apparatus, device and medium WO2023155533A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210147579.9A CN114519727A (en) 2022-02-17 2022-02-17 Image driving method, device, equipment and medium
CN202210147579.9 2022-02-17

Publications (1)

Publication Number Publication Date
WO2023155533A1 true WO2023155533A1 (en) 2023-08-24

Family

ID=81599575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134869 WO2023155533A1 (en) 2022-02-17 2022-11-29 Image driving method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN114519727A (en)
WO (1) WO2023155533A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519727A (en) * 2022-02-17 2022-05-20 北京大甜绵白糖科技有限公司 Image driving method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN112766027A (en) * 2019-11-05 2021-05-07 广州虎牙科技有限公司 Image processing method, device, equipment and storage medium
CN113965773A (en) * 2021-11-03 2022-01-21 广州繁星互娱信息科技有限公司 Live broadcast display method and device, storage medium and electronic equipment
CN114519727A (en) * 2022-02-17 2022-05-20 北京大甜绵白糖科技有限公司 Image driving method, device, equipment and medium

Also Published As

Publication number Publication date
CN114519727A (en) 2022-05-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926837

Country of ref document: EP

Kind code of ref document: A1