WO2023155533A1 - Image driving method and apparatus, device and medium - Google Patents

Image driving method and apparatus, device and medium

Info

Publication number
WO2023155533A1
WO2023155533A1 (PCT/CN2022/134869)
Authority
WO
WIPO (PCT)
Prior art keywords
image
driving
target
body part
motion information
Prior art date
Application number
PCT/CN2022/134869
Other languages
French (fr)
Chinese (zh)
Inventor
唐斯伟
朱昊
吴文岩
范蕤
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023155533A1 publication Critical patent/WO2023155533A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Definitions

  • the embodiments of the present disclosure relate to the technical field of computer vision, and in particular to an image driving method and apparatus, a device, and a medium.
  • facial image driving means that, given a facial video, the facial movements in that video can be transferred to a facial image specified by the user.
  • when the face of a specific user is to be driven, a video of that specific user needs to be obtained for driving; this driving method is cumbersome and has low processing efficiency.
  • an image driving method, comprising: acquiring a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; determining motion information of a plurality of pixels in the target image based on the correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and adjusting the plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  • an image driving device, comprising: an acquisition module configured to acquire a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; a pixel motion module configured to determine motion information of a plurality of pixels in the target image based on the correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and an image adjustment module configured to adjust the plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  • in a third aspect, an electronic device includes a memory and a processor, the memory being used to store computer instructions executable on the processor, and the processor being used to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • a computer program product includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • the image driving method provided by the embodiments of the present disclosure adjusts the pixels in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so as to directly deform the target image, which makes the body part in the target image present the same action as the body part in the driving reference image.
  • in addition, a single driving reference image including the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • Fig. 1A is a flowchart of an image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 1B shows a key point mode according to at least one embodiment of the present disclosure.
  • Fig. 1C shows another key point mode according to at least one embodiment of the present disclosure.
  • Fig. 2A is a flowchart of another image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 2B is a schematic diagram of a target image shown in at least one embodiment of the present disclosure.
  • Fig. 2C is a schematic diagram of a driving reference image shown in at least one embodiment of the present disclosure.
  • Fig. 2D is a schematic diagram of key points of a body part of a target object shown in at least one embodiment of the present disclosure.
  • Fig. 2E is a schematic diagram of key points of a body part of a reference object shown in at least one embodiment of the present disclosure.
  • Fig. 2F shows a facial key point mode according to at least one embodiment of the present disclosure.
  • Fig. 2G shows a driving effect image according to at least one embodiment of the present disclosure.
  • Fig. 3 is a schematic structural diagram of an image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 4 is a schematic structural diagram of a motion module in an image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 5 is a flowchart of another image driving method shown in at least one embodiment of the present disclosure.
  • Fig. 6 is a schematic structural diagram of another image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of another image driving model shown in at least one embodiment of the present disclosure.
  • Fig. 8 is a block diagram of an image driving device shown in at least one embodiment of the present disclosure.
  • Fig. 9 is a block diagram of another image driving device shown in at least one embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to at least one embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure, and the method may include steps 102 to 106 .
  • in step 102, a target image and a driving reference image are acquired.
  • the target image includes the body parts of the target object
  • the driving reference image includes the body parts of the reference object presenting the reference action.
  • the method of this embodiment aims at transferring the reference action of the body part of the reference object to the body part of the target object, so that the body part of the target object can present the reference action.
  • the target object is a driven object. This embodiment does not limit the scope of the target object.
  • the target object may include a real person, animation character, cartoon image or doll in the target image.
  • the body part of the target object in the target image can be any body part such as head, face, limbs (such as hands, legs, torso, etc.), or a combination of at least two body parts.
  • the reference object may include a real person, animation character, cartoon image, or doll in the driving reference image.
  • the reference action presented by the reference object in the driving reference image may include any action; for example, when the body part is a face, the reference action presented by the reference object may be various facial expressions, such as frowning, turning the head, opening the mouth, etc.; when the body parts are limbs, the reference actions presented by the reference object may be various gesture actions, such as walking, greeting, raising hands, and so on.
  • for example, when the mouth action of a real person is used to drive an animation character, the image containing the mouth of the real person is the driving reference image, and the image containing the mouth of the animation character is the target image.
  • This embodiment does not limit the acquisition manners of the target image and the driving reference image.
  • a single target image designated by the user or uploaded by the user may be acquired.
  • for the driving reference image, a single driving reference image or multiple driving reference images specified or uploaded by the user may be acquired; alternatively, a driving video specified or uploaded by the user may first be acquired, and multiple driving reference images may then be obtained from the driving video.
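  • As an illustration of obtaining multiple driving reference images from a driving video, the following sketch picks evenly spaced frame indices; `sample_driving_frames` is a hypothetical helper introduced here for illustration, and actual frame decoding depends on the video library used:

```python
def sample_driving_frames(num_video_frames, num_reference_images):
    """Pick evenly spaced frame indices from a driving video so that
    multiple driving reference images can be taken from it (a minimal
    index-selection sketch; decoding the frames is library-dependent)."""
    if num_reference_images >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / num_reference_images
    return [int(i * step) for i in range(num_reference_images)]

# Take 5 driving reference images from a 100-frame driving video.
print(sample_driving_frames(100, 5))  # -> [0, 20, 40, 60, 80]
```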
  • in step 104, based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, motion information of a plurality of pixels in the target image is determined.
  • the motion information is used to adjust the motion of the body part of the target object to the reference motion.
  • the key points of a body part are used to describe the semantic information and position information of the body part, where the semantic information of the body part includes the category information of the body part; a key point of a body part can be any point in the region of the body part or in the region around the body part.
  • when the body part is a face, the key points of the body part are facial key points.
  • the facial key points may include points on the skin of the face or on the facial features, points on the contour line of the face and the contour lines of the facial features, points on the hair or neck, and may also include points on worn objects such as glasses, earrings, and hats.
  • when the body part is a limb, the key points of the body part are limb key points.
  • the limb key points can include points on the skeleton such as hand joints, elbow joints, the top of the head, ankle joints, and the tailbone, and can also include points on worn objects such as clothes and backpacks.
  • when the body part presents different actions, the positions of the key points of the body part can be different.
  • when the positions of the key points of the body part form different position combinations, this can indicate that the body part presents different actions.
  • This embodiment does not limit the number and position distribution of the key points of the body parts used, and key points with different numbers and position distributions can form different key point patterns.
  • for example, the 83 facial key points shown in Fig. 1B are respectively located on the contour line of the face and the contour lines of the eyes, eyebrows, nose and mouth; this position distribution of the 83 facial key points can be referred to as mode 1.
  • the 49 facial key points shown in FIG. 1C are respectively located on the contour lines of facial features.
  • the 49 facial key points distributed in different positions shown in FIG. 1C can be called mode 2.
  • the modes used to describe key points of body parts in the target object and the reference object are the same.
  • for example, when the facial key points of the target object include the 83 points shown in Fig. 1B, the facial key points of the reference object also include the 83 points shown in Fig. 1B.
  • the facial key points of the target object include 49 points on the facial features contour line shown in Figure 1C
  • the facial key points of the reference object also include 49 points on the facial features contour line shown in Figure 1C.
  • each key point of the body part in the target object and each key point of the body part in the reference object may be respectively extracted first.
  • the number and definition of the key points of the body parts in the target object and the key points of the body parts in the reference object are the same, and the key points of the body parts in the target object are in one-to-one correspondence with the key points of the body parts in the reference object.
  • the position combination of the key points of the body part of the reference object reflects the reference action presented by the body part of the reference object in the driving reference image; by adjusting the position combination of the key points of the body part of the target object to be consistent with that of the reference object, the body part of the target object can present the same reference action.
  • in some embodiments, the motion information of the pixels that need to be moved in the target image can be calculated according to the difference between the position of each key point of the body part of the target object and the position of the corresponding key point of the body part of the reference object.
  • for example, the position of each key point of the body part of the reference object in the driving reference image and the position of each key point of the body part of the target object in the target image can be input into a pre-trained neural network, and the motion information of the pixels in the target image can be obtained as output.
  • the motion information of each pixel may be the displacement of the pixel, and the displacement may be represented by components along the two dimensions of the horizontal axis and the vertical axis.
  • the motion information of each pixel may include optical flow information.
  • the motion information of each pixel can be represented by a two-dimensional vector, and the motion information of multiple pixels in the target image can be represented by a matrix.
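  • As a minimal sketch of this representation (the 4×4 image size and the displacement values are made up for illustration), the motion information of all pixels can be stored as an H×W×2 array of two-dimensional vectors:

```python
import numpy as np

H, W = 4, 4  # tiny image, purely for illustration
# Per-pixel motion information: a 2-D vector (dx, dy) for every pixel,
# stored as an H x W x 2 array -- a dense matrix of displacements.
motion = np.zeros((H, W, 2), dtype=np.float32)
motion[1:3, 1:3] = [0.5, -1.0]  # pixels in a local region move right and up

# The motion vector of the pixel at row 2, column 2:
print(motion[2, 2])  # -> [ 0.5 -1. ]
```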
  • in step 106, a plurality of pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
  • the body part of the target object in the driving effect image presents the reference motion.
  • when the motion information includes the displacements of pixels, multiple pixels in the target image are moved according to their respective displacements, so as to realize the deformation processing of the target image and obtain a driving effect image in which the body part of the target object presents the reference action.
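  • The deformation step can be sketched as a backward warp, where each output pixel samples the source pixel indicated by its displacement; this is a minimal nearest-neighbour illustration under assumed conventions, not the exact deformation used by the method:

```python
import numpy as np

def warp_image(image, motion):
    """Deform an image by moving each pixel according to a per-pixel
    displacement field of (dx, dy) vectors (nearest-neighbour sketch)."""
    H, W = image.shape[:2]
    out = np.zeros_like(image)
    # Backward mapping: for every output pixel, sample the source pixel
    # that the motion field says it moved from.
    for y in range(H):
        for x in range(W):
            dx, dy = motion[y, x]
            sy = int(round(y - dy))
            sx = int(round(x - dx))
            if 0 <= sy < H and 0 <= sx < W:
                out[y, x] = image[sy, sx]
    return out

# Shift a single bright pixel one pixel to the right.
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255
flow = np.zeros((5, 5, 2), dtype=np.float32)
flow[:, :, 0] = 1.0  # dx = 1 everywhere: every pixel moves right
warped = warp_image(img, flow)
print(np.argwhere(warped == 255))  # -> [[2 3]]
```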
  • the image driving method provided by the embodiments of the present disclosure adjusts the pixels in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so as to directly deform the target image; this makes the body part in the target image present the same action as the body part in the driving reference image.
  • in addition, a single driving reference image of the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • FIG. 2A is a flowchart of an image driving method according to at least one embodiment of the present disclosure
  • FIG. 3 illustrates a schematic structural diagram of an image driving model for implementing the image driving method of this embodiment.
  • the image driving model in Fig. 3 includes a body part key point detection module 31, a motion module 32 and a coarse deformation module 33.
  • the processing flow of the image driving method is described below with reference to FIG. 2A and FIG. 3 .
  • the processing flow of the image driving method includes step 202 to step 208 .
  • in step 202, a target image and a driving reference image are acquired.
  • the obtained target image may be as shown in FIG. 2B, and the target object is a cartoon character;
  • the obtained driving reference image may be as shown in Fig. 2C, where the reference object is a real person; it can be seen from the driving reference image that the reference actions presented by the reference object are: tilting the head to the left, looking up to the left, and opening the mouth wide.
  • the target image and the driving reference image can be input to the body part key point detection module 31 in the image driving model.
  • when the body part is a face, the body part key point detection module 31 is a face key point detection module; when the body part is a limb, the body part key point detection module 31 is a limb key point detection module; when the body part is a hand, the body part key point detection module 31 is a hand key point detection module.
  • in step 204, the first positions respectively corresponding to the key points of the body part of the target object in the target image are identified, and the second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image are identified.
  • identifying the first positions respectively corresponding to the key points of the body part of the target object in the target image may include extracting key points of the body part from the target image to obtain the first positions respectively corresponding to the key points of the body part of the target object; identifying the second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image may include extracting key points of the body part from the driving reference image to obtain the second positions respectively corresponding to the key points of the body part of the reference object.
  • the same extraction method can be used to extract key points of body parts from the target image and the driving reference image. This embodiment does not limit the specific manner of extracting key points of body parts.
  • a two-dimensional coordinate can be used to represent the position of the key point, which is recorded as the first position; for each key point of the body part of the reference object, a two-dimensional coordinate can be used The dimensional coordinates represent the position of the key point, which is recorded as the second position.
  • the number of key points of the body part of the target object is the same as the number of key points of the body part of the reference object, and there is a one-to-one correspondence. Since the body part of the target object exhibits different actions than the body part of the reference object, any first position may be different from the second position corresponding to the first position.
  • for example, when the key point m1 of the target object is located at the right corner of the mouth, the corresponding key point n1 of the reference object is also located at the right corner of the mouth.
  • the first position corresponding to m1 may be the coordinate (x1, y1) of m1
  • the second position corresponding to n1 may be the coordinate (x2, y2) of n1.
  • a body part key point detection network (Landmark Detector) can be used for body part key point extraction.
  • for example, a body part key point detection network is pre-trained; the target image and the driving reference image can then be sequentially input to the body part key point detection network, which outputs the first positions respectively corresponding to the detected key points of the body part of the target object and the second positions respectively corresponding to the detected key points of the body part of the reference object.
  • the extracted body part key points may follow a user-defined body part key point mode, or a body part key point mode automatically learned by the body part key point detection network during pre-training; when training the body part key point detection network, the number of body part key points that the network needs to learn can be set, and the number can be a specific value or a certain range.
  • the body part key point detection network can learn the key points of different patterns of body parts. When the trained body part key point detection network is applied, the key points of the corresponding body parts are extracted according to the body part key point patterns learned by it.
  • FIG. 2F shows a face key point pattern automatically learned by the facial key point detection network after training.
  • the neural network model used by the body part key point detection network can include models such as DAN (Deep Alignment Network), DCNN (Deep Convolutional Neural Network) or TCNN (Tweaked Convolutional Neural Network).
  • the body part key point detection network may include a self-supervised learning model, an unsupervised learning model or a supervised learning model.
  • the body part key point detection module 31 in the image driving model can be a body part key point detection network, which performs body part key point extraction on the target object in the target image and the reference object in the driving reference image respectively, outputs the first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object, and inputs the first positions and the second positions to the motion module 32.
  • in step 206, based on the correspondence between each of the first positions and each of the second positions, motion information of a plurality of pixels in the target image is determined.
  • the motion displayed by the body part of the target object can be adjusted to the reference motion by using the motion information.
  • the second positions corresponding to the key points of the body part of the reference object constitute a position combination, referred to here as the second position combination.
  • the first positions corresponding to the key points of the body part of the target object also constitute a position combination, referred to here as the first position combination.
  • the motion information comprises the displacements by which multiple pixels in the target image respectively move when the first position combination is adjusted to the second position combination.
  • the motion optical flow information may be calculated first according to the corresponding relationship between each first position and each second position.
  • the motion optical flow information may include optical flow information when pixels in the local area where the body part is located in the target image move respectively, and the calculation here may use a motion optical flow estimation method.
  • the Lucas-Kanade algorithm based on Taylor expansion can be used to calculate the motion optical flow information
  • the neural network based on deep learning such as FlowNet and FlowNet2.0 can also be used to calculate the motion optical flow information, and the specific implementation is not limited to this.
  • the four key points of the body part of the target object are the right mouth corner m1 , the left mouth corner m2 , the right eye m3 and the left eye m4 .
  • the four key points of the body part of the reference object are right mouth corner n1 , left mouth corner n2 , right eye n3 and left eye n4 .
  • m1 corresponds to n1
  • m2 corresponds to n2
  • m3 corresponds to n3
  • m4 corresponds to n4.
  • the first position used is the coordinate information of m1 , m2 , m3 and m4
  • the second position used is the coordinate information of n1 , n2 , n3 and n4 .
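  • Using the example key points m1..m4 and n1..n4 above, the following sketch computes sparse displacements from the corresponding first and second positions and densifies them by inverse-distance weighting. This is an illustrative stand-in for the motion estimation described in the text (not the Lucas-Kanade or FlowNet methods it mentions), and all coordinate values are made up:

```python
import numpy as np

# Hypothetical corresponding key points as (x, y) coordinates:
# m1..m4 from the target image, n1..n4 from the driving reference image.
first_positions = np.array([[30, 60], [50, 60], [30, 30], [50, 30]], float)   # m1..m4
second_positions = np.array([[32, 64], [48, 64], [30, 28], [50, 28]], float)  # n1..n4

# Sparse motion at each key point: the displacement (dx, dy) needed to
# move each target key point onto its corresponding reference position.
sparse_flow = second_positions - first_positions

def dense_flow(H, W, points, flows, eps=1e-6):
    """Densify sparse key-point displacements into a per-pixel flow field
    by inverse-distance weighting."""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)           # (H, W, 2) of (x, y)
    d = np.linalg.norm(grid[:, :, None, :] - points, axis=-1)  # (H, W, K)
    w = 1.0 / (d + eps)                                        # closer key points weigh more
    w /= w.sum(axis=-1, keepdims=True)
    return w @ flows                                           # (H, W, 2)

flow = dense_flow(90, 90, first_positions, sparse_flow)
```

At a key point itself the densified flow reproduces that key point's sparse displacement almost exactly, and it blends smoothly between key points elsewhere.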
  • the final motion information of multiple pixels in the target image can be calculated by combining the motion optical flow information with the target image.
  • a predictive neural network can be pre-trained, the input of the predictive neural network is the target image and motion optical flow information, and the output of the predictive neural network is the motion information of multiple pixels in the target image.
  • the motion module 32 includes a motion information calculation unit 320, wherein an algorithm 3201 is used to calculate motion optical flow information according to the correspondence between each first position and each second position , the motion optical flow information is the optical flow information when the pixels in the local area where the body parts are located in the target image move respectively.
  • the algorithm 3201 may include a motion optical flow estimation method, for example, the Lucas-Kanade algorithm based on Taylor expansion, and algorithms based on deep learning neural networks such as FlowNet and FlowNet2.0.
  • the prediction neural network 3202 is used to output motion information of multiple pixels in the target image according to the input motion optical flow information and the target image.
  • in step 208, a plurality of pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
  • the body part of the target object in the driving effect image exhibits the reference motion.
  • the motion information includes the displacement of each pixel, and the pixels in the target image are moved according to the magnitude and direction indicated by their respective displacements to obtain the adjusted target image, that is, the coarsely deformed image.
  • the final target image is used as the driving effect image, which contains the target object whose body part presents the reference action.
  • the doll in Figure 2G exhibits motions consistent with the reference motions driving the body parts in the reference image in Figure 2C: head tilted to the left, eyes looking up and to the left, and mouth wide open.
  • the motion information and the target image are input into the rough deformation module 33 , and the driving effect image after adjusting multiple pixels in the target image is output.
  • in some embodiments, the adjusted target image may be further optimized after the above adjustment, so as to improve the display effect of the image, such as removing noise, repairing missing content, adjusting brightness, and enhancing color.
  • the image driving method provided by the embodiments of the present disclosure extracts body part key points from the target image and the driving reference image respectively to obtain the position combination of the key points of the body part of the target object and the position combination of the key points of the body part of the reference object; since the position combination of body part key points encodes the action of the body part, the motion information of multiple pixels in the target image can be obtained through the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, and the pixels of the target image can then be moved according to the displacement of each pixel in the motion information. This makes the action presented by the body part of the target object consistent with the reference action presented by the body part of the reference object, and finally yields a driving effect image in which the body part of the target object presents the reference action. This not only makes the operation of driving the target object simple and the processing efficiency high, but also improves the accuracy of the driving effect.
  • when training the model, the target sample image and the driving sample image used for training may be images of any object.
  • the image driving model trained in this embodiment is not limited to driving a specific target object, and any target object can be driven by the image driving model to obtain a corresponding driving effect image.
  • the method does not rely on the model learning features of the body parts of the sample object before driving, so there is no need to shoot a video of the sample object, and the method can be used for a wider range of target objects.
  • the target sample image includes the body part of the sample object exhibiting the first action
  • the driving sample image includes the body part of the sample object exhibiting the second action.
  • the images input to the body part key point detection module 31 are the target sample image and the driving sample image,
  • the image output by the coarse deformation module 33 is the training image.
  • the target object in the target sample image used in training and the reference object in the driving sample image are the same object
  • the target sample image includes the body part of the sample object that presents the first action
  • the driving sample image includes the body part of the sample object presenting the second action.
  • Both the training image generated by performing image driving on the target sample image and the driving sample image include the body part of the sample object, but the third action presented by the body part of the sample object in the training image is likely to differ from the second action presented by the body part of the sample object in the driving sample image.
  • For example, the head may not be turned to a sufficient angle, or the mouth shape may be inconsistent.
  • That is, the positions of pixels in the training image and the driving sample image differ.
  • The network loss is calculated from the difference between the training image and the driving sample image, and the network parameter values in the image driving model are adjusted according to the network loss.
  • The network parameter values in the image driving model may include those of the body part key point detection network in the body part key point detection module 31 and those of the prediction neural network 3202 in the motion module 32.
  • network parameter values in the image-driven model can be adjusted by backpropagation.
  • When an end condition is met, the network training ends; the end condition may include the iteration count reaching a certain number, or the loss value falling below a certain threshold.
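The end condition above (iteration limit or loss below a threshold) can be sketched as a training-loop skeleton. Here `step_fn` stands in for one forward/backward pass and parameter update of the image driving model; the function names, limits, and the toy loss are illustrative assumptions.

```python
def train(step_fn, max_iters=1000, loss_threshold=1e-3):
    """Run training steps until the loss (e.g. the difference between the
    training image and the driving sample image) drops below the threshold
    or the iteration limit is reached. Returns the iterations executed."""
    iters = 0
    for _ in range(max_iters):
        loss = step_fn()       # one forward/backward pass + parameter update
        iters += 1
        if loss < loss_threshold:
            break              # loss end condition met
    return iters

# Toy stand-in for an optimisation step: the loss halves every iteration.
state = {"loss": 1.0}
def toy_step():
    state["loss"] *= 0.5
    return state["loss"]

iters = train(toy_step)        # 0.5**10 < 1e-3, so this stops after 10 steps
```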
  • the target sample image and driving sample image used in the training process can be images of any sample object, so there is no need to shoot a video of the target object for training.
  • Driving with the trained model does not rely on the model having learned the body part characteristics of the target object; instead, the target image is deformed directly at the pixel level according to the correspondence between body part key points. The trained model is therefore versatile, and there is no need to collect training samples again to retrain the model for different target objects, which improves model training efficiency.
  • the difficulty of training is also reduced, and the sample collection method for model training is simplified.
  • a video of the driven target object synchronized with the reference motion of the body part of the reference object in the driving video can be obtained.
  • One frame of target image and multiple frames of driving reference images in the driving video can be acquired first; the target image includes the body part of the target object, the multiple frames of driving reference images include the body part of the same reference object, and the body part of the reference object presents different reference actions in different driving reference images.
  • the reference action presented by the body parts in the first driving reference image is to open the left eye and close the right eye;
  • the reference action presented by the body parts in the second frame driving reference image is to close the left eye and open the right eye;
  • the reference action presented by the body parts in the third frame driving reference image is to open the left and right eyes.
  • one frame of the driving reference image is acquired from the multiple frames of the driving reference image to perform driving processing on the target image, and the processing order of the multiple frames of the driving reference image is not limited here.
  • Each frame of the driving reference image may be sequentially processed according to the sequence of the driving reference image in the driving video, or multiple frames of driving reference images may be processed in parallel.
  • In step 204, body part key point extraction needs to be performed on the target image only once to obtain the first positions corresponding to each key point of the target object's body part in the target image.
  • In contrast, body part key points need to be extracted from each frame of the driving reference image to obtain the second positions corresponding to each key point of the reference object's body part in that frame.
  • the motion information of multiple pixels in the target image is respectively determined by using each frame of the driving reference image, so as to obtain a multi-frame driving effect image of the target object.
  • The number of the multi-frame driving effect images is the same as that of the multi-frame driving reference images, and the driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images. A target video is generated based on the multi-frame driving effect images, and the body part actions of the target object in the target video are consistent with the body part actions of the reference object in the driving video.
  • For example, three frames of driving effect images of the target object are obtained after processing the three frames of driving reference images in the driving video, and the three frames of driving effect images respectively present the same body part actions as the corresponding three frames of driving reference images.
  • the three frames of driving effect images are synthesized into the target video according to the order of their corresponding driving reference images.
  • The three frames of images in the target video thus show the target object opening the left eye and closing the right eye, closing the left eye and opening the right eye, and opening both eyes.
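The multi-frame flow above (target key points extracted once, each driving reference frame processed in turn, and the effect images kept in frame order) can be sketched as follows. Every function passed in is a hypothetical stand-in, since the disclosure does not define these interfaces.

```python
def drive_video(target_image, driving_frames, detect_keypoints,
                compute_motion, warp):
    """Produce one driving effect image per driving reference frame."""
    target_kps = detect_keypoints(target_image)   # extracted only once
    effect_frames = []
    for frame in driving_frames:                  # sequential frame order
        ref_kps = detect_keypoints(frame)         # per-frame second positions
        motion = compute_motion(target_kps, ref_kps)
        effect_frames.append(warp(target_image, motion))
    return effect_frames

# Trivial stand-ins just to show the call shape and the one-to-one frame count.
frames = ["frame1", "frame2", "frame3"]
video = drive_video("target", frames,
                    detect_keypoints=lambda img: ("kps", img),
                    compute_motion=lambda t, r: (t, r),
                    warp=lambda img, m: (img, m))
```

Frames could also be processed in parallel, as the text notes; the loop above is only one of the two orders the disclosure permits.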
  • FIG. 5 is a flowchart of an image driving method shown in at least one embodiment of the present disclosure
  • FIG. 6 illustrates a schematic structural diagram of an image driving model for implementing the image driving method of this embodiment.
  • the image-driven model is based on the image-driven model shown in FIG. 3
  • an image generation module 34 is added.
  • The image generation module 34 may be an image generation network, including an encoding network 341, a feature deformation unit 342 and a decoding network 343.
  • FIG. 6 is only an exemplary network structure, and is not limited thereto in specific implementation.
  • the processing flow of the image driving method is described below with reference to FIG. 5 and FIG. 6 , wherein the steps repeated with the above embodiment will not be repeated, and the processing flow of the image driving method includes step 502 to step 514 .
  • step 502 a target image and a driving reference image are acquired.
  • the target image and the driving reference image can be input to the body part key point detection module 31 in the image driving model.
  • In step 504, first positions corresponding to each key point of the body part of the target object are identified in the target image, and second positions corresponding to each key point of the body part of the reference object are identified in the driving reference image.
  • The body part key point detection module 31 is used to extract body part key points from the target object in the target image and from the reference object in the driving reference image respectively, and to output the first positions corresponding to the key points of the target object's body part and the second positions corresponding to the key points of the reference object's body part; the first positions and the second positions are input to the motion module 32.
  • step 506 based on the correspondence between each of the first positions and each of the second positions, motion information of a plurality of pixels in the target image is determined.
  • the first position, the second position and the target image are input into the motion module 32 , and the motion information of multiple pixels in the target image is output.
  • step 508 a plurality of pixel points in the target image are adjusted according to the motion information.
  • each pixel point of the target image is moved to deform the target image, so that the body part of the target object in the target image presents a reference action.
  • the adjusted target image, motion information and image generation network can be used to generate a driving effect image.
  • The image generation network may be a pre-trained neural network, which can perform detail processing on the adjusted target image with the help of the motion information, such as removing noise, repairing defective content, adjusting brightness and enhancing color.
  • step 510 feature extraction is performed on the adjusted target image by using the encoding network in the image generation network to obtain a feature map.
  • an encoding network can be used to perform feature extraction on the adjusted target image to obtain a feature map, or it can also be extracted in other ways.
  • the coarsely deformed image is input to the encoding network 341 in the image generation module 34 to obtain a feature map.
  • step 512 based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map.
  • the feature map can be adjusted in the same way as adjusting the target image, and the pixel points corresponding to the target image in the feature map are moved according to the displacement indicated by the motion information to obtain the adjusted feature map.
  • the feature map and motion information are input into the feature deformation unit 342 in the image generation module 34 , the pixels in the feature map are adjusted, and the adjusted feature map is output.
  • The motion information may be used to determine a mask corresponding to the target image, and the adjusted target image, the motion information, the mask and the image generation network are used to generate the driving effect image.
  • the mask is used to identify the movement degree of each pixel in the process of adjusting the multiple pixels in the target image according to the motion information.
  • the mask can indicate to the image generation network the image area in the adjusted target image that needs to be optimized and adjusted. For each pixel in the mask, pixels that need to be adjusted and pixels that do not need to be adjusted can be distinguished using different identifiers.
  • the motion information is input into the mask generation network to obtain a mask corresponding to the target image, and the mask may be an image of the same size as the target image.
  • the mask generation network may include a binary classifier, which divides each pixel into a pixel with a large degree of movement or a pixel with a small degree of movement according to the motion information.
  • When the confidence output by the binary classifier that a certain pixel belongs to the large-movement class reaches a preset threshold, the pixel is determined to be a pixel with a large degree of movement; otherwise, when the confidence does not reach the preset threshold, the pixel is determined to be a pixel with a small degree of movement.
  • In the mask, pixels with a large degree of movement may be identified as 1, and pixels with a small degree of movement may be identified as 0.
  • For example, if the pixels with the largest degree of movement among the pixels of the target image are those of the mouth area, the mask corresponding to the target image may be an image in which the pixels of the mouth area are marked 1 and the pixels of other areas are marked 0.
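The confidence-thresholding step above can be sketched in a few lines; the confidence values and the 0.5 threshold here are illustrative assumptions, as the disclosure only says the threshold is preset.

```python
import numpy as np

def confidence_to_mask(confidence, threshold=0.5):
    """Mark pixels whose large-movement confidence reaches the preset
    threshold as 1, and all other pixels as 0."""
    return (confidence >= threshold).astype(np.float32)

# Toy per-pixel confidences from the hypothetical binary classifier.
confidence = np.array([[0.9, 0.2],
                       [0.6, 0.4]])
mask = confidence_to_mask(confidence)
```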
  • the pixels in the feature map can be adjusted based on the motion information and the mask.
  • The pixels in the feature map corresponding to the region marked 1 in the mask can be adjusted so that the subsequent decoding process can complete and generate details for the parts of the target image with large deformation, while the areas with little deformation are retained.
  • Adjusted feature map = feature map × mask + (feature map deformed by motion information) × (1 − mask)   (1)
  • the feature map deformed by the motion information is a deformed feature map obtained by moving the pixel points in the feature map corresponding to the target image according to the displacement indicated by the motion information.
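Equation (1) can be transcribed directly. In this sketch, `feat` is the encoder's feature map, `warped` is the feature map deformed by the motion information, and the toy values are illustrative assumptions.

```python
import numpy as np

def blend_features(feat, warped, mask):
    # Eq. (1): adjusted = feature map * mask
    #          + (feature map deformed by motion information) * (1 - mask)
    return feat * mask + warped * (1.0 - mask)

feat = np.full((2, 2), 3.0)     # encoder feature map
warped = np.full((2, 2), 7.0)   # feature map deformed by motion information
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
adjusted = blend_features(feat, warped, mask)
```

Where the mask is 1, the encoder feature is kept; where it is 0, the deformed feature is used, matching the equation term by term.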
  • the motion module 32 includes a mask generation network 321 in addition to the motion information calculation unit 320
  • The motion information output by the motion information calculation unit 320 is input into the mask generation network 321 to obtain the mask corresponding to the target image.
  • Input the mask, feature map and motion information into the feature deformation unit 342 in the image generation module 34, and the feature deformation unit 342 adjusts the pixels in the feature map based on the motion information and the mask, and outputs the adjusted feature map.
  • step 514 the adjusted feature map is decoded by the decoding network in the image generation network to obtain the driving effect image.
  • A decoding network can be used to decode the adjusted feature map to obtain the driving effect image, or the decoding can also be performed in other ways.
  • the encoding network and decoding network in this embodiment can use a convolutional neural network.
  • the adjusted feature map is input to the decoding network 343 in the image generation module 34 , and the driving effect image of the body part of the target object presenting the reference action is output.
  • the presented image effect is closer to the real action of the target object.
  • The image driving method provided by the embodiments of the present disclosure can drive the target object using a driving reference image of a single reference object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object. While making the body part of the target object in the target image present the same action as the body part of the reference object in the driving reference image, detail processing can be performed on the parts of the target image with large deformation, so that the motion of the target object's body part is more realistic and natural.
  • the training method for the image-driven model with the structure shown in FIG. 3 in the above embodiment can still be used.
  • The adjusted network parameter values of the image driving model may include at least one of the following: the body part key point detection network in the body part key point detection module 31, the prediction neural network 3202 and the mask generation network 321 in the motion module 32, and the encoding network 341 and the decoding network 343 in the image generation module 34.
  • FIG. 8 is a block diagram of an image driving device according to at least one embodiment of the present disclosure, and the device includes: an image acquisition module 81 , a pixel movement module 82 and an image adjustment module 83 .
  • the image acquisition module 81 is configured to acquire a target image and a driving reference image, the target image includes body parts of the target object, and the driving reference image includes body parts of the reference object exhibiting the reference action.
  • a pixel motion module 82, configured to determine the motion information of multiple pixel points in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, where the motion information is used to adjust the action of the body part of the target object to the reference action.
  • the image adjustment module 83 is configured to adjust a plurality of pixels in the target image according to the motion information to obtain a driving effect image, in which the body parts of the target object present the reference motion.
  • The image driving device provided by the embodiments of the present disclosure adjusts the pixel points in the target image according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, directly deforming the target image so that it presents the same body part action as the driving reference image.
  • The driving reference image of a single reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
  • the pixel motion module 82 is configured to: identify the first positions corresponding to each key point of the body part of the target object in the target image; identify the second positions corresponding to each key point of the body part of the reference object in the driving reference image; and determine the motion information of multiple pixel points in the target image based on the correspondence between the first positions and the second positions.
  • the motion information includes the displacement corresponding to each pixel among the plurality of pixels; the image adjustment module 83 is further configured to move the plurality of pixels in the target image according to their respective displacements.
  • the image adjustment module 83 is configured to: adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, the adjusted target image being the driving effect image; or, adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and generate the driving effect image using the adjusted target image, the motion information and an image generation network.
  • the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; Based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map; the decoding network is used to decode the adjusted feature map to obtain the driving effect image.
  • the device further includes a mask generation module 84, configured to use the motion information to determine a mask corresponding to the target image, where the mask is used to identify the degree of movement of each pixel in the process of adjusting multiple pixels in the target image according to the motion information; the image adjustment module 83 is further configured to generate the driving effect image using the adjusted target image, the motion information, the mask and the image generation network.
  • the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; adjust the pixels in the feature map based on the motion information and the mask to obtain an adjusted feature map; and use the decoding network to decode the adjusted feature map to obtain the driving effect image.
  • the functions of the apparatus are performed by an image driving model trained from target sample images and driving sample images, where a target sample image includes the body part of a sample object presenting the first action and a driving sample image includes the body part of the sample object presenting the second action; during training, the target sample image and the driving sample image are input into an initial image driving model, the initial image driving model outputs a training image presenting the body part of the sample object in a third action, the initial image driving model is adjusted through the difference between the training image and the driving sample image, and the image driving model is obtained after training.
  • the image acquisition module 81 is further configured to: acquire multiple frames of driving reference images in the driving video, wherein the multiple frames of driving reference images include body parts of the same reference object, and different driving reference images include The reference movements presented by the body parts of the reference subject are different; one frame of the driving reference image is obtained from the multiple frames of the driving reference image.
  • the image adjustment module 83 is further configured to: in response to obtaining multi-frame driving effect images of the target object based on the multi-frame driving reference images, generate a target video based on the multi-frame driving effect images, where the motion of the body part of the target object in the target video is consistent with the motion of the body part of the reference object in the driving video, the number of the multi-frame driving effect images is the same as that of the multi-frame driving reference images, and the multi-frame driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images.
  • the image driving method in at least one embodiment of the present disclosure may be performed by an electronic device, for example by a terminal device, a server or other processing device, where the terminal device may include a user device, a mobile device, a terminal, a cellular phone, a cordless phone, a personal digital assistant, a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the image driving method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • An embodiment of the present disclosure also provides an electronic device. As shown in FIG.
  • the device 12 is configured to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
  • An embodiment of the present disclosure further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, implements the image driving method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the image driving method described in any embodiment of the present disclosure is implemented.
  • Since the device embodiment basically corresponds to the method embodiment, for related parts, refer to the description of the method embodiment.
  • The device embodiments described above are only illustrative; the modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification, which can be understood and implemented by those skilled in the art without creative effort.

Abstract

Provided are an image driving method and apparatus, a device and a medium. The method comprises: acquiring a target image and a driving reference image, the target image comprising a body part of a target subject, and the driving reference image comprising a body part of a reference subject presenting a reference action; on the basis of a correspondence between key points of the body part of the target subject and key points of the body part of the reference subject, determining motion information of a plurality of pixel points in the target image, the motion information being used for adjusting an action of the body part of the target subject to be the reference action; and adjusting the plurality of pixel points in the target image according to the motion information so as to obtain a driving effect image, the body part of the target subject in the driving effect image presenting the reference action.

Description

An image driving method, apparatus, device and medium
Cross-Reference to Related Application
This application claims priority to Chinese Patent Application No. 202210147579.9, filed on February 17, 2022, the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of computer vision, and in particular to an image driving method, apparatus, device and medium.
Background
In popular computer vision applications such as virtual conferences and live photos, image driving technology is needed to drive the body parts of a target object to produce corresponding actions. For example, facial image driving means that, given a facial video, the facial movements in the video can be transferred onto a facial image specified by the user. However, driving a specific user's face requires obtaining a video of that user, which makes the driving operation cumbersome and the processing efficiency low.
Summary
In a first aspect, an image driving method is provided. The method includes: acquiring a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; determining motion information of multiple pixels in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and adjusting the multiple pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
In a second aspect, an image driving apparatus is provided. The apparatus includes: an image acquisition module, configured to acquire a target image and a driving reference image, the target image including a body part of a target object, and the driving reference image including a body part of a reference object presenting a reference action; a pixel motion module, configured to determine motion information of multiple pixels in the target image based on the correspondence between each key point of the body part of the target object and each key point of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action; and an image adjustment module, configured to adjust the multiple pixels in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
In a third aspect, an electronic device is provided. The device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the image driving method described in any embodiment of the present disclosure when executing the computer instructions.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the image driving method described in any embodiment of the present disclosure.
In a fifth aspect, a computer program product is provided. The product includes a computer program/instructions that, when executed by a processor, implement the image driving method described in any embodiment of the present disclosure.
In the image driving method provided by the embodiments of the present disclosure, the pixels in the target image are adjusted according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, directly deforming the target image so that it presents the same body part action as the driving reference image. Without uploading a video of the target object, a single driving reference image including the reference object can be used to drive the target object, which simplifies the operation of driving the target object and can effectively improve the processing efficiency of driving the target object.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of the present disclosure, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure;
FIG. 1B illustrates a key point mode according to at least one embodiment of the present disclosure;
FIG. 1C illustrates another key point mode according to at least one embodiment of the present disclosure;
FIG. 2A is a flowchart of another image driving method according to at least one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a target image according to at least one embodiment of the present disclosure;
FIG. 2C is a schematic diagram of a driving reference image according to at least one embodiment of the present disclosure;
FIG. 2D is a schematic diagram of key points of a body part of a target object according to at least one embodiment of the present disclosure;
FIG. 2E is a schematic diagram of key points of a body part of a reference object according to at least one embodiment of the present disclosure;
FIG. 2F illustrates a facial key point mode according to at least one embodiment of the present disclosure;
FIG. 2G illustrates a driving effect image according to at least one embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an image driving model according to at least one embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a motion module in an image driving model according to at least one embodiment of the present disclosure;
FIG. 5 is a flowchart of yet another image driving method according to at least one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of another image driving model according to at least one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of yet another image driving model according to at least one embodiment of the present disclosure;
FIG. 8 is a block diagram of an image driving apparatus according to at least one embodiment of the present disclosure;
FIG. 9 is a block diagram of another image driving apparatus according to at least one embodiment of the present disclosure;
FIG. 10 is a schematic diagram of the hardware structure of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. As used in this specification and the appended claims, the singular forms "a", "said", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
FIG. 1A is a flowchart of an image driving method according to at least one embodiment of the present disclosure. The method may include steps 102 to 106.
In step 102, a target image and a driving reference image are acquired.
The target image includes a body part of a target object, and the driving reference image includes a body part of a reference object presenting a reference action. The method of this embodiment aims to transfer the reference action of the body part of the reference object to the body part of the target object, so that the body part of the target object presents the reference action.
The target object is the object to be driven. This embodiment does not limit the scope of the target object: the target object may be a real person, an animation character, a cartoon figure, or a doll in the target image. The body part of the target object in the target image may be any single body part such as the head, face, or limbs (e.g., hands, legs, torso), or a combination of at least two body parts.
This embodiment likewise does not limit the scope of the reference object, which may be a real person, an animation character, a cartoon figure, or a doll in the driving reference image. The reference action presented by the reference object in the driving reference image may be any action. For example, when the body part is the face, the reference action may be any facial expression, such as frowning, turning the head, or opening the mouth wide; when the body part is a limb, the reference action may be any posture or gesture, such as walking, greeting, or raising a hand.
For example, to transfer the mouth-opening action of a real person to an animation character, the real person may be called the reference object and the animation character the target object. Correspondingly, the image containing the real person's mouth is the driving reference image, and the image containing the animation character's mouth is the target image.
This embodiment does not limit the manner in which the target image and the driving reference image are acquired. The target image may be a single image specified or uploaded by a user. The driving reference image may be a single image or multiple images specified or uploaded by a user; alternatively, a driving video specified or uploaded by a user may be acquired first, and multiple driving reference images obtained from that video.
In step 104, motion information of multiple pixels in the target image is determined based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object.
The motion information is used to adjust the action of the body part of the target object to the reference action. The key points of a body part describe the semantic information and position information of the body part, where the semantic information includes the category of the body part. A key point may be located anywhere in the region of the body part or in the region around it.
For example, when the body part is the face, the key points of the body part are facial key points, which may include points on the facial skin or facial features, points on the contour lines of the face and of the facial features, points on the hair or neck, and points on worn objects such as glasses, earrings, and hats.
When the body part is a limb, the key points of the body part are limb key points, which may include skeletal points such as the hand joints, elbow joints, crown of the head, ankle joints, and tailbone, and may also include points on objects such as clothes and backpacks.
For the same body part, the positions of its key points may differ when the action of the body part differs; different combinations of the positions of the key points therefore represent different actions of the body part.
This embodiment does not limit the number or spatial distribution of the body-part key points used; key points of different numbers and distributions form different key point modes. Taking the face as an example, the 83 facial key points shown in FIG. 1B are located on the contour lines of the face, eyes, eyebrows, nose, and mouth; this distribution may be called mode 1. The 49 facial key points shown in FIG. 1C are located on the contour lines of the facial features; this distribution may be called mode 2.
In this embodiment, the same mode is used to describe the key points of the body part in the target object and in the reference object.
For example, when mode 1 is used, the facial key points of the target object include the 83 points shown in FIG. 1B, and the facial key points of the reference object likewise include the 83 points shown in FIG. 1B.
Similarly, when mode 2 is used, the facial key points of the target object include the 49 points on the facial-feature contour lines shown in FIG. 1C, and so do the facial key points of the reference object.
In this step, the key points of the body part of the target object and the key points of the body part of the reference object may first be extracted separately. The number and definitions of the key points of the body part are the same in both objects, and the key points of the body part in the target object correspond one-to-one with those in the reference object.
The combination of key-point positions of the body part of the reference object reflects the reference action presented by that body part in the driving reference image. By adjusting the combination of key-point positions of the body part of the target object to be consistent with that of the reference object, the body part of the target object can be made to present the same reference action.
In some embodiments, the motion information of the multiple pixels to be moved in the target image may be calculated from the differences between the positions of the key points of the body part in the target object and the positions of the corresponding key points in the reference object. For example, the key-point positions of the body part of the reference object in the driving reference image and the key-point positions of the body part of the target object in the target image may be input into a pre-trained neural network, which outputs the motion information of the multiple pixels in the target image.
The motion information of each pixel may be the displacement of that pixel, expressed as motion components along the horizontal and vertical axes. Exemplarily, the motion information of each pixel may include optical flow information. The motion information of a single pixel can be represented by a two-dimensional vector, and the motion information of multiple pixels in the target image can be represented by a matrix.
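The vector-and-matrix representation described above can be sketched concretely: the motion information of the whole image is a dense displacement field, one two-dimensional vector per pixel, stored together as an H×W×2 array. A minimal illustration (the image size and the region that moves are invented for the example, not taken from the patent):

```python
import numpy as np

# Motion information for a hypothetical 4x5 target image:
# flow[y, x] = (dx, dy), the displacement of the pixel at (x, y).
h, w = 4, 5
flow = np.zeros((h, w, 2), dtype=np.float32)

# Suppose the region in rows 2-3 (say, around the mouth) should
# shift one pixel right and one pixel down.
flow[2:4, :, 0] = 1.0  # dx component
flow[2:4, :, 1] = 1.0  # dy component

# The motion information of a single pixel is a 2D vector ...
print(flow[3, 2])      # [1. 1.]
# ... and the whole field is the matrix of per-pixel motion information.
print(flow.shape)      # (4, 5, 2)
```

Common optical-flow tools use the same per-pixel (dx, dy) layout, which is why a single array suffices to carry the motion information of every pixel.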
In step 106, multiple pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
In the driving effect image, the body part of the target object presents the reference action.
For example, when the motion information includes pixel displacements, the multiple pixels in the target image are moved according to their respective displacements, thereby deforming the target image and obtaining a driving effect image in which the body part of the target object presents the reference action.
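The deformation step above, moving every pixel by its own displacement, is commonly realized as backward warping: each output pixel samples the source image at the position its displacement points to. A rough sketch with nearest-neighbor sampling (a simplification; a practical deformation module would typically use bilinear sampling):

```python
import numpy as np

def warp(image, flow):
    """Backward-warp `image` by a dense displacement field.

    image: (H, W) array; flow: (H, W, 2) array where flow[y, x] is the
    (dx, dy) offset at which output pixel (x, y) samples the input.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

img = np.arange(9).reshape(3, 3)
# Shift the content one pixel to the left: each output pixel samples
# the source one column to its right (clipped at the border).
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0
print(warp(img, flow))  # first row becomes [1 2 2]
```

Clipping at the image border is one simple way to handle displacements that point outside the source image.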
In the image driving method provided by the embodiments of the present disclosure, pixels in the target image are adjusted according to the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, so that the target image is directly deformed and presents the same body-part action as the driving reference image. A single driving reference image of the reference object is sufficient to drive the target object, without uploading any video of the target object. This simplifies the operation of driving the target object and effectively improves the processing efficiency of driving the target object.
FIG. 2A is a flowchart of another image driving method according to at least one embodiment of the present disclosure, and FIG. 3 illustrates the structure of an image driving model for implementing the image driving method of this embodiment. The image driving model in FIG. 3 includes a body-part key point detection module 31, a motion module 32, and a coarse deformation module 33.
It should be noted that the model shown in FIG. 3 is only an exemplary network structure, and specific implementations are not limited to it. The processing flow of the image driving method, which includes steps 202 to 208, is described below with reference to FIG. 2A and FIG. 3.
In step 202, a target image and a driving reference image are acquired.
Exemplarily, the acquired target image may be as shown in FIG. 2B, where the target object is a cartoon character; the acquired driving reference image may be as shown in FIG. 2C, where the reference object is a real person. The reference action presented by the reference object in the driving reference image is: tilting the head to the left, looking up and to the left, and opening the mouth wide.
As shown in FIG. 3, the target image and the driving reference image may be input into the body-part key point detection module 31 of the image driving model. When the body part is the face, module 31 is a facial key point detection module; when the body part is a limb, module 31 is a limb key point detection module; when the body part is a hand, module 31 is a hand key point detection module.
In step 204, first positions respectively corresponding to the key points of the body part of the target object in the target image are identified, and second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image are identified.
In this step, identifying the first positions may include performing body-part key point extraction on the target image to obtain the first positions respectively corresponding to the key points of the body part of the target object; identifying the second positions may include performing body-part key point extraction on the driving reference image to obtain the second positions respectively corresponding to the key points of the body part of the reference object. The same extraction method may be used for both the target image and the driving reference image. This embodiment does not limit the specific extraction method: the key points may be extracted by a neural network, or by other algorithms such as methods based on cascaded shape regression or component-based detection algorithms.
In this embodiment, the position of each key point of the body part of the target object may be represented by a two-dimensional coordinate, recorded as a first position; likewise, the position of each key point of the body part of the reference object may be represented by a two-dimensional coordinate, recorded as a second position.
The number of key points of the body part of the target object is the same as the number of key points of that body part of the reference object, with a one-to-one correspondence. Since the action presented by the body part of the target object differs from the action presented by the body part of the reference object, any first position may differ from its corresponding second position.
For example, taking the face as the body part, as shown in FIG. 2D, one key point m1 of the target object is located at the right corner of the mouth; as shown in FIG. 2E, the corresponding key point n1 of the reference object is likewise located at the right corner of the mouth. The first position corresponding to m1 may be the coordinate (x1, y1) of m1, and the second position corresponding to n1 may be the coordinate (x2, y2) of n1.
Exemplarily, a body-part key point detection network (landmark detector) may be used for key point extraction. A body-part key point detection network is pre-trained; the target image and the driving reference image may then be input into it in turn, and the network respectively outputs the detected first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object.
It should be noted that when a body-part key point detection network is used for extraction, the extracted key points may follow a user-defined key point mode, or a key point mode that the network learns automatically during pre-training. When training the body-part key point detection network, the number of key points it needs to learn can be set, either as a specific value or as a range. With different training methods, the network can learn different key point modes; once trained, the network extracts key points according to the mode it has learned.
As shown in FIG. 2F, FIG. 2F shows a facial key point mode automatically learned by a facial key point detection network through training.
The neural network model used by the body-part key point detection network may include DAN (Deep Alignment Network), DCNN (Deep Convolutional Neural Network), or TCNN (Tweaked Convolutional Neural Network), among others. The body-part key point detection network may be a self-supervised, unsupervised, or supervised learning model.
As shown in FIG. 3, the body-part key point detection module 31 of the image driving model may be such a key point detection network, used to perform body-part key point extraction on the target object in the target image and on the reference object in the driving reference image respectively. It outputs the first positions corresponding to the key points of the body part of the target object and the second positions corresponding to the key points of the body part of the reference object, and inputs the first positions and second positions into the motion module 32.
In step 206, motion information of multiple pixels in the target image is determined based on the correspondence between the first positions and the second positions.
The motion information can be used to adjust the action presented by the body part of the target object to the reference action. The second positions respectively corresponding to the key points of the body part of the reference object form a position combination, here called the second position combination; the first positions respectively corresponding to the key points of the body part of the target object likewise form a first position combination. The motion information comprises the displacements by which the multiple pixels in the target image respectively move when the first position combination is adjusted to the second position combination.
In this step, motion optical flow information may first be calculated from the correspondence between the first positions and the second positions. The motion optical flow information may include the optical flow of the pixels in the local region of the target image where the body part is located, and may be computed by motion optical flow estimation. For example, the Lucas-Kanade algorithm based on Taylor expansion may be used, as may deep-learning-based neural networks such as FlowNet and FlowNet 2.0; specific implementations are not limited to these.
Exemplarily, as shown in FIG. 2D, four key points of the body part of the target object are the right mouth corner m1, the left mouth corner m2, the right eye m3, and the left eye m4. As shown in FIG. 2E, four key points of the body part of the reference object are the right mouth corner n1, the left mouth corner n2, the right eye n3, and the left eye n4, where m1 corresponds to n1, m2 to n2, m3 to n3, and m4 to n4. When calculating the motion optical flow information, the first positions used are the coordinates of m1, m2, m3, and m4, and the second positions used are the coordinates of n1, n2, n3, and n4.
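Given the four corresponding pairs above, the displacement of each key point is simply the difference between its second position and its matching first position. A toy sketch (the coordinates of m1..m4 and n1..n4 are invented for illustration; the patent does not specify them):

```python
import numpy as np

# First positions: key points m1..m4 of the target object, as (x, y).
first = np.array([[60, 80], [40, 80], [58, 40], [42, 40]], dtype=float)
# Second positions: corresponding key points n1..n4 of the reference object.
second = np.array([[62, 86], [38, 84], [60, 38], [40, 36]], dtype=float)

# Displacement that carries each target key point onto its
# corresponding reference key point (m1 -> n1, ..., m4 -> n4).
kp_disp = second - first
print(kp_disp[0])  # displacement of the right mouth corner m1 -> n1: [2. 6.]
```

These sparse per-key-point displacements are the raw material from which the dense per-pixel motion information is then estimated.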
After the motion optical flow information is obtained, the final motion information of the multiple pixels in the target image can be calculated from the motion optical flow information together with the target image. For example, a prediction neural network may be pre-trained whose input is the target image and the motion optical flow information and whose output is the motion information of the multiple pixels in the target image.
As shown in FIG. 3, the first positions, the second positions, and the target image are input into the motion module 32, which outputs the motion information. In one example, as shown in FIG. 4, the motion module 32 includes a motion information calculation unit 320, in which algorithm 3201 calculates the motion optical flow information from the correspondence between the first positions and the second positions; the motion optical flow information is the optical flow of the pixels in the local region of the target image where the body part is located. Algorithm 3201 may be a motion optical flow estimation method, for example the Lucas-Kanade algorithm based on Taylor expansion, or a deep-learning-based neural network such as FlowNet or FlowNet 2.0. The prediction neural network 3202 outputs the motion information of the multiple pixels in the target image from the input motion optical flow information and the target image.
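The patent leaves the internals of algorithm 3201 and the prediction neural network 3202 unspecified. As a rough, self-contained stand-in for the motion module as a whole, the sketch below spreads sparse key-point displacements into a dense per-pixel field by inverse-distance weighting; an actual implementation would use a motion optical flow estimator and a trained prediction network instead:

```python
import numpy as np

def dense_motion(first_pts, second_pts, h, w, eps=1e-6):
    """Interpolate sparse key-point displacements to an (h, w, 2) field.

    first_pts / second_pts: (K, 2) arrays of (x, y) positions of
    corresponding key points. Inverse-distance weighting here is only a
    simple stand-in for the learned prediction network.
    """
    disp = second_pts - first_pts                       # (K, 2)
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(float)    # (h, w, 2)
    # Distance from every pixel to every key point: (h, w, K)
    d = np.linalg.norm(grid[:, :, None, :] - first_pts[None, None], axis=-1)
    wgt = 1.0 / (d + eps)
    wgt /= wgt.sum(axis=-1, keepdims=True)              # normalized weights
    return wgt @ disp                                   # (h, w, 2)

first = np.array([[1.0, 1.0], [3.0, 3.0]])
second = np.array([[2.0, 1.0], [3.0, 4.0]])
field = dense_motion(first, second, 5, 5)
print(field[1, 1])  # close to [1. 0.], the displacement of the first key point
```

At each key point the interpolated field reproduces that key point's own displacement, and values blend smoothly in between, mimicking the role of the dense motion information without any learned weights.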
In step 208, multiple pixels in the target image are adjusted according to the motion information to obtain a driving effect image.
In the driving effect image, the body part of the target object presents the reference action.
In this step, the motion information contains the displacement of each pixel. The pixels in the target image are moved according to the displacement amount and direction indicated by their respective displacements, yielding the adjusted target image, i.e., a coarsely deformed image. This adjusted target image serves as the driving effect image containing the target object presenting the reference action on the body part. As shown in FIG. 2G, the doll in FIG. 2G presents an action consistent with the reference action of the body part in the driving reference image of FIG. 2C: head tilted to the left, eyes looking up and to the left, and mouth wide open.
As shown in FIG. 3, the motion information and the target image are input into the coarse deformation module 33, which outputs the driving effect image obtained by adjusting the multiple pixels in the target image.
在其他例子中,可以在对目标图像进行上述调整后进一步对调整后的目标图像进行优化,提升图像的展示效果。比如去除噪声、修复缺损内容、调节亮度以及色彩增强等。In other examples, the adjusted target image may be further optimized after the above-mentioned adjustment is performed on the target image, so as to improve the display effect of the image. Such as removing noise, repairing missing content, adjusting brightness and color enhancement, etc.
本公开实施例提供的图像驱动方法,通过分别对目标图像和驱动参考图像进行身体部位关键点提取,得到目标对象的身体部位的关键点的位置组合和参考对象的该身体部位的关键点的位置组合,身体部位的关键点的位置组合为身体部位的动作,通过目标对象中该身体部位的关键点和参考对象中该身体部位的关键点的对应关系,得到目标图像中多个像素点的运动信息,从而能够按照运动信息中每个像素点的位移对目标图像的像素点进行相应的移动,这使得目标对象的身体部位呈现的动作与参考对象的该身体部位的呈现参考动作一致,最终得到呈现该参考动作的目标对象的身体部位的驱动效果图像。这不仅能使驱动目标对象的操作简便且驱动目标对象的处理效果较高,还能够采用上述方法来提升驱动效果的准确率。The image driving method provided by the embodiment of the present disclosure obtains the position combination of the key points of the body parts of the target object and the position of the key points of the body parts of the reference object by extracting the key points of the body parts from the target image and the driving reference image respectively. Combination, the position of the key points of the body part is combined into the action of the body part, and the movement of multiple pixels in the target image is obtained through the corresponding relationship between the key points of the body part in the target object and the key points of the body part in the reference object Information, so that the pixels of the target image can be moved correspondingly according to the displacement of each pixel in the motion information, which makes the action presented by the body part of the target object consistent with the reference action presented by the body part of the reference object, and finally obtains A driving effect image of the body part of the target subject of the reference motion is presented. This not only makes the operation of driving the target object simple and the processing effect of the driving target object is high, but also improves the accuracy of the driving effect by using the above method.
In addition, with the image driving model structured as shown in FIG. 3, training does not require shooting a video of a specific target object: the image driving model can be generated by end-to-end training on target sample images and driving sample images, which may be images of any object.

In previous image driving techniques, training an image driving model mostly required a video of a sample object, so that the model could fully learn the body-part features of that sample object, and the trained model could only drive that particular sample object; replacing the sample object required retraining the model. In contrast, the image driving model trained in this embodiment is not limited to driving a specific target object: any target object can be driven by the image driving model to obtain a corresponding driving effect image.

Moreover, in previous image driving techniques, acquiring the sample-object video needed for model training required instructing the sample object, for example asking it to perform certain actions, which made data acquisition comparatively difficult. The need to instruct the sample object also limited the range of target objects to which image driving could be applied; for example, objects other than real people could not be driven. This embodiment does not rely on the model first learning body-part features of the sample object, so no video of the sample object needs to be shot, and the method can be applied to a much wider range of target objects.
Here, the target sample image includes a body part of a sample object exhibiting a first motion, and the driving sample image includes the same body part of the sample object exhibiting a second motion. When training the image driving model, the target sample image and the driving sample image are input into an initial image driving model, which outputs a training image of the sample object's body part exhibiting a third motion; the initial image driving model is adjusted during training according to the difference between the training image and the driving sample image, and the image driving model used above is obtained after training.

During training, the images input to the body-part key point detection module 31 are the target sample image and the driving sample image, and the image output by the coarse deformation module 33 is the training image. In this embodiment, the target object in the target sample image and the reference object in the driving sample image used for training are the same object; the target sample image includes the body part of the sample object exhibiting the first motion, and the driving sample image includes the body part of the sample object exhibiting the second motion.

In the training phase, the training image produced by driving the target sample image and the driving sample image both include the sample object's body part, and the third motion exhibited by the body part in the training image is likely to differ from the second motion exhibited in the driving sample image: for example, the head may not be tilted far enough, or the mouth shapes may not match. Such differences manifest as differing pixel positions between the training image and the driving sample image. A network loss is computed from the difference between the training image and the driving sample image, and the network parameter values throughout the image driving model are adjusted jointly according to this loss.

The network parameter values of the image driving model may include those of the body-part key point detection network in the key point detection module 31 and those of the prediction neural network 3202 in the motion module 32.

In some embodiments, the network parameter values of the image driving model can be adjusted by backpropagation. Training ends when an iteration end condition is met, for example when a certain number of iterations has been reached or when the loss value falls below a certain threshold.
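The training scheme above (compare the model output with the driving sample image, update parameters from the loss, stop on an iteration count or a loss threshold) can be sketched on a toy one-parameter model. The numeric gradient below stands in for backpropagation, and all names are illustrative, not the patent's API:

```python
import numpy as np

def train(initial_param, target, forward, lr=0.1, max_iters=1000, tol=1e-4):
    """Toy end-to-end training loop mirroring the described scheme: the
    model output (training image) is compared with the target (driving
    sample image), the loss drives parameter updates, and training stops
    after max_iters iterations or when the loss drops below tol."""
    p = float(initial_param)
    loss = float("inf")
    for _ in range(max_iters):
        out = forward(p)
        loss = np.mean((out - target) ** 2)
        if loss < tol:                       # loss-threshold end condition
            break
        eps = 1e-5                           # finite difference stands in for backprop
        grad = (np.mean((forward(p + eps) - target) ** 2) - loss) / eps
        p -= lr * grad
    return p, loss
```

Here `forward` plays the role of the whole image driving model, and the squared pixel difference plays the role of the network loss.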
The target sample image and driving sample image used in this training process may be images of any sample object, so no video of the target object needs to be shot for training. Because the trained model does not rely on learned body-part features of a specific target object, but instead deforms the target image directly at the pixel level according to the correspondence between body-part key points, it is general purpose: for a different target object there is no need to collect new training samples and retrain the model, which improves training efficiency. Furthermore, since no sample object needs to be instructed in order to capture a training video, training is easier and sample collection for model training is simpler.
Furthermore, in one embodiment, a video of the driven target object, synchronized with the reference motions of the reference object's body part in a driving video, can be obtained from a single frame of the target image and the driving video. In some embodiments, in step 202, one frame of the target image and multiple frames of driving reference images from the driving video may first be acquired; the target image includes the body part of the target object, the multiple driving reference images include the body part of the same reference object, and the body part of the reference object exhibits a different reference motion in each driving reference image.

For example, suppose the driving video includes three driving reference images: in the first frame, the reference motion of the body part is left eye open and right eye closed; in the second frame, left eye closed and right eye open; in the third frame, both eyes open.

One driving reference image at a time is then taken from the multiple driving reference images to drive the target image. The processing order of the driving reference images is not limited here: the frames may be processed one by one in the order in which they appear in the driving video, or multiple driving reference images may be processed in parallel.

In step 204, body-part key point extraction on the target image only needs to be performed once, yielding the first positions corresponding to the key points of the target object's body part. When processing the driving reference images, key point extraction must be performed on every frame, yielding the second positions corresponding to the key points of the reference object's body part in each driving reference image. After all driving reference images have been processed, the motion information of a plurality of pixels in the target image is determined separately for each driving reference image, producing multiple driving effect images of the target object.
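The one-time extraction for the target image versus the per-frame extraction for the driving video can be sketched as a small driver loop. The callables here are hypothetical stand-ins for the patent's modules:

```python
def drive_with_video(target_image, driving_frames, extract_kp, compute_motion, warp):
    """Drive one target image with every frame of a driving video.
    extract_kp, compute_motion and warp are stand-ins for the key point
    detection module, the motion module and the coarse deformation module."""
    kp_target = extract_kp(target_image)       # first positions: computed once
    effect_frames = []
    for frame in driving_frames:
        kp_ref = extract_kp(frame)             # second positions: per driving frame
        motion = compute_motion(kp_target, kp_ref, target_image)
        effect_frames.append(warp(target_image, motion))
    return effect_frames                       # same count and order as the driving video
```

Composing `effect_frames` in the order of their driving reference images then yields the target video described below.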
In one example, in response to obtaining multiple driving effect images of the target object, equal in number to the multiple driving reference images and each exhibiting the reference motion of the reference object's body part in the corresponding driving reference image, a target video is generated from the multiple driving effect images; in the target video, the motions of the target object's body part match those of the reference object's body part in the driving video.

For example, continuing the example above, processing the three driving reference images in the driving video yields three driving effect images of the target object, each exhibiting the same body-part reference motion as the corresponding driving reference image. The three driving effect images are composed into the target video in the order of their corresponding driving reference images, so the three frames of the target video show the target object with left eye open and right eye closed, then left eye closed and right eye open, then both eyes open.

FIG. 5 is a flowchart of an image driving method according to at least one embodiment of the present disclosure, and FIG. 6 is a schematic structural diagram of an image driving model for implementing this method. The model is based on the image driving model shown in FIG. 3, with a newly added image generation module 34; the image generation module 34 may be an image generation network comprising an encoding network 341, a feature deformation unit 342 and a decoding network 343.

It should be noted that the model shown in FIG. 6 is only an exemplary network structure, and specific implementations are not limited to it. The processing flow of the image driving method, comprising steps 502 to 514, is described below with reference to FIG. 5 and FIG. 6; steps repeated from the above embodiments are not described again.
In step 502, a target image and a driving reference image are acquired.

As shown in FIG. 6, the target image and the driving reference image can be input to the body-part key point detection module 31 of the image driving model.

In step 504, the first positions corresponding to the key points of the target object's body part in the target image are identified, and the second positions corresponding to the key points of that body part of the reference object in the driving reference image are identified.

As shown in FIG. 6, the key point detection module 31 extracts body-part key points from the target object in the target image and from the reference object in the driving reference image, outputs the first positions corresponding to the key points of the target object's body part and the second positions corresponding to the key points of the reference object's body part, and feeds the first and second positions into the motion module 32.

In step 506, the motion information of a plurality of pixels in the target image is determined based on the correspondence between each of the first positions and each of the second positions.

As shown in FIG. 6, the first positions, the second positions and the target image are input into the motion module 32, which outputs the motion information of a plurality of pixels in the target image.

In step 508, a plurality of pixels in the target image are adjusted according to the motion information.

As shown in FIG. 6, the motion information and the target image are input into the coarse deformation module 33 to obtain the adjusted target image.

During the adjustment, the pixels of the target image are moved so as to deform the image and make the target object's body part exhibit the reference motion. To make the expression or posture exhibited by the target object more natural and realistic, the deformed regions of the coarsely deformed image require further processing.

In one example, after step 508, a driving effect image can be generated using the adjusted target image, the motion information and an image generation network (the image generation module 34). The image generation network may be a pre-trained neural network that refines the details of the adjusted target image with the help of the motion information, for example by removing noise, repairing missing content, adjusting brightness, or enhancing color.
In step 510, feature extraction is performed on the adjusted target image using the encoding network of the image generation network to obtain a feature map.

For example, the encoding network can be used to extract features from the adjusted target image to obtain the feature map, or the features may be extracted in other ways.

As shown in FIG. 6, the coarsely deformed image is input into the encoding network 341 of the image generation module 34 to obtain the feature map.

In step 512, the pixels in the feature map are adjusted based on the motion information to obtain an adjusted feature map.

For example, the feature map can be adjusted in the same way as the target image: the pixels of the feature map corresponding to pixels of the target image are moved by the displacements indicated by the motion information, yielding the adjusted feature map.

As shown in FIG. 6, the feature map and the motion information are input into the feature deformation unit 342 of the image generation module 34, which adjusts the pixels of the feature map and outputs the adjusted feature map.

As another example, the pixels of the target image that deform the most during the adjustment can first be identified, and only the corresponding pixels of the feature map adjusted, so that the adjustment is targeted.
To achieve such targeted adjustment, in one embodiment the motion information can be used to determine a mask corresponding to the target image, and the driving effect image is then generated using the adjusted target image, the motion information, the mask and the image generation network. The mask identifies how far each pixel moves when the pixels of the target image are adjusted according to the motion information, and it indicates to the image generation network which regions of the adjusted target image require focused optimization. Within the mask, pixels that need adjustment and pixels that do not can be distinguished by different labels. In some embodiments, the motion information is input into a mask generation network to obtain the mask corresponding to the target image; the mask may be an image of the same size as the target image.

The mask generation network may include a binary classifier that, according to the motion information, classifies each pixel as a large-movement pixel or a small-movement pixel. For example, when the confidence output by the binary classifier that a pixel is a large-movement pixel reaches a preset threshold, the pixel is determined to be a large-movement pixel; otherwise, it is determined to be a small-movement pixel. Large-movement pixels may be labeled 1 in the mask, and small-movement pixels labeled 0.

For example, suppose the reference motion of the body part is opening the mouth. When the target object's motion is adjusted to the reference motion, the pixels of the target image that move the most are those of the mouth region, so the mask corresponding to the target image may label the pixels of the mouth region 1 and the pixels of other regions 0.
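A minimal non-learned sketch of such a 0/1 mask, using a simple motion-magnitude threshold in place of the trained binary classifier (the threshold value is an assumption for illustration):

```python
import numpy as np

def motion_mask(flow, threshold=1.0):
    """Binary mask labeling large-movement pixels 1 and small-movement
    pixels 0.  A magnitude threshold stands in here for the mask
    generation network's learned binary classifier."""
    magnitude = np.linalg.norm(flow, axis=-1)   # per-pixel displacement length
    return (magnitude >= threshold).astype(np.float32)
```

The resulting mask has the same height and width as the target image, with 1s concentrated where the motion information moves pixels the most (e.g. the mouth region in the example above).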
After the mask is obtained, the pixels in the feature map can be adjusted based on the motion information and the mask.

For example, the pixels of the feature map corresponding to the regions labeled 1 in the mask can be adjusted, so that the subsequent decoding can complete and generate details for the heavily deformed parts of the target image while preserving the lightly deformed regions.

In practice, the mask can be fused with the feature map by the following formula, so that the feature map is adjusted according to the motion information:

adjusted feature map = feature map × mask + motion-deformed feature map × (1 − mask)  (1)

where the motion-deformed feature map is obtained by moving the pixels of the feature map corresponding to the target image by the displacements indicated by the motion information.
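Formula (1) is a straightforward per-pixel blend and can be written directly in code. The sketch below implements the equation exactly as stated, taking the already-deformed feature map as an input:

```python
import numpy as np

def fuse_features(feat, warped_feat, mask):
    """Formula (1): adjusted = feat * mask + warped_feat * (1 - mask),
    where warped_feat is the feature map deformed by the motion
    information and mask is the 0/1 map from the mask generation network.
    Broadcasts a (h, w) mask over (h, w, c) features if needed."""
    m = mask[..., None] if feat.ndim == mask.ndim + 1 else mask
    return feat * m + warped_feat * (1.0 - m)
```

Where the mask is 1, the original feature value is kept; where it is 0, the motion-deformed feature value is used, so the two terms partition the feature map per pixel.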
FIG. 7 shows yet another image driving model, in which the motion module 32 includes, in addition to the motion information calculation unit 320, a mask generation network 321. The motion information output by the motion information calculation unit 320 is input into the mask generation network 321 to obtain the mask corresponding to the target image. The mask, the feature map and the motion information are input into the feature deformation unit 342 of the image generation module 34, which adjusts the pixels of the feature map based on the motion information and the mask and outputs the adjusted feature map.

In step 514, the adjusted feature map is decoded by the decoding network of the image generation network to obtain the driving effect image.

For example, the decoding network can be used to decode the adjusted feature map to obtain the driving effect image, or the decoding may be performed in other ways. The encoding network and the decoding network in this embodiment may be convolutional neural networks.

As shown in FIG. 6, the adjusted feature map is input into the decoding network 343 of the image generation module 34, which outputs the driving effect image of the target object's body part exhibiting the reference motion. In the driving effect image, not only does the motion of the target object's body part match the reference motion, but, because details have been completed and generated for the heavily deformed parts, the body part is rendered with more complete detail and a more natural expression, and the image is closer to a genuine motion of the target object.

The image driving method provided by the embodiments of the present disclosure can drive a target object using a single driving reference image of a reference object, which simplifies the operation of driving the target object and effectively improves processing efficiency. While making the target object's body part in the target image exhibit the same motion as the reference object's body part in the driving reference image, the method also refines the details of the heavily deformed parts of the target image, so that the motion of the target object's body part appears more natural and realistic.

In addition, the image driving model structured as shown in FIG. 6 or FIG. 7 can still be trained with the training method described above for the model of FIG. 3. During training, the adjusted network parameter values of the image driving model may include the parameter values of at least one of the following networks: the body-part key point detection network in the key point detection module 31; the prediction neural network 3202 and the mask generation network 321 in the motion module 32; and the encoding network 341 and decoding network 343 in the image generation module 34.
FIG. 8 is a block diagram of an image driving apparatus according to at least one embodiment of the present disclosure. The apparatus includes an image acquisition module 81, a pixel motion module 82 and an image adjustment module 83.

The image acquisition module 81 is configured to acquire a target image and a driving reference image, the target image including a body part of a target object and the driving reference image including a body part of a reference object exhibiting a reference motion.

The pixel motion module 82 is configured to determine, based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object, motion information of a plurality of pixels in the target image, the motion information being used to adjust the motion of the target object's body part to the reference motion.

The image adjustment module 83 is configured to adjust a plurality of pixels in the target image according to the motion information to obtain a driving effect image in which the target object's body part exhibits the reference motion.

In the image driving apparatus provided by the embodiments of the present disclosure, the pixels of the target image are adjusted according to the correspondence between the key points of the target object's body part and those of the reference object's body part, directly deforming the target image so that it exhibits the same body-part motion as the driving reference image. A single driving reference image of a reference object thus suffices to drive the target object, without uploading a video of the target object, which simplifies the operation of driving the target object and effectively improves processing efficiency.
在一些实施例中,所述像素运动模块82,用于:对识别所述目标图像中所述目标对象的身体部位的各个关键点分别对应的第一位置;以及,识别所述驱动参考图像中所述参考对象的身体部位的各个关键点分别对应的第二位置;基于各个所述第一位置和各个所述第二位置之间的对应关系,确定所述目标图像中多个像素点的运动信息。In some embodiments, the pixel motion module 82 is configured to: identify the first positions corresponding to each key point of the body part of the target object in the target image; and identify the key points in the driving reference image Second positions corresponding to each key point of the body part of the reference object; based on the correspondence between each of the first positions and each of the second positions, determining the movement of multiple pixel points in the target image information.
在一些实施例中,所述运动信息包括所述多个像素点中各个像素点各自对应的位移;所述图像调整模块83,还用于:将所述目标图像中多个像素点按照各自对应的位移进行移动。In some embodiments, the motion information includes the corresponding displacements of each pixel in the plurality of pixels; the image adjustment module 83 is further configured to: adjust the plurality of pixels in the target image according to their respective displacements displacement to move.
在一些实施例中,所述图像调整模块83,用于:根据所述运动信息对所述目标图像中多个像素点进行调整,得到调整后的目标图像,所述调整后的目标图像为所述驱动效果图像;或者,根据所述运动信息对所述目标图像中多个像素点进行调整,得到调整后的目标图像,利用所述调整后的目标图像、所述运动信息以及图像生成网络,生成所述驱动效果图像。In some embodiments, the image adjustment module 83 is configured to: adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and the adjusted target image is the the driving effect image; or, adjust a plurality of pixels in the target image according to the motion information to obtain an adjusted target image, and use the adjusted target image, the motion information, and an image generation network, The driving effect image is generated.
在一些实施例中,所述图像生成网络包括编码网络和解码网络;所述图像调整模块 83,还用于:利用所述编码网络对所述调整后的目标图像进行特征提取,得到特征图;基于所述运动信息,对所述特征图中的像素点进行调整,得到调整后的特征图;利用所述解码网络对所述调整后的特征图进行解码处理,得到所述驱动效果图像。In some embodiments, the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: use the encoding network to perform feature extraction on the adjusted target image to obtain a feature map; Based on the motion information, the pixels in the feature map are adjusted to obtain an adjusted feature map; the decoding network is used to decode the adjusted feature map to obtain the driving effect image.
In some embodiments, as shown in FIG. 9, the apparatus further includes a mask generation module 84, configured to determine, using the motion information, a mask corresponding to the target image, the mask identifying the degree of movement of each pixel point when the plurality of pixel points in the target image are adjusted according to the motion information; the image adjustment module 83 is further configured to generate the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network.
In some embodiments, the image generation network includes an encoding network and a decoding network; the image adjustment module 83 is further configured to: perform feature extraction on the adjusted target image using the encoding network to obtain a feature map; adjust the pixel points in the feature map based on the motion information and the mask to obtain an adjusted feature map; and decode the adjusted feature map using the decoding network to obtain the driving effect image.
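One plausible reading of the mask, sketched below as an assumption: pixels the motion field moves a lot receive mask values near 1, static pixels near 0, and the mask then blends warped and original features. In the described apparatus the mask is produced by the mask generation module from the motion information, not necessarily by this formula:

```python
import numpy as np

def mask_from_flow(flow):
    """Soft mask from motion magnitude: values near 1 for strongly moving
    pixels, 0 for static ones (an illustrative assumption; the patent's
    mask comes from the mask generation module)."""
    mag = np.linalg.norm(flow, axis=-1)
    return mag / (mag.max() + 1e-8)

def blend_with_mask(warped_feat, original_feat, mask):
    """Adjust feature-map pixels using both motion and mask: strongly
    moving regions take the warped features, static regions keep the
    original ones."""
    return mask * warped_feat + (1.0 - mask) * original_feat
```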
In some embodiments, the apparatus is implemented by an image driving model trained from target sample images and driving sample images, where a target sample image includes a body part of a sample object presenting a first action, and a driving sample image includes the body part of the sample object presenting a second action. During training, the target sample image and the driving sample image are input into an initial image driving model, which outputs a training image of the body part of the sample object presenting a third action; the initial image driving model is adjusted based on the difference between the training image and the driving sample image, and the image driving model is obtained after training.
In some embodiments, the image acquisition module 81 is further configured to: acquire multiple frames of driving reference images from a driving video, the multiple frames including the body part of a same reference object, with the reference actions presented by that body part differing across the driving reference images; and acquire one frame of the driving reference image from the multiple frames.
In some embodiments, the image adjustment module 83 is further configured to: in response to multiple frames of driving effect images of the target object being obtained based on the multiple frames of driving reference images, generate a target video based on the multiple frames of driving effect images, where the actions of the body part of the target object in the target video are consistent with the actions of the body part of the reference object in the driving video. The number of driving effect image frames equals the number of driving reference image frames, and each driving effect image presents the reference action of the body part of the reference object in the corresponding driving reference image.
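Driving a whole video reduces to applying the single-image method once per driving reference frame, which also guarantees the output has exactly as many frames as the driving video. `drive_one` below stands for any single-image driving function (a placeholder name, not from the source):

```python
def drive_video(target_image, driving_frames, drive_one):
    """Produce one driving effect frame per driving reference frame, so the
    generated target video has the same frame count as the driving video."""
    return [drive_one(target_image, ref) for ref in driving_frames]
```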
For the implementation of the functions and effects of the modules in the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
The image driving method of at least one embodiment of the present disclosure may be executed by an electronic device, for example, a terminal device, a server, or another processing device, where the terminal device may include user equipment, a mobile device, a terminal, a cellular phone, a cordless phone, a personal digital assistant, a handheld device, a computing device, a vehicle-mounted device, a wearable device, and the like. In some possible implementations, the image driving method may be implemented by a processor invoking computer-readable instructions stored in a memory.
An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 10, the electronic device includes a memory 11 and a processor 12; the memory 11 is configured to store computer instructions executable on the processor, and the processor 12 is configured to implement the image driving method of any embodiment of the present disclosure when executing the computer instructions.
An embodiment of the present disclosure further provides a computer program product, including a computer program/instructions that, when executed by a processor, implement the image driving method of any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image driving method of any embodiment of the present disclosure.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant parts. The apparatus embodiments described above are merely illustrative; modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical modules, i.e., they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this specification. Those of ordinary skill in the art can understand and implement this without creative effort.
Some embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of this specification will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This specification is intended to cover any variations, uses, or adaptations thereof that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be considered exemplary only, with the true scope and spirit of the specification indicated by the following claims.
It should be understood that this specification is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely some embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (14)

  1. An image driving method, comprising:
    acquiring a target image and a driving reference image, the target image comprising a body part of a target object, and the driving reference image comprising a body part of a reference object presenting a reference action;
    determining motion information of a plurality of pixel points in the target image based on a correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action;
    adjusting the plurality of pixel points in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  2. The method according to claim 1, wherein determining the motion information of the plurality of pixel points in the target image based on the correspondence between the key points of the body part of the target object and the key points of the body part of the reference object comprises:
    identifying first positions respectively corresponding to the key points of the body part of the target object in the target image, and identifying second positions respectively corresponding to the key points of the body part of the reference object in the driving reference image;
    determining the motion information of the plurality of pixel points in the target image based on a correspondence between the first positions and the second positions.
  3. The method according to claim 1 or 2, wherein the motion information comprises a displacement corresponding to each of the plurality of pixel points;
    adjusting the plurality of pixel points in the target image according to the motion information to obtain the driving effect image comprises:
    moving the plurality of pixel points in the target image according to their respective displacements to obtain an adjusted target image, the adjusted target image being the driving effect image.
  4. The method according to any one of claims 1 to 3, wherein adjusting the plurality of pixel points in the target image according to the motion information to obtain the driving effect image comprises:
    adjusting the plurality of pixel points in the target image according to the motion information to obtain an adjusted target image;
    generating the driving effect image using the adjusted target image, the motion information, and an image generation network.
  5. The method according to claim 4, wherein the image generation network comprises an encoding network and a decoding network, and generating the driving effect image using the adjusted target image, the motion information, and the image generation network comprises:
    performing feature extraction on the adjusted target image using the encoding network to obtain a feature map;
    adjusting pixel points in the feature map based on the motion information to obtain an adjusted feature map;
    decoding the adjusted feature map using the decoding network to obtain the driving effect image.
  6. The method according to claim 4, further comprising:
    determining, using the motion information, a mask corresponding to the target image, the mask identifying the degree of movement of each pixel point when the plurality of pixel points in the target image are adjusted according to the motion information;
    wherein generating the driving effect image using the adjusted target image, the motion information, and the image generation network comprises:
    generating the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network.
  7. The method according to claim 6, wherein the image generation network comprises an encoding network and a decoding network, and generating the driving effect image using the adjusted target image, the motion information, the mask, and the image generation network comprises:
    performing feature extraction on the adjusted target image using the encoding network to obtain a feature map;
    adjusting pixel points in the feature map based on the motion information and the mask to obtain an adjusted feature map;
    decoding the adjusted feature map using the decoding network to obtain the driving effect image.
  8. The method according to any one of claims 1 to 7, wherein the method is executed by an image driving model trained from target sample images and driving sample images, wherein a target sample image comprises a body part of a sample object presenting a first action, and a driving sample image comprises the body part of the sample object presenting a second action; during training, the target sample image and the driving sample image are input into an initial image driving model, the initial image driving model outputs a training image of the body part of the sample object presenting a third action, the initial image driving model is adjusted based on a difference between the training image and the driving sample image, and the image driving model is obtained after training.
  9. The method according to any one of claims 1 to 8, wherein acquiring the driving reference image comprises:
    acquiring multiple frames of driving reference images from a driving video, the multiple frames of driving reference images comprising the body part of a same reference object, with the reference actions presented by the body part of the reference object differing across the driving reference images;
    acquiring one frame of the driving reference image from the multiple frames of driving reference images.
  10. The method according to claim 9, further comprising:
    in response to multiple frames of driving effect images of the target object being obtained based on the multiple frames of driving reference images, generating a target video based on the multiple frames of driving effect images, the actions of the body part of the target object in the target video being consistent with the actions of the body part of the reference object in the driving video, wherein the number of the multiple frames of driving effect images equals the number of the multiple frames of driving reference images, and the driving effect images respectively present the reference actions of the body part of the reference object in the corresponding driving reference images.
  11. An image driving apparatus, comprising:
    an image acquisition module, configured to acquire a target image and a driving reference image, the target image comprising a body part of a target object, and the driving reference image comprising a body part of a reference object presenting a reference action;
    a pixel motion module, configured to determine motion information of a plurality of pixel points in the target image based on a correspondence between key points of the body part of the target object and key points of the body part of the reference object, the motion information being used to adjust the action of the body part of the target object to the reference action;
    an image adjustment module, configured to adjust the plurality of pixel points in the target image according to the motion information to obtain a driving effect image, in which the body part of the target object presents the reference action.
  12. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method according to any one of claims 1 to 10 when executing the computer instructions.
  13. A computer program product, comprising a computer program/instructions that, when executed by a processor, implement the method according to any one of claims 1 to 10.
  14. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 10.
PCT/CN2022/134869 2022-02-17 2022-11-29 Image driving method and apparatus, device and medium WO2023155533A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210147579.9A CN114519727A (en) 2022-02-17 2022-02-17 Image driving method, device, equipment and medium
CN202210147579.9 2022-02-17

Publications (1)

Publication Number Publication Date
WO2023155533A1 true WO2023155533A1 (en) 2023-08-24

Family

ID=81599575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134869 WO2023155533A1 (en) 2022-02-17 2022-11-29 Image driving method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN114519727A (en)
WO (1) WO2023155533A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519727A (en) * 2022-02-17 2022-05-20 北京大甜绵白糖科技有限公司 Image driving method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120280974A1 (en) * 2011-05-03 2012-11-08 Microsoft Corporation Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN112766027A (en) * 2019-11-05 2021-05-07 广州虎牙科技有限公司 Image processing method, device, equipment and storage medium
CN113965773A (en) * 2021-11-03 2022-01-21 广州繁星互娱信息科技有限公司 Live broadcast display method and device, storage medium and electronic equipment
CN114519727A (en) * 2022-02-17 2022-05-20 北京大甜绵白糖科技有限公司 Image driving method, device, equipment and medium

Also Published As

Publication number Publication date
CN114519727A (en) 2022-05-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926837

Country of ref document: EP

Kind code of ref document: A1