CN114445529A - Human face image animation method and system based on motion and voice characteristics


Info

Publication number
CN114445529A
Authority
CN
China
Prior art keywords
image
feature
video
voice
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210115682.5A
Other languages
Chinese (zh)
Inventor
杨磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202210115682.5A
Publication of CN114445529A
Legal status: Pending (Current)

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a face image animation method and system based on motion and voice features. The method comprises an image-driven mode and a voice-driven mode. In the image-driven mode, a talking video of one face and a still picture of another person's face are input, and an animated video of the second person, who was originally only a static picture, is obtained. In the voice-driven mode, training is performed for a specific person; when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained. The invention can drive a target person in either a video-driven or an audio-driven manner, providing multiple driving modes that can meet a variety of requirements.

Description

Human face image animation method and system based on motion and voice characteristics
Technical Field
The invention belongs to the technical field of image animation generation, and in particular relates to a face image animation method and system based on motion and voice features.
Background
Image animation is widely used in film and television production, photography, e-commerce, and other fields. Specifically, given a character, the character can be "brought to life" by some driving means. If features are obtained from image data, the image features must be converted into facial or motion features and applied to the target face; if features are obtained from voice data, the voice features can be converted into facial features of the target face, so that a moving face of the target person is generated from those features.
Among three-dimensional methods in the image field, the conventional approach is to build a three-dimensional model of the target object, drive the model with a series of input motions, and obtain a motion video of the target object by placing a camera in the virtual space. This approach first requires three-dimensional modeling of the object and a large amount of prior information to constrain the model, and the final result is produced by computer graphics techniques; the modeling, projection, and rendering steps consume substantial computing resources. For two-dimensional methods, with the development of artificial intelligence in recent years, a large number of deep learning models have emerged for image generation, with Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) as representative examples. However, these methods generally require extensive labeling in advance, and the process cannot be generalized to arbitrary objects of the same category. To avoid the human cost of labeling and to apply the process to any object of the same category, Siarohin et al. proposed Monkey-Net, the first image animation method that works at the level of an object category, which generates the target-object animation by detecting the target object and driving it along the motion trajectory of video key points. This method uses only zeroth-order information of the mapping function, so the resulting images are not good enough. The First Order Motion Model proposed later uses first-derivative information of the motion trajectory, but to reduce training cost and increase data volume the original project used only low-resolution training data, so the generated results have poor resolution.
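For reference, the difference between the zeroth-order and first-order approaches can be made explicit; the notation below follows the cited First Order Motion Model paper and is supplied here only as background, not as part of this disclosure. Near each key point p_k detected in the driving frame D, the backward mapping to the source frame S is approximated by a first-order Taylor expansion:

```latex
T_{S \leftarrow D}(z) \;\approx\; T_{S \leftarrow D}(p_k) \;+\; J_k\,(z - p_k),
\qquad
J_k \;=\; \left.\frac{d}{dz}\,T_{S \leftarrow D}(z)\right|_{z = p_k}
```

A zeroth-order method such as Monkey-Net keeps only the constant term T_{S←D}(p_k), i.e. the key-point displacement itself, and therefore cannot represent local rotation or scaling around the key points.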
Therefore, how to provide a face image animation method and system based on motion and voice features has become an urgent problem for those skilled in the art.
Disclosure of Invention
In view of this, the invention provides both a video-driven and an audio-driven mode for driving the target person, so that a variety of requirements can be met.
In order to achieve the above purpose, the invention adopts the following technical solution:
a human face image animation method based on motion and voice characteristics comprises the following steps: an image driving mode and a voice driving mode; the image driving mode is as follows: inputting a talking video of one face and a face of another person to obtain a dynamic image video of the other person which is originally a static picture; the voice driving mode is as follows: training is carried out aiming at a specific figure, when the feature of another person is used for prediction, the feature is converted into the voice feature of the trained person in one step, the voice feature is converted into the face feature, and the face image animation is obtained.
Further, the image-driven mode comprises three steps: key point detection, motion extraction, and image generation.
Key point detection: an image of the target person and one frame of the driving video are input separately, and after passing through an encoder, a set of key points and the first-derivative information near each corresponding key point are obtained.
Motion extraction: the key points and first-derivative information obtained from the previous network are input to obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture; after feature integration, an occlusion map and a deformed picture are obtained.
Image generation: the occlusion map, the deformed picture, and the feature map of the target person are input together and decoded to obtain the animated video.
Further, the specific method of the voice-driven mode is as follows: first, features are extracted from the source audio; once the audio features are obtained, they are mapped according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space. A correspondence between the audio features and mouth-shape features is then established, and after the mouth features are obtained they are integrated with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face. Finally, the feature map is passed through image generation to obtain the face image animation.
A face image animation system based on motion and voice features comprises an image driving module and a voice driving module, wherein:
the image driving module is used to input a talking video of one face and a still picture of another person's face, and to obtain an animated video of the second person, who was originally only a static picture; and
the voice driving module is used to train for a specific person; when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained.
Furthermore, the image driving module comprises a key point detection unit, a motion extraction unit, and an image generation unit;
the key point detection unit is used to input an image of the target person and one frame of the driving video separately and, after an encoder, to obtain a set of key points and the first-derivative information near each corresponding key point;
the motion extraction unit is used to input the key points and first-derivative information obtained from the previous network, obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture, and, after feature integration, obtain an occlusion map and a deformed picture; and
the image generation unit inputs the occlusion map together with the deformed picture and the feature map of the target person, and decodes them to obtain the animated video.
Further, the voice driving module comprises a target audio feature extraction unit, a feature integration unit, and an image generation unit, wherein:
the target audio feature extraction unit is used to extract features from the source audio and, once the audio features are obtained, to map them according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space;
the feature integration unit is used to establish the correspondence between the audio features and mouth-shape features after the audio features are obtained and, after the mouth features are obtained, to integrate them with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face; and
the image generation unit is used to perform image generation on the feature map to obtain the face image animation.
Further, the system comprises a cloud server, which allows a user to upload a character image, audio, and motion video; after receiving the request, the cloud server automatically computes the corresponding result and returns it to the user.
The invention has the beneficial effects that:
1. The invention provides both a video-driven and an audio-driven mode for driving the target person, so that a variety of requirements can be met. The output resolution of the network reaches 512 x 512, so a relatively high-definition face video can be obtained.
2. Through the cloud service scheme, a user can upload images, audio, and driving videos and obtain the target video by remote computation, avoiding the problem of having no local graphics-card resources.
Drawings
In order to illustrate the present invention or the prior-art technical solutions more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a method for keypoint detection according to the present invention.
FIG. 2 is a flow chart of a method for extracting actions according to the present invention.
FIG. 3 is a flow chart of a method of image generation according to the present invention.
FIG. 4 is a flow chart of a voice-driven method of the present invention.
Fig. 5 shows the result generated for the inventor's portrait under the driving video.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to Figs. 1-4, the present invention provides a face image animation method based on motion and voice features, comprising an image-driven mode and a voice-driven mode. In the image-driven mode, given a video of one person talking, the motion can be fully transferred onto another person's face: the talking video of one face and a still picture of another person's face are input, and an animated video of the second person, who was originally only a static picture, is obtained. In the voice-driven mode, training is performed for a specific person; when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained.
Note that the image driving method of the present invention is fundamentally different from the popular DeepFake face-swapping technique. Face swapping inserts a new face into an existing video, whereas the method of the invention abstracts the motion from an existing video and transfers it onto another face, thereby obtaining a video of a different character performing the same motion in a new scene.
The invention uses face videos for training, with lengths ranging from about ten seconds to over a minute. Because it is difficult to obtain two videos with identical motion, training is performed in a self-supervised manner: the first frame of a video is used as the input face picture, the motion in the remaining frames is extracted as driving data, and the face and the driving motion data are fed into the generation network to obtain the generated video. Building on the original work, the invention doubles the original model, whose input and output are both 256 pixels, so that the final output video is 512 pixels. To have enough training data, based on the original video dataset, whether the size of a cropped video exceeds 512 pixels is determined from the square region of the face; videos exceeding this size are kept, while videos below it are interpolated, and in the end about 9000 videos are prepared for training and testing. Key points are extracted separately for the target person and the driving person; the second module, motion extraction, then obtains the first-derivative information near the key points from the extracted key points, yielding the trajectory equation and a single-channel occlusion decision. Finally, all of this information is combined with the input target-person image and fed into the generation module, and the final result is obtained through an encoder-decoder structure.
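As an illustration of the data-preparation step just described, the sketch below crops a square face region from each video frame and brings it to the 512-pixel working size, keeping larger crops and interpolating smaller ones. The OpenCV Haar-cascade face detector, the crop placement, and the function interface are assumptions made for this example and are not part of the disclosed training pipeline.

```python
import cv2

TARGET = 512  # working resolution of the high-resolution model

def prepare_clip(frames):
    """Crop a square face region from every frame of one training video
    (a list of BGR images) and bring it to TARGET x TARGET pixels.
    Returns the processed frames, or None if no face is found."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    side = max(w, h)                                      # square crop size
    x0 = max(x + w // 2 - side // 2, 0)
    y0 = max(y + h // 2 - side // 2, 0)

    out = []
    for f in frames:
        crop = f[y0:y0 + side, x0:x0 + side]
        if side >= TARGET:
            # crop is large enough: keep it and down-sample to exactly 512
            crop = cv2.resize(crop, (TARGET, TARGET), interpolation=cv2.INTER_AREA)
        else:
            # crop is too small: up-sample by interpolation, as described above
            crop = cv2.resize(crop, (TARGET, TARGET), interpolation=cv2.INTER_CUBIC)
        out.append(crop)
    return out
```

In this sketch, clips whose face region is at least 512 pixels on a side are simply kept (and normalized to 512), while smaller ones are interpolated up, mirroring how the roughly 9000 training and test videos are said to be prepared.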
The image-driven mode comprises three steps: key point detection, motion extraction, and image generation.
Key point detection: an image of the target person and one frame of the driving video are input separately, and after passing through an encoder, a set of key points and the first-derivative information near each corresponding key point are obtained.
Motion extraction: the key points and first-derivative information obtained from the previous network are input to obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture; after feature integration, an occlusion map and a deformed picture are obtained.
Image generation: the occlusion map, the deformed picture, and the feature map of the target person are input together and decoded to obtain the animated video.
Fig. 1 shows the key point detection network: an image of the target person and one frame of the driving video are input, and after the encoder, 10 key points and the first-derivative information near each of the 10 corresponding key points are obtained. In the motion extraction of Fig. 2, the key point and first-derivative information obtained from the previous network are input to obtain the deformation field from the target person to the driving-video frame and a down-sampled feature map used to judge whether occlusion occurs. Finally, in the image generation of Fig. 3, the results from Fig. 2 and the image of the target person are fed into the image generation module, and the final result is obtained after the decoding module.
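The following is a deliberately simplified, self-contained sketch of the three-stage image-driven pipeline (key point detection, motion extraction, image generation). The 10-key-point setting, the first-derivative (Jacobian) output, the occlusion map, and the warping-plus-decoding generator mirror the description above, but the layer sizes and module structure are illustrative assumptions, not the exact patented network; in particular, the dense-motion step keeps only the zeroth-order displacement term for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointDetector(nn.Module):
    """Predicts K key points plus a 2x2 Jacobian (first-derivative info) per point."""
    def __init__(self, k=10):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, k * (2 + 4))            # (x, y) + 2x2 Jacobian

    def forward(self, img):
        out = self.head(self.encoder(img).flatten(1)).view(-1, self.k, 6)
        kp = torch.tanh(out[..., :2])                     # key points in [-1, 1]
        jac = out[..., 2:].reshape(-1, self.k, 2, 2)      # local first-derivative info
        return kp, jac

def dense_motion(kp_src, kp_drv, size):
    """Builds a dense backward deformation field from key-point displacement
    (zeroth-order term only, to keep the sketch short)."""
    b = kp_src.shape[0]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                            torch.linspace(-1, 1, size), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).expand(b, size, size, 2)
    shift = (kp_src - kp_drv).mean(dim=1)                 # crude global displacement
    return identity + shift.view(b, 1, 1, 2)

class Generator(nn.Module):
    """Warps source features with the deformation field, applies the occlusion
    map, and decodes the animated frame."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Conv2d(3, 64, 3, padding=1)
        self.occlusion = nn.Conv2d(64, 1, 3, padding=1)
        self.decode = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, source, flow):
        feat = F.relu(self.encode(source))
        warped = F.grid_sample(feat, flow, align_corners=True)   # deformed features
        occ = torch.sigmoid(self.occlusion(warped))              # occlusion map
        return torch.sigmoid(self.decode(warped * occ))

# One prediction step: a still source portrait driven by one driving-video frame.
kp_net, gen = KeypointDetector(), Generator()
source = torch.rand(1, 3, 512, 512)
driving = torch.rand(1, 3, 512, 512)
kp_s, _ = kp_net(source)
kp_d, _ = kp_net(driving)
flow = dense_motion(kp_s, kp_d, 512)
frame = gen(source, flow)            # one frame of the generated 512 x 512 video
```

Repeating the last four lines over every frame of the driving video yields the animated sequence described in Figs. 1-3.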
During training, the invention obtains the character image and the motion data from the same video; in the prediction stage, an image of the same category as, but different from, the motion video can be used as input. Fig. 5 shows the result of generating the character image under the driving video.
In Fig. 5, the first row shows frames of the existing driving video, and the second and third rows show the results generated from only a single static frontal portrait each. It can be seen that the scheme of the invention accurately generates high-definition results with the same expression and head pose as the driving video. In tests, the generation speed of this face image animation method on an Nvidia 3090 graphics card exceeds 25 fps, which means that face image animation can be generated in real time.
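For readers who want to obtain a throughput figure of this kind for their own setup, a rough measurement sketch follows; the model interface and inputs are placeholders, and the 25 fps figure on the Nvidia 3090 is simply the result reported above, not something this snippet asserts.

```python
import time
import torch

def measure_fps(model, source, driving_frames):
    """Times how many generated frames per second one source image yields."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    source = source.to(device)
    with torch.no_grad():
        if device == "cuda":
            torch.cuda.synchronize()              # exclude setup time
        start = time.time()
        for frame in driving_frames:
            model(source, frame.to(device))       # one generated frame per call
        if device == "cuda":
            torch.cuda.synchronize()              # wait for all GPU work to finish
    return len(driving_frames) / (time.time() - start)
```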
In the image-driven mode, the core of the method is to extract motion features with a deep network and apply them to another face; it is essentially a two-dimensional affine image transformation and uses no three-dimensional information. This has obvious shortcomings: when the extracted motion is not accurate enough, the generated image shows obvious face distortion, greatly affecting usability, and the results on profile (side-facing) faces are also unsatisfactory. Therefore, the invention introduces a three-dimensional face model, converts the face image into face key-point parameters, predicts the change of those parameters from voice features, and generates an image consistent with the audio through the generation module, compensating for the shortcomings of driving by image alone.
Unlike the image-driven mode, because each person's voice has different characteristics, the face key-point parameters of one person cannot be well recovered from another person's voice. The method therefore needs to be trained for a specific person, and when another person's features are used for prediction, those features must first be converted into the voice features of the trained person. The whole flow is shown in Fig. 4.
The specific method of the voice-driven mode is as follows: first, features are extracted from the source audio; once the audio features are obtained, they are mapped according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space. A correspondence between the audio features and mouth-shape features is then established, and after the mouth features are obtained they are integrated with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face. Finally, the feature map is passed through image generation to obtain the face image animation.
Since the lip shape changes most while a person is talking, the audio features of the voice are closely related to the lip shape. Therefore, by extracting voice features, the invention establishes a mapping between voice features and lip features, so that a talking video of the corresponding person can be generated from the lip changes. The head and shoulder pose of the person in the generated result can be adjusted by presetting it or by establishing a weak relationship with the voice features.
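A compact sketch of the voice-driven flow of Fig. 4 is given below: audio features are mapped into the trained speaker's feature space, converted to mouth-shape features, merged with sampled eye, eyebrow, and head-pose parameters, and handed to the image generator. The module names, feature sizes, MFCC front end, and file path are assumptions made for this example, not the disclosed model.

```python
import librosa
import torch
import torch.nn as nn

class VoiceToFace(nn.Module):
    def __init__(self, n_mfcc=40, mouth_dim=20, pose_dim=10):
        super().__init__()
        # maps source-speaker audio features into the trained speaker's space
        self.speaker_map = nn.Linear(n_mfcc, n_mfcc)
        # maps the mapped audio features to per-frame mouth-shape parameters
        self.audio_to_mouth = nn.GRU(n_mfcc, mouth_dim, batch_first=True)
        # fuses mouth, eye/eyebrow and head-pose parameters into a face feature
        self.fuse = nn.Linear(mouth_dim + pose_dim, 64)

    def forward(self, mfcc, pose_params):
        mapped = torch.tanh(self.speaker_map(mfcc))       # trained-speaker space
        mouth, _ = self.audio_to_mouth(mapped)            # mouth shape per frame
        return self.fuse(torch.cat([mouth, pose_params], dim=-1))

# Usage: extract features from the source audio, sample (or preset) the head and
# eye parameters, and obtain per-frame face features for the image generator.
wav, sr = librosa.load("source_audio.wav", sr=16000)       # path is a placeholder
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40).T     # (frames, 40)
mfcc = torch.from_numpy(mfcc).float().unsqueeze(0)         # (1, frames, 40)
pose = torch.zeros(1, mfcc.shape[1], 10)                   # sampled/preset pose params
face_features = VoiceToFace()(mfcc, pose)                  # input to image generation
```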
Example 2
This embodiment provides a face image animation system based on motion and voice features, comprising an image driving module and a voice driving module, wherein:
the image driving module is used to input a talking video of one face and a still picture of another person's face, and to obtain an animated video of the second person, who was originally only a static picture; and
the voice driving module is used to train for a specific person; when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained.
The image driving module comprises a key point detection unit, a motion extraction unit, and an image generation unit. The key point detection unit is used to input an image of the target person and one frame of the driving video separately and, after an encoder, to obtain a set of key points and the first-derivative information near each corresponding key point. The motion extraction unit is used to input the key points and first-derivative information obtained from the previous network, obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture, and, after feature integration, obtain an occlusion map and a deformed picture. The image generation unit inputs the occlusion map together with the deformed picture and the feature map of the target person, and decodes them to obtain the animated video.
The voice driving module comprises a target audio feature extraction unit, a feature integration unit, and an image generation unit. The target audio feature extraction unit is used to extract features from the source audio and, once the audio features are obtained, to map them according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space. The feature integration unit is used to establish the correspondence between the audio features and mouth-shape features after the audio features are obtained and, after the mouth features are obtained, to integrate them with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face. The image generation unit is used to perform image generation on the feature map to obtain the face image animation.
Because a GPU is used to accelerate computation in both the training and prediction stages, and not every user has a high-end graphics card, the invention further comprises a cloud server: the user is allowed to upload a character image, audio, and motion video to the cloud server, which automatically computes the corresponding result after receiving the request and returns it to the user, thereby providing a remote solution for users without graphics-card resources.
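A minimal sketch of this cloud-service idea follows: the user uploads a portrait plus either a driving video or an audio file, the server runs the generation pipeline on its own GPU, and the resulting video is returned. The endpoint name, the Flask framework, and the run_pipeline helper are illustrative assumptions, not part of the disclosure.

```python
import tempfile
from flask import Flask, request, send_file

app = Flask(__name__)

def run_pipeline(image_path, driver_path):
    """Placeholder for the image-/voice-driven generation described above;
    it should write the generated video and return its file path."""
    raise NotImplementedError

@app.route("/animate", methods=["POST"])
def animate():
    # expects multipart form data: 'image' plus either 'video' or 'audio'
    image = request.files["image"]
    driver = request.files.get("video") or request.files["audio"]
    with tempfile.TemporaryDirectory() as tmp:
        image_path = f"{tmp}/portrait.png"
        driver_path = f"{tmp}/driver.bin"
        image.save(image_path)
        driver.save(driver_path)
        result_path = run_pipeline(image_path, driver_path)   # GPU work on the server
        return send_file(result_path, mimetype="video/mp4")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```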
The invention provides both a video-driven and an audio-driven mode for driving the target person, so that a variety of requirements can be met. The output resolution of the network reaches 512 x 512, so a relatively high-definition face video can be obtained.
Through the cloud service scheme, a user can upload images, audio, and driving videos and obtain the target video by remote computation, avoiding the problem of having no local graphics-card resources.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A face image animation method based on motion and voice features, characterized by comprising an image-driven mode and a voice-driven mode, wherein in the image-driven mode, a talking video of one face and a still picture of another person's face are input, and an animated video of the second person, who was originally only a static picture, is obtained; and in the voice-driven mode, training is performed for a specific person, and when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained.
2. The face image animation method based on motion and voice features according to claim 1, characterized in that the image-driven mode comprises three steps: key point detection, motion extraction, and image generation;
key point detection: an image of the target person and one frame of the driving video are input separately, and after passing through an encoder, a set of key points and the first-derivative information near each corresponding key point are obtained;
motion extraction: the key points and first-derivative information obtained from the previous network are input to obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture, and after feature integration an occlusion map and a deformed picture are obtained; and
image generation: the occlusion map, the deformed picture, and the feature map of the target person are input together and decoded to obtain the animated video.
3. The face image animation method based on motion and voice features according to claim 1, characterized in that the specific method of the voice-driven mode is as follows: first, features are extracted from the source audio, and once the audio features are obtained, they are mapped according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space; a correspondence between the audio features and mouth-shape features is then established, and after the mouth features are obtained they are integrated with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face; and finally, the feature map is passed through image generation to obtain the face image animation.
4. A face image animation system based on motion and voice features, characterized by comprising an image driving module and a voice driving module, wherein:
the image driving module is used to input a talking video of one face and a still picture of another person's face, and to obtain an animated video of the second person, who was originally only a static picture; and
the voice driving module is used to train for a specific person; when another person's features are used for prediction, those features are first converted into the voice features of the trained person, then converted into facial features, and the face image animation is obtained.
5. The face image animation system based on motion and voice features according to claim 4, characterized in that the image driving module comprises a key point detection unit, a motion extraction unit, and an image generation unit;
the key point detection unit is used to input an image of the target person and one frame of the driving video separately and, after an encoder, to obtain a set of key points and the first-derivative information near each corresponding key point;
the motion extraction unit is used to input the key points and first-derivative information obtained from the previous network, obtain a deformation field from the target person to the driving-video frame and a down-sampled, dimension-reduced source picture, and, after feature integration, obtain an occlusion map and a deformed picture; and
the image generation unit inputs the occlusion map together with the deformed picture and the feature map of the target person, and decodes them to obtain the animated video.
6. The face image animation system based on motion and voice features according to claim 4, characterized in that the voice driving module comprises a target audio feature extraction unit, a feature integration unit, and an image generation unit, wherein:
the target audio feature extraction unit is used to extract features from the source audio and, once the audio features are obtained, to map them according to the voice characteristics of the trained person so as to find their expression in the trained person's feature space;
the feature integration unit is used to establish the correspondence between the audio features and mouth-shape features after the audio features are obtained and, after the mouth features are obtained, to integrate them with sampled eye, eyebrow, and head-pose parameters to obtain a feature map of the whole face; and
the image generation unit is used to perform image generation on the feature map to obtain the face image animation.
7. The face image animation system based on motion and voice features according to claim 4, characterized by further comprising a cloud server, which allows the user to upload a character image, audio, and motion video; after receiving the request, the cloud server automatically computes the corresponding result and returns it to the user.
CN202210115682.5A 2022-02-08 2022-02-08 Human face image animation method and system based on motion and voice characteristics Pending CN114445529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210115682.5A CN114445529A (en) 2022-02-08 2022-02-08 Human face image animation method and system based on motion and voice characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210115682.5A CN114445529A (en) 2022-02-08 2022-02-08 Human face image animation method and system based on motion and voice characteristics

Publications (1)

Publication Number Publication Date
CN114445529A (en) 2022-05-06

Family

ID=81371790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210115682.5A Pending CN114445529A (en) 2022-02-08 2022-02-08 Human face image animation method and system based on motion and voice characteristics

Country Status (1)

Country Link
CN (1) CN114445529A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 Voice-driven lip-sync face video synthesis algorithm based on cascaded convolutional LSTM
CN113379874A (en) * 2020-02-25 2021-09-10 武汉Tcl集团工业研究院有限公司 Face animation generation method, intelligent terminal and storage medium
CN113920230A (en) * 2021-09-15 2022-01-11 上海浦东发展银行股份有限公司 Character image video generation method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALIAKSANDR SIAROHIN et al.: "First Order Motion Model for Image Animation", 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) *
ZHIMENG ZHANG et al.: "Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset", IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *

Similar Documents

Publication Publication Date Title
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
CN112967212A (en) Virtual character synthesis method, device, equipment and storage medium
WO2022148083A1 (en) Simulation 3d digital human interaction method and apparatus, and electronic device and storage medium
CN103650002B (en) Text based video generates
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN110446000B (en) Method and device for generating dialogue figure image
CN110751708B (en) Method and system for driving face animation in real time through voice
CN112887698A (en) High-quality face voice driving method based on nerve radiation field
JP2009533786A (en) Self-realistic talking head creation system and method
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
JP2014519082A5 (en)
CN102271241A (en) Image communication method and system based on facial expression/action recognition
Mattos et al. Improving CNN-based viseme recognition using synthetic data
CN112785670A (en) Image synthesis method, device, equipment and storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
Huang et al. Perceptual conversational head generation with regularized driver and enhanced renderer
CN113223125A (en) Face driving method, device, equipment and medium for virtual image
CN112183430A (en) Sign language identification method and device based on double neural network
CN116229311B (en) Video processing method, device and storage medium
CN114445529A (en) Human face image animation method and system based on motion and voice characteristics
CN113706673A (en) Cloud rendering framework platform applied to virtual augmented reality technology
CN116863043A (en) Face dynamic capture driving method and device, electronic equipment and readable storage medium
WO2023239041A1 (en) Creating images, meshes, and talking animations from mouth shape data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220506)