CN114821734A - Method and device for driving expression of virtual character - Google Patents

Method and device for driving expression of virtual character

Info

Publication number
CN114821734A
Authority
CN
China
Prior art keywords
face
face image
key points
expression
expression coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210518926.4A
Other languages
Chinese (zh)
Inventor
王海新
杜峰
吴朝阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210518926.4A
Publication of CN114821734A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G06T7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Abstract

The invention discloses a method and a device for driving the expression of a virtual character, and relates to the field of computer technology. One embodiment of the method comprises: acquiring a face image and extracting key points of the face image to obtain face key points; inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and driving the expression of the virtual character according to the expression coefficient value. This embodiment can drive the virtual character quickly, realistically, naturally and accurately, with extremely low network consumption, a high driving speed and strong generalization.

Description

Method and device for driving expression of virtual character
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for driving the expression of a virtual character.
Background
With the rapid development of artificial intelligence technology, virtual characters have become very popular products in the market. Using live virtual images for social interaction, performance, live streaming and the like has become a popular trend on the internet. A cartoon avatar customized by the user and used to represent the user guarantees a satisfactory appearance, protects personal privacy to a great extent, and allows the user's expressions and actions to be captured in real time, which is why such avatars are widely pursued. Virtual characters have various presentation forms, which mainly depend on different driving modes, such as voice driving, text driving and image driving.
At present, the mainstream avatar expression driving approach is to drive the virtual character with an expression capture technology. However, this approach places high demands on network stability and bandwidth and consumes considerable traffic; if network stability is low or bandwidth is small, the timeliness and fluency of the driving are affected. Moreover, the resulting expressions of the virtual character are not sufficiently realistic, natural and accurate, which degrades the audience experience.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for driving the expression of a virtual character, which can drive the virtual character quickly, realistically, naturally and accurately, with extremely low network consumption, a fast driving speed and strong generalization.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of driving an expression of a virtual character, including:
acquiring a face image, and extracting key points of the face image to obtain face key points;
inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image;
and performing expression driving on the virtual character according to the expression coefficient value.
Optionally, the extracting the key points of the face image includes: the face image is input into a pre-trained face key point recognition model for key point extraction, and the face key point recognition model is obtained by training the face image subjected to key point labeling.
Optionally, before inputting the face key points into a pre-trained expression coefficient model, the method further includes: and limiting the face image into a picture with a specified size, and carrying out normalization processing on the face key points.
Optionally, the normalizing the face key points includes: acquiring the horizontal coordinate and the vertical coordinate of the central point of the key point of the face as well as the height and the width of the face image; for each face key point, calculating the normalized abscissa of the face key point according to the abscissa of the face key point, the abscissa of the central point and the height and width of a face image, and calculating the normalized ordinate of the face key point according to the ordinate of the face key point, the ordinate of the central point and the height and width of the face image; and translating the central point of the face key point to the upper left corner of the face image with the specified size so as to translate the normalized horizontal coordinate and the normalized vertical coordinate, thereby completing the normalization processing of the face key point.
Optionally, the expression coefficient model is obtained by training in the following manner: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and learning the face key points corresponding to the marked face image to obtain the expression coefficient model.
Optionally, the pre-trained expression coefficient model comprises a blink expression coefficient model and a non-blink expression coefficient model; inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image, wherein the expression coefficient value comprises: extracting eye key points from the face key points, and inputting the eye key points into the blink expression coefficient model to obtain blink expression coefficient values corresponding to the face image; inputting the face key points into the non-blink expression coefficient model to obtain a non-blink expression coefficient value corresponding to the face image; and obtaining an expression coefficient value corresponding to the face image according to the blink expression coefficient value and the non-blink expression coefficient value.
Optionally, the blink expression coefficient model is trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and extracting eye key points from the face key points corresponding to the marked face image, and learning the eye key points through a deep learning optimization algorithm to obtain the blink expression coefficient model.
Optionally, the non-blinking expression coefficient model is trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and carrying out coordinate format conversion on the face key points corresponding to the marked face image to obtain a square matrix, and learning the square matrix through a deep learning optimization algorithm to obtain the non-blinking expression coefficient model.
Optionally, when the non-blinking expression coefficient model is trained, the loss function includes a first loss function corresponding to a full-face expression except for a blinking expression and a second loss function corresponding to a mouth-opening expression.
Optionally, before the coordinate format of the face key point corresponding to the marked face image is transformed to obtain a square matrix, the method further includes: and under the condition that the coordinates of the face key points corresponding to the marked face image cannot be directly converted into the square matrix, supplementing by using a value of 0 until a key point coordinate set consisting of the coordinates of the face key points corresponding to the marked face image and the value of 0 can be subjected to coordinate format conversion to obtain the square matrix.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for driving an expression of a virtual character, including:
the key point acquisition module is used for acquiring a face image and extracting key points of the face image to obtain face key points;
the coefficient value calculation module is used for inputting the face key points into a pre-trained expression coefficient model to obtain expression coefficient values corresponding to the face image;
and the expression driving module is used for performing expression driving on the virtual character according to the expression coefficient value.
According to another aspect of the embodiments of the present invention, there is provided an electronic device for driving an expression of a virtual character, including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the method for driving the expression of the virtual character provided by the embodiment of the invention.
According to still another aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method for driving the expression of a virtual character provided by the embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: a face image is acquired and key points are extracted from it to obtain face key points; the face key points are input into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and the virtual character is expression-driven according to the expression coefficient value. By learning the correspondence between the position coordinates of the face key points and the expression coefficients, the expression of the virtual character is driven so that the virtual character model shows the same expression as the real person. The virtual character can thus be driven realistically, naturally, accurately and quickly, with low network consumption, a high driving speed and strong generalization.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for driving the expression of a virtual character according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a face key point recognition result according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of normalized face key point results obtained by performing normalization processing on face key points according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a network structure of a blink expression coefficient model according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a network structure of a non-blinking expression coefficient model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the main modules of an apparatus for driving the expressions of a virtual character according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the invention, the data acquisition, storage, use, processing and the like all conform to relevant regulations of national laws and regulations.
In order to solve the technical problems in the prior art, the invention provides a method and a device for driving the expression of a virtual character, which drive the expression of the virtual character by learning the correspondence between the position coordinates of face key points and expression coefficients (in the embodiment of the invention, for example, morph targets; a morph target, i.e. a blendshape, is a parameter capable of driving a 3D character model and can be given different weights, and the different weights determine the degree to which an action such as opening or closing is performed). The invention mainly learns an expression coefficient model from the input image information to drive the expression of the model, so that the model shows the same expression as the real person. The method can drive the virtual character realistically, naturally and accurately, and has extremely low network consumption, a high driving speed and strong generalization.
The method for driving the expression of the virtual character comprises the steps of firstly obtaining a face image through a camera, outputting an expression coefficient value as an expression control parameter through an expression coefficient model, and further driving a virtual character 3D model to display the expression.
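As a rough illustration of this overall loop, the following Python sketch strings the three stages together; the callables passed in (camera capture, key point extraction, the expression coefficient model and the avatar-driving callback) are placeholders for components described later, not APIs defined by the patent.

from typing import Callable
import numpy as np

def drive_avatar_frame(
    get_frame: Callable[[], np.ndarray],                    # e.g. a camera capture wrapper
    extract_keypoints: Callable[[np.ndarray], np.ndarray],  # step S101: image -> face key points
    expression_model: Callable[[np.ndarray], np.ndarray],   # step S102: key points -> coefficient values
    apply_coefficients: Callable[[np.ndarray], None],       # step S103: drive the avatar's blendshapes
) -> np.ndarray:
    """Run one frame of the camera -> key points -> coefficients -> avatar loop."""
    frame = get_frame()                        # face image from the camera
    keypoints = extract_keypoints(frame)       # face key point coordinates
    coefficients = expression_model(keypoints) # expression control parameters
    apply_coefficients(coefficients)           # display the expression on the 3D model
    return coefficients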
Fig. 1 is a schematic diagram of the main steps of a method for driving the expression of a virtual character according to an embodiment of the present invention. As shown in fig. 1, the method for driving the expression of the virtual character according to the embodiment of the present invention mainly includes the following steps S101 to S103.
Step S101: a face image is acquired, and key points are extracted from the face image to obtain face key points. According to the technical scheme of the invention, a person photo can be captured by a camera and processed to obtain the face image, for example by performing object detection and matting on the person photo. Key points are then extracted from the face image to obtain the face key points; the extraction can be performed with an existing face key point recognition model. After the face key points are obtained, the coordinate values corresponding to each face key point can be obtained, including an abscissa x and an ordinate y. When any processing operation is performed on the face key points, it is applied to all of their coordinate values.
In an embodiment of the present invention, the key points of the face image are extracted, for example, as follows: the face image is input into a pre-trained face key point recognition model for key point extraction, and the face key point recognition model is obtained by training on face images annotated with key points. The face key point recognition model is a deep learning model; to ensure the generalization of the model, many face pictures can be collected, a training data set is formed by annotating the key points of each face picture, and the face key point recognition model is then trained through deep learning. Fig. 2 is a schematic diagram of a face key point recognition result according to an embodiment of the present invention. In the embodiment of the present invention, as shown in fig. 2, a 300-point face key point recognition model is selected to extract the face key points; a total of 300 face key points are extracted, giving 600 coordinate values when the abscissa x and the ordinate y of the face key points are combined.
Step S102: and inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image. The expression coefficient model is used for acquiring the facial expression according to the facial key points and further generating an expression coefficient value for driving the expression of the virtual character.
According to an embodiment of the present invention, before inputting the face key points into a pre-trained expression coefficient model, the method further includes: and limiting the face image into a picture with a specified size, and carrying out normalization processing on the face key points. By carrying out normalization processing on the key points of the face, subsequent data processing can be conveniently carried out, the complexity of the model is simplified, and the operation efficiency of the model is improved.
According to the technical scheme of the embodiment of the invention, the normalization processing of the face key points specifically comprises the following steps: acquiring the horizontal coordinate and the vertical coordinate of the central point of the key point of the face as well as the height and the width of the face image; for each face key point, calculating the normalized abscissa of the face key point according to the abscissa of the face key point, the abscissa of the central point and the height and width of a face image, and calculating the normalized ordinate of the face key point according to the ordinate of the face key point, the ordinate of the central point and the height and width of the face image; and translating the central point of the face key point to the upper left corner of the face image with the specified size so as to translate the normalized horizontal coordinate and the normalized vertical coordinate, thereby completing the normalization processing of the face key point.
In the embodiment of the present invention, denote the x coordinates of all face key points as x_points and the y coordinates of all face key points as y_points; the whole normalization process is then as follows:
solving the minimum value of the abscissa of all the face key points:
min_x_val=min(x_points) (1)
solving the maximum value of the abscissa of all the face key points:
max_x_val=max(x_points) (2)
solving the minimum value of the ordinate of all the face key points:
min_y_val=min(y_points) (3)
solving the maximum value of the vertical coordinates of all the face key points:
max_y_val=max(y_points) (4)
obtaining the width of the face image according to the maximum value and the minimum value of the abscissa:
width=max_x_val-min_x_val (5)
obtaining the height of the face image according to the maximum value and the minimum value of the vertical coordinate:
height=max_y_val-min_y_val (6)
solving the horizontal coordinates of the central points of all the face key points:
mean_x=(max_x_val+min_x_val)/2 (7)
solving the vertical coordinates of the central points of all the face key points:
mean_y=(max_y_val+min_y_val)/2 (8)
and subtracting the abscissa of the central point from the abscissas of all the face key points, and updating the abscissas of the face key points:
x_points=x_points-mean_x (9)
and (3) subtracting the ordinate of the central point from the ordinate of all the face key points, and updating the ordinate of the face key points:
y_points=y_points-mean_y (10)
through formulas (9) and (10), the face key points are translated so that their center point lies at the origin, i.e. the coordinates are centered on the face;
calculating the normalized abscissa of each face key point:
x_points=x_points/max(width,height)*m_width*0.9 (11)
calculating the normalized ordinate of each face key point:
y_points=y_points/max(width,height)*m_height*0.9 (12)
according to formulas (11) and (12), normalization is carried out by dividing the horizontal and vertical coordinates of the face key points by the maximum of the height and the width of the face image, scaling to the preset image size, and multiplying by a coefficient of 0.9 to keep the points away from the image edge, thereby obtaining the normalized abscissa and ordinate of each face key point;
translating the normalized abscissa of each face key point:
x_points=x_points+(m_width/2) (13)
translating the normalized ordinate of each face key point:
y_points=y_points+(m_height/2) (14)
through formulas (13) and (14), the coordinates are translated relative to the upper left corner of the specified picture, so that the horizontal coordinate values of the face key points lie in (0, m_width) and the vertical coordinate values lie in (0, m_height).
Fig. 3 is a schematic diagram of the normalized face key points obtained by normalizing the face key points according to an embodiment of the present invention. The positions of the white pixels in the image are the coordinate values of the extracted face key points. In the embodiment of the present invention, m_width may be 96 and m_height may be 96; that is, regardless of the resolution of the input image, the coordinate values are normalized to the range 0 to 96 and, after dividing by 96, to the range 0 to 1 before being input into the expression coefficient model for training, which reduces the complexity of data processing and improves the model training efficiency.
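The normalization above can be summarized in a short NumPy sketch; formulas (11) and (12) are reconstructed from the surrounding description (divide by the larger of width and height, scale to the target size, multiply by 0.9), so the scaling step should be read as an assumption rather than the exact published formula.

import numpy as np

def normalize_keypoints(x_points: np.ndarray, y_points: np.ndarray,
                        m_width: int = 96, m_height: int = 96):
    """Normalize face key point coordinates following formulas (1)-(14)."""
    # Formulas (1)-(6): bounding box of the key points.
    min_x, max_x = x_points.min(), x_points.max()
    min_y, max_y = y_points.min(), y_points.max()
    width, height = max_x - min_x, max_y - min_y
    # Formulas (7)-(10): move the center point of the key points to the origin.
    mean_x = (max_x + min_x) / 2
    mean_y = (max_y + min_y) / 2
    x_points = x_points - mean_x
    y_points = y_points - mean_y
    # Formulas (11)-(12) (reconstructed): divide by max(width, height), scale to
    # the target size and multiply by 0.9 to keep points away from the edge.
    scale = max(width, height)
    x_points = x_points / scale * m_width * 0.9
    y_points = y_points / scale * m_height * 0.9
    # Formulas (13)-(14): shift so coordinates fall inside (0, m_width) x (0, m_height).
    x_points = x_points + m_width / 2
    y_points = y_points + m_height / 2
    return x_points, y_points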
According to an embodiment of the present invention, the expression coefficient model is obtained by training, for example, the following ways: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and learning the face key points corresponding to the marked face image to obtain the expression coefficient model.
In the embodiment of the invention, in order to ensure the generalization of the model, multiple expressions of multiple persons need to be captured so that the network can learn more expression characteristics. Specifically, the recorded persons are asked to make various exaggerated expressions such as opening the mouth, pouting, smiling and blinking, and pictures in a normal speaking state can be recorded at the same time, forming a face image set. For each face image in the face image set, key point extraction is required to obtain the face key points corresponding to each face image; the key point extraction can be performed according to the method described in step S101, finally obtaining the face key points corresponding to each face image. Similarly, after extracting the face key points and before labeling them, the face key points corresponding to each face image can be normalized according to the normalization method described above.
Before learning the correspondence between the face key points and the expression coefficient values, the face key points corresponding to each face image need to be labeled with the blendshape values (mixed deformation values) of that face image. To determine the blendshape value of a face image, training data is first produced: the input of the network is a face picture, and a blendshape value corresponding to each picture is required. Blendshape refers to a technique for combining a number of predefined shapes in arbitrary proportions on a single mesh; it is called a morph target in some animation software such as Maya and 3ds Max, and blending the deformation values of several shapes can express expressions such as smiling, frowning and closing the eyes. At present, the 3DMM (3D Morphable Model) field can reconstruct the three-dimensional image of a person from a single picture; single-picture face reconstruction is based on three-dimensional models in a special format that contain adjustable vertex coordinates, which give the model its deformation capability, as well as blendshape values that control the facial expression. In the embodiment of the present invention, the blendshape value corresponding to each face image in the acquired face image set can be obtained through such a face model reconstruction method. The blendshape value obtained from the model is used to label the face key points corresponding to the face image, serving as the label of those face key points. The labeled face key points are then learned to obtain the expression coefficient model.
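A minimal sketch of how such a labeled training set might be assembled is shown below; extract_keypoints and reconstruct_blendshapes stand in for a face landmark detector and a single-image 3DMM reconstruction method, which the patent does not name.

import numpy as np

def build_training_set(images, extract_keypoints, reconstruct_blendshapes):
    """Pair face key points with blendshape labels for expression model training.

    extract_keypoints and reconstruct_blendshapes are hypothetical callables
    standing in for a face landmark detector and a single-image 3DMM
    reconstruction method; the patent does not name concrete implementations.
    """
    samples, labels = [], []
    for image in images:
        keypoints = extract_keypoints(image)          # e.g. 300 (x, y) key points
        blendshapes = reconstruct_blendshapes(image)  # blendshape values used as the label
        samples.append(np.asarray(keypoints, dtype=np.float32))
        labels.append(np.asarray(blendshapes, dtype=np.float32))
    return np.stack(samples), np.stack(labels)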
According to yet another embodiment of the present invention, the pre-trained expression coefficient models include a blinking expression coefficient model and a non-blinking expression coefficient model. The blinking expression coefficient model is only used to calculate the expression coefficient value corresponding to the blinking expression, while the non-blinking expression coefficient model is used to calculate the expression coefficient values corresponding to all other facial expressions. In the embodiment of the invention, separate networks are adopted to learn the relationships of different parts of the face. When the expression of the virtual character is driven, opening the mouth can affect the eyes, so the invention proposes learning the mapping between the face key points and the facial expression part by part. In addition, when a person blinks one eye the other eye also closes slightly, but it is not desirable for this effect to appear on the 3D model, so the invention proposes that the left and right eyes each learn their own expression coefficient value to obtain the blinking expression coefficient. That is, the blinking expression coefficient model of the present invention actually consists of two models, which calculate the expression coefficient values of the left eye and the right eye respectively.
According to one embodiment of the present invention, when the face key points are input into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image, the method specifically includes:
extracting eye key points from the face key points, and inputting the eye key points into the blink expression coefficient model to obtain blink expression coefficient values corresponding to the face image;
inputting the face key points into the non-blink expression coefficient model to obtain a non-blink expression coefficient value corresponding to the face image;
and obtaining an expression coefficient value corresponding to the face image according to the blink expression coefficient value and the non-blink expression coefficient value.
Specifically, the blink expression coefficient model may be trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and extracting eye key points from the face key points corresponding to the marked face image, and learning the eye key points through a deep learning optimization algorithm to obtain the blink expression coefficient model. When the blink expression coefficient model is trained, only the coordinate information of dozens of eye key points related to eyes in the marked face key points is needed to be learned.
Fig. 4 is a schematic network structure diagram of a blink expression coefficient model according to an embodiment of the present invention. As shown in fig. 4, the network of the blink expression coefficient model mainly uses a three-layer fully connected network with an activation (ReLU) function between the fully connected layers. Because the network has few input parameters, only the coordinate information of a few dozen eye key points, overly complex network layers are not needed. When the blink expression coefficient model is trained, the loss function is set to the MSE (mean squared error) loss, and the output of the model is a 1 × 1 blink control parameter weight value. The loss is therefore given by formula (15) below, where y is the label value corresponding to the face key points and y' is the predicted expression coefficient value:
loss=MSE(y,y') (15)。
the training method of the blink expression coefficient model can select a deep learning optimization algorithm Adam training method, wherein opt params As a network parameter, weight decay As a regularization parameter, the blink expression coefficient model opt is as follows, formula (16):
opt=Adam(opt params ,weight decay =0.0001) (16)。
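A possible PyTorch sketch of this blink network is given below (the patent does not name a framework); the MSE loss of formula (15) and the Adam optimizer with weight_decay=0.0001 of formula (16) follow the description, while the hidden layer widths and the assumed 40 input values (20 eye key points with x and y coordinates) are illustrative guesses.

import torch
import torch.nn as nn

class BlinkModel(nn.Module):
    """Three fully connected layers with ReLU activations in between; the output
    is a single blink control weight. The hidden widths (64, 32) and the input
    size of 40 values (20 eye key points x 2 coordinates) are assumptions."""
    def __init__(self, in_features: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = BlinkModel()
criterion = nn.MSELoss()                                               # loss of formula (15)
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.0001)  # formula (16)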
according to the embodiment of the invention, the non-blinking expression coefficient model is obtained by training in the following way: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and carrying out coordinate format conversion on the face key points corresponding to the marked face image to obtain a square matrix, and learning the square matrix through a deep learning optimization algorithm to obtain the non-blinking expression coefficient model. The non-blinking expression coefficient model mainly learns the expression mapping relations of the human face except blinking, so that all key points of the human face can be input to train the network model.
When the coordinate format of the face key points corresponding to the labeled face image is converted into a square matrix, if the coordinates cannot be converted into a square matrix directly, they are padded with 0 values until the key point coordinate set, consisting of the coordinates of the labeled face key points and the padded 0 values, can be converted into a square matrix. For example, continuing the foregoing embodiment, 300 face key points are extracted, whose abscissas x and ordinates y give 600 coordinate values, which cannot be directly converted into a square matrix, so some 0 values need to be added. Specifically, in the embodiment of the present invention, 300 values from non-face-key-point pixels (for example, the pixel value of the black pixels in fig. 3) can be added to form a key point coordinate set, so that the 900 values in the set can be format-converted into a 30 × 30 square matrix and fed into the network, which is convenient for training. This can be implemented with a function in python: the 900 values in the key point coordinate set are arranged into a square matrix with 30 rows and 30 columns, i.e. a grayscale image, and sent to the network for training.
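The zero-padding and reshaping step can be sketched as follows; how the 600 x and y values are ordered before padding is not specified in the patent, so the simple concatenation used here is an assumption.

import numpy as np

def keypoints_to_square(x_points: np.ndarray, y_points: np.ndarray, side: int = 30) -> np.ndarray:
    """Pack 300 (x, y) key points (600 values) into a 30 x 30 matrix, padding
    the remaining 300 slots with zeros as described above. The concatenation
    order of the x and y values is an assumption."""
    coords = np.concatenate([x_points, y_points]).astype(np.float32)  # 600 values
    padded = np.zeros(side * side, dtype=np.float32)                  # 900 slots in total
    padded[:coords.size] = coords                                     # the last 300 stay 0
    return padded.reshape(side, side)                                 # grayscale-image-like input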
Fig. 5 is a schematic network structure diagram of a non-blinking expression coefficient model according to an embodiment of the present invention. As shown in fig. 5, the input of the non-blinking expression coefficient model is the position coordinates of the key points of the whole face. To meet the network input format, the coordinate information of the key points is converted into a 30 × 30 picture. The network structure of the non-blinking expression coefficient model includes convolutional layers (the first 5 network layers) and fully connected layers (the last 3 network layers); for example, 25 × 8 means that the convolution kernel size is 8 × 8 and the number of convolution kernels is 25, and 1 × 1800 means that the fully connected layer has 1800 parameters. The output of the network is a 1 × 51 vector, which contains the expression coefficient values of the non-blinking expression coefficient model; the 51 expression coefficient values drive 51 parts of the face respectively, completing actions such as opening the mouth, grinning, pouting and closing the mouth. When training the non-blinking expression coefficient model, MSE is used as the loss function and the Adam optimization method is used to update the parameters; the training method is the same as that of the blinking expression coefficient model, and the resulting non-blinking expression coefficient model opt can also be expressed as formula (16).
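A hedged PyTorch sketch of such a network is shown below. Only the 25 convolution kernels of size 8 × 8, the 30 × 30 input, the 1800-value fully connected stage and the 1 × 51 output come from the description of Fig. 5; the remaining channel counts and fully connected widths are assumptions chosen so the shapes work out.

import torch
import torch.nn as nn

class NonBlinkModel(nn.Module):
    """Five convolutional layers followed by three fully connected layers,
    ending in 51 expression coefficients. Only the first layer's 25 kernels of
    size 8 x 8, the 30 x 30 input, the 1800-value fully connected stage and the
    1 x 51 output follow the description; the remaining channel counts and
    fully connected widths are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 25, kernel_size=8), nn.ReLU(),   # 30x30 -> 23x23
            nn.Conv2d(25, 32, kernel_size=3), nn.ReLU(),  # -> 21x21
            nn.Conv2d(32, 16, kernel_size=3), nn.ReLU(),  # -> 19x19
            nn.Conv2d(16, 8, kernel_size=3), nn.ReLU(),   # -> 17x17
            nn.Conv2d(8, 8, kernel_size=3), nn.ReLU(),    # -> 15x15
        )
        self.head = nn.Sequential(
            nn.Flatten(),                     # 8 * 15 * 15 = 1800 values
            nn.Linear(1800, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 51),               # 1 x 51 expression coefficient vector
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x shape: (batch, 1, 30, 30)
        return self.head(self.features(x))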
In addition, when the non-blinking expression coefficient model is trained, the expression coefficient value corresponding to the mouth-opening expression is considered particularly important. Therefore, the loss function comprises a first loss function corresponding to the full-face expression except for the blinking expression and a second loss function corresponding to the mouth-opening expression, as shown in formula (17) below:
Loss=MSE(y,y')+λ*MSE(mouth_open_label,mouth_open_predict) (17).
In the above formula (17), the first term MSE(y, y') is the first loss function corresponding to the full-face expression except for the blinking expression, where y is the label value corresponding to the face key points and y' is the predicted expression coefficient value; the second term is the second loss function corresponding to the mouth-opening expression, where λ is a weight coefficient, mouth_open_label is the label value of the mouth-opening expression coefficient and mouth_open_predict is the predicted value of the mouth-opening expression coefficient. In the embodiment of the invention, λ can be set to 0.3, which ensures that the learned mouth-opening value is more accurate and carries a larger weight.
Step S103: and performing expression driving on the virtual character according to the expression coefficient value.
According to steps S101 to S103, a face image is acquired and key points are extracted from it to obtain face key points; the face key points are input into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and the virtual character is expression-driven according to the expression coefficient value. By learning the correspondence between the position coordinates of the face key points and the expression coefficients, the expression of the virtual character is driven so that the virtual character model shows the same expression as the real person; the virtual character can thus be driven realistically, naturally, accurately and quickly, with low network consumption, a high driving speed and strong generalization.
Fig. 6 is a schematic diagram of main blocks of an apparatus for driving an expression of a virtual character according to an embodiment of the present invention. As shown in fig. 6, the apparatus 600 for driving the expression of the virtual character according to the embodiment of the present invention mainly includes a key point obtaining module 601, a coefficient value calculating module 602, and an expression driving module 603.
A key point obtaining module 601, configured to obtain a face image, and perform key point extraction on the face image to obtain a face key point;
a coefficient value calculation module 602, configured to input the facial key points into a pre-trained expression coefficient model, so as to obtain an expression coefficient value corresponding to the facial image;
and the expression driving module 603 is configured to perform expression driving on the virtual character according to the expression coefficient value.
According to an embodiment of the present invention, the key point obtaining module 601 may further be configured to: the face image is input into a pre-trained face key point recognition model for key point extraction, and the face key point recognition model is obtained by training the face image subjected to key point labeling.
According to another embodiment of the present invention, the apparatus 600 for driving the expression of the virtual character may further include a key point normalization processing module (not shown in the figure) for: and before the face key points are input into a pre-trained expression coefficient model, limiting the face image into a picture with a specified size, and carrying out normalization processing on the face key points.
According to yet another embodiment of the present invention, the keypoint normalization processing module (not shown in the figure) may be further configured to: acquiring the horizontal coordinate and the vertical coordinate of the central point of the key point of the face as well as the height and the width of the face image; for each face key point, calculating the normalized abscissa of the face key point according to the abscissa of the face key point, the abscissa of the central point and the height and width of a face image, and calculating the normalized ordinate of the face key point according to the ordinate of the face key point, the ordinate of the central point and the height and width of the face image; and translating the central point of the face key point to the upper left corner of the face image with the specified size so as to translate the normalized horizontal coordinate and the normalized vertical coordinate, thereby completing the normalization processing of the face key point.
According to another embodiment of the present invention, the expression coefficient model is trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and learning the face key points corresponding to the marked face image to obtain the expression coefficient model.
According to yet another embodiment of the present invention, the pre-trained expression coefficient models include a blinking expression coefficient model and a non-blinking expression coefficient model;
coefficient value calculation module 602 may also be configured to: extracting eye key points from the face key points, and inputting the eye key points into the blink expression coefficient model to obtain blink expression coefficient values corresponding to the face image; inputting the face key points into the non-blink expression coefficient model to obtain a non-blink expression coefficient value corresponding to the face image; and obtaining an expression coefficient value corresponding to the face image according to the blink expression coefficient value and the non-blink expression coefficient value.
According to another embodiment of the present invention, the blink expression coefficient model is trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and extracting eye key points from the face key points corresponding to the marked face image, and learning the eye key points through a deep learning optimization algorithm to obtain the blink expression coefficient model.
According to a further embodiment of the invention, the non-blinking expression coefficient model is trained by: acquiring a face image set; extracting key points of each face image to obtain face key points corresponding to each face image; for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value; and carrying out coordinate format conversion on the face key points corresponding to the marked face image to obtain a square matrix, and learning the square matrix through a deep learning optimization algorithm to obtain the non-blinking expression coefficient model.
According to another embodiment of the invention, the loss functions of the non-blinking expression coefficient model during training comprise a first loss function corresponding to a full-face expression except for a blinking expression and a second loss function corresponding to a mouth-opening expression.
According to another embodiment of the present invention, the apparatus 600 for driving the expression of the virtual character may further include a coordinate format conversion module (not shown in the drawings) for: before coordinate format conversion is carried out on the face key points corresponding to the marked face images to obtain a square matrix, under the condition that the coordinates of the face key points corresponding to the marked face images cannot be directly converted into the square matrix, 0 value is used for supplementing until a key point coordinate set consisting of the coordinates of the face key points corresponding to the marked face images and the 0 value is obtained, and coordinate format conversion can be carried out to obtain the square matrix.
According to the technical scheme of the embodiment of the invention, a face image is acquired and key points are extracted from it to obtain face key points; the face key points are input into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and the virtual character is expression-driven according to the expression coefficient value. By learning the correspondence between the position coordinates of the face key points and the expression coefficients, the expression of the virtual character is driven so that the virtual character model shows the same expression as the real person; the virtual character can thus be driven realistically, naturally, accurately and quickly, with low network consumption, a high driving speed and strong generalization.
Fig. 7 illustrates an exemplary system architecture 700 of a method of driving an expression of a virtual character or an apparatus for driving an expression of a virtual character to which an embodiment of the present invention can be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as an image processing application, an image key point extraction application, an image feature extraction application, photographing software, and the like (for example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server that provides various services, such as a background management server (for example only) that supports virtual character emoji driver requests from users using the terminal devices 701, 702, 703. The background management server can acquire a face image from data such as a received virtual character expression driving request and extract key points of the face image to obtain face key points; inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and performing expression driving and other processing on the virtual character according to the expression coefficient value, and feeding back a processing result (such as a virtual character expression display result-only an example) to the terminal equipment.
It should be noted that the method for driving the expression of the virtual character provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the apparatus for driving the expression of the virtual character is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device or server implementing an embodiment of the present invention. The terminal device or the server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present invention may be implemented by software or by hardware. The described units or modules may also be provided in a processor, and may be described as: a processor comprising a key point acquisition module, a coefficient value calculation module, and an expression driving module. In some cases, the names of these units or modules do not constitute a limitation on the units or modules themselves; for example, the key point acquisition module may also be described as a "module for acquiring a face image and performing key point extraction on the face image to obtain face key points".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a face image, and perform key point extraction on the face image to obtain face key points; input the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and perform expression driving on the virtual character according to the expression coefficient value.
According to the technical scheme of the embodiments of the present invention, a face image is acquired and key point extraction is performed on it to obtain face key points; the face key points are input into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image; and the virtual character is expression-driven according to the expression coefficient value. Because the expression coefficient model learns the correspondence between the position coordinates of the face key points and the expression coefficient values, the virtual character model shows the same expression as the real person, so that the virtual character can be driven naturally, accurately, and quickly, with low network consumption, high driving speed, and strong generalization.
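As an aid to readers implementing a comparable pipeline, the following Python sketch illustrates the claimed flow of key point extraction, coefficient prediction, and expression driving. It is illustrative only: the keypoint_detector, expression_model, and avatar.apply_blendshapes names are hypothetical placeholders, not APIs disclosed in this application.

```python
import numpy as np

def drive_avatar_expression(face_image: np.ndarray,
                            keypoint_detector,
                            expression_model,
                            avatar):
    # Step 1: extract face key points from the captured image.
    keypoints = keypoint_detector(face_image)      # assumed to return an (N, 2) array

    # Step 2: map the key points to expression coefficient values with the
    # pre-trained expression coefficient model.
    coefficients = expression_model(keypoints)     # assumed coefficient vector

    # Step 3: drive the virtual character by applying the coefficient values
    # to its face rig (hypothetical avatar API).
    avatar.apply_blendshapes(coefficients)
    return coefficients
```

In such a setup the expression coefficient values act as the blendshape (mixed deformation) weights of the avatar's face rig, which is what lets the virtual character mirror the captured expression.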
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method for driving an expression of a virtual character, comprising:
acquiring a face image, and extracting key points of the face image to obtain face key points;
inputting the face key points into a pre-trained expression coefficient model to obtain an expression coefficient value corresponding to the face image;
and performing expression driving on the virtual character according to the expression coefficient value.
2. The method of claim 1, wherein extracting key points from the face image comprises:
inputting the face image into a pre-trained face key point recognition model for key point extraction, wherein the face key point recognition model is obtained by training on face images subjected to key point labeling.
3. The method of claim 1, wherein, before inputting the face key points into the pre-trained expression coefficient model, the method further comprises:
constraining the face image to a picture of a specified size, and performing normalization processing on the face key points.
4. The method of claim 3, wherein normalizing the face key points comprises:
acquiring the horizontal coordinate and the vertical coordinate of the central point of the face key points, as well as the height and the width of the face image;
for each face key point, calculating the normalized abscissa of the face key point according to the abscissa of the face key point, the abscissa of the central point, and the height and width of the face image, and calculating the normalized ordinate of the face key point according to the ordinate of the face key point, the ordinate of the central point, and the height and width of the face image;
and translating the central point of the face key points to the upper left corner of the face image of the specified size so as to translate the normalized horizontal coordinates and the normalized vertical coordinates, thereby completing the normalization processing of the face key points.
5. The method of claim 1, wherein the expression coefficient model is trained by:
acquiring a face image set;
extracting key points of each face image to obtain face key points corresponding to each face image;
for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value;
and learning the face key points corresponding to the marked face image to obtain the expression coefficient model.
6. The method of claim 1, wherein the pre-trained expression coefficient models comprise a blinking expression coefficient model and a non-blinking expression coefficient model;
wherein inputting the face key points into the pre-trained expression coefficient model to obtain the expression coefficient value corresponding to the face image comprises:
extracting eye key points from the face key points, and inputting the eye key points into the blink expression coefficient model to obtain blink expression coefficient values corresponding to the face image;
inputting the face key points into the non-blink expression coefficient model to obtain a non-blink expression coefficient value corresponding to the face image;
and obtaining an expression coefficient value corresponding to the face image according to the blink expression coefficient value and the non-blink expression coefficient value.
7. The method of claim 6, wherein the blink expression coefficient model is trained by:
acquiring a face image set;
extracting key points of each face image to obtain face key points corresponding to each face image;
for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value;
and extracting eye key points from the face key points corresponding to the marked face image, and learning the eye key points through a deep learning optimization algorithm to obtain the blink expression coefficient model.
8. The method of claim 6, wherein the non-blinking expression coefficient model is trained by:
acquiring a face image set;
extracting key points of each face image to obtain face key points corresponding to each face image;
for each face image, obtaining a mixed deformation value of the face image based on a face model reconstruction method, and marking face key points corresponding to the face image by using the mixed deformation value;
and carrying out coordinate format conversion on the face key points corresponding to the marked face image to obtain a square matrix, and learning the square matrix through a deep learning optimization algorithm to obtain the non-blinking expression coefficient model.
9. The method of claim 8, wherein, in training the non-blinking expression coefficient model, the loss functions include a first loss function corresponding to full-face expressions other than the blinking expression and a second loss function corresponding to a mouth-opening expression.
10. The method of claim 8, wherein before transforming the coordinate format of the face key points corresponding to the marked face image to obtain the square matrix, the method further comprises:
in the case that the coordinates of the face key points corresponding to the marked face image cannot be directly converted into the square matrix, padding with the value 0 until the key point coordinate set consisting of the coordinates of the face key points corresponding to the marked face image and the padded zero values can be subjected to coordinate format conversion to obtain the square matrix.
11. An apparatus for driving an expression of a virtual character, comprising:
the key point acquisition module is used for acquiring a face image and extracting key points of the face image to obtain face key points;
the coefficient value calculation module is used for inputting the face key points into a pre-trained expression coefficient model to obtain expression coefficient values corresponding to the face image;
and the expression driving module is used for performing expression driving on the virtual character according to the expression coefficient value.
12. An electronic device for driving an expression of a virtual character, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method according to any one of claims 1-10.
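The sketches below are illustrative readings of individual claim steps; none of the code is disclosed in the application, and every formula or index choice is an assumption. First, one plausible implementation of the key point normalization of claim 4, in which the key points are centred on their mean and scaled by the image width and height; the exact formula, and what "translating the centre to the upper left corner" means numerically, are defined in the description, so treat this as a sketch only.

```python
import numpy as np

def normalize_keypoints(keypoints: np.ndarray, img_h: int, img_w: int) -> np.ndarray:
    """keypoints: (N, 2) array of (x, y) pixel coordinates on the face image."""
    cx = keypoints[:, 0].mean()   # abscissa of the key-point centre
    cy = keypoints[:, 1].mean()   # ordinate of the key-point centre
    norm = np.empty_like(keypoints, dtype=np.float64)
    norm[:, 0] = (keypoints[:, 0] - cx) / img_w   # normalized abscissa (assumed formula)
    norm[:, 1] = (keypoints[:, 1] - cy) / img_h   # normalized ordinate (assumed formula)
    # Translating the key-point centre to the upper-left corner of the
    # specified-size picture is read here as shifting the centre to the origin.
    return norm
```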
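Next, a sketch of the two-model split of claim 6. The eye landmark indices and the way the blink and non-blink coefficient sets are merged are assumptions made for illustration.

```python
import numpy as np

# Assumed eye landmark indices for a 68-point layout; the application does not
# fix the landmark scheme, so these indices are placeholders.
EYE_INDICES = list(range(36, 48))

def predict_expression_coefficients(keypoints: np.ndarray,
                                    blink_model,
                                    non_blink_model) -> dict:
    eye_points = keypoints[EYE_INDICES]         # eye key points only
    blink_coeffs = blink_model(eye_points)      # e.g. blink-related coefficients
    other_coeffs = non_blink_model(keypoints)   # all remaining expression coefficients
    # One plausible way to combine the two outputs: take blink coefficients from
    # the blink model and every other coefficient from the non-blink model.
    return {**other_coeffs, **blink_coeffs}
```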
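Finally, a sketch of the coordinate format conversion of claims 8 and 10: the key point coordinates are flattened, padded with zeros when they cannot directly form a square, and reshaped into a square matrix for the non-blinking expression coefficient model.

```python
import math
import numpy as np

def to_square_matrix(keypoints: np.ndarray) -> np.ndarray:
    flat = keypoints.reshape(-1).astype(np.float64)        # (x1, y1, x2, y2, ...)
    side = math.isqrt(flat.size)
    if side * side < flat.size:                            # cannot form a square directly
        side += 1
        flat = np.pad(flat, (0, side * side - flat.size))  # supplement with the value 0
    return flat.reshape(side, side)

# Example: 68 key points give 136 coordinates, which are padded to 144 = 12 x 12.
```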
CN202210518926.4A 2022-05-13 2022-05-13 Method and device for driving expression of virtual character Pending CN114821734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210518926.4A CN114821734A (en) 2022-05-13 2022-05-13 Method and device for driving expression of virtual character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210518926.4A CN114821734A (en) 2022-05-13 2022-05-13 Method and device for driving expression of virtual character

Publications (1)

Publication Number Publication Date
CN114821734A true CN114821734A (en) 2022-07-29

Family

ID=82512661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210518926.4A Pending CN114821734A (en) 2022-05-13 2022-05-13 Method and device for driving expression of virtual character

Country Status (1)

Country Link
CN (1) CN114821734A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393488A (en) * 2022-10-28 2022-11-25 北京百度网讯科技有限公司 Method and device for driving virtual character expression, electronic equipment and storage medium
CN115393488B (en) * 2022-10-28 2023-03-03 北京百度网讯科技有限公司 Method and device for driving virtual character expression, electronic equipment and storage medium
CN115601484A (en) * 2022-11-07 2023-01-13 广州趣丸网络科技有限公司(Cn) Virtual character face driving method and device, terminal equipment and readable storage medium
CN115908655A (en) * 2022-11-10 2023-04-04 北京鲜衣怒马文化传媒有限公司 Virtual character facial expression processing method and device
CN115937372A (en) * 2022-12-19 2023-04-07 北京字跳网络技术有限公司 Facial expression simulation method, device, equipment and storage medium
CN115937372B (en) * 2022-12-19 2023-10-03 北京字跳网络技术有限公司 Facial expression simulation method, device, equipment and storage medium
CN116612512A (en) * 2023-02-02 2023-08-18 北京甲板智慧科技有限公司 Facial expression image processing method and device based on monocular RGB camera
CN116977515A (en) * 2023-08-08 2023-10-31 广东明星创意动画有限公司 Virtual character expression driving method
CN116977515B (en) * 2023-08-08 2024-03-15 广东明星创意动画有限公司 Virtual character expression driving method

Similar Documents

Publication Publication Date Title
CN114821734A (en) Method and device for driving expression of virtual character
CN107578017B (en) Method and apparatus for generating image
CN110503703B (en) Method and apparatus for generating image
CN110458918B (en) Method and device for outputting information
US11670015B2 (en) Method and apparatus for generating video
US9589357B2 (en) Avatar-based video encoding
CN108537152A (en) Method and apparatus for detecting live body
CN107609506B (en) Method and apparatus for generating image
CN113327278B (en) Three-dimensional face reconstruction method, device, equipment and storage medium
CN113994384A (en) Image rendering using machine learning
US20210042504A1 (en) Method and apparatus for outputting data
WO2022089166A1 (en) Facial image processing method and apparatus, facial image display method and apparatus, and device
US20220358675A1 (en) Method for training model, method for processing video, device and storage medium
CN112017257B (en) Image processing method, apparatus and storage medium
JP7401606B2 (en) Virtual object lip driving method, model training method, related equipment and electronic equipment
WO2023035531A1 (en) Super-resolution reconstruction method for text image and related device thereof
CN108985228A (en) Information generating method and device applied to terminal device
US20230047748A1 (en) Method of fusing image, and method of training image fusion model
CN114463470A (en) Virtual space browsing method and device, electronic equipment and readable storage medium
CN110516598A (en) Method and apparatus for generating image
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN114399424A (en) Model training method and related equipment
CN115914505B (en) Video generation method and system based on voice-driven digital human model
US20220198828A1 (en) Method and apparatus for generating image
CN115880400A (en) Cartoon digital human image generation method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination