CN113537056A - Avatar driving method, apparatus, device, and medium - Google Patents

Avatar driving method, apparatus, device, and medium

Info

Publication number
CN113537056A
CN113537056A (Application CN202110800699.XA)
Authority
CN
China
Prior art keywords
expression
expression base
facial
base coefficient
avatar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110800699.XA
Other languages
Chinese (zh)
Inventor
卫华威
韩欣彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202110800699.XA priority Critical patent/CN113537056A/en
Publication of CN113537056A publication Critical patent/CN113537056A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The embodiments of the present invention disclose an avatar driving method, apparatus, device, and medium. The method includes: acquiring a face image in a live video frame in real time; recognizing, based on deep learning, an expression base coefficient corresponding to the face image, the expression base coefficient matching a preset target avatar expression base group; and applying the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group. With this technical solution, built on deep learning, an inexpensive camera capturing live images in real time is all that is needed for the face image to drive the avatar's expression; no professional equipment is required, which reduces the cost of driving an avatar for personalized real-time live streaming.

Description

Avatar driving method, apparatus, device, and medium
Technical Field
The embodiment of the invention relates to the field of computer vision, in particular to a method, a device, equipment and a medium for driving an avatar.
Background
With the popularity of the live streaming industry, more and more people have entered it as anchors. However, problems often arise: an anchor may lack confidence in his or her own appearance and worry that directly showing it on camera will hurt the live broadcast, and an anchor who always streams with a single, fixed persona lacks variety, which rules out many interesting live-streaming formats. Virtual live streaming emerged to address this: behind the scenes, the anchor drives an avatar to carry out a personalized live broadcast.
At present, some anchors drive avatars for personalized live streaming using AR (Augmented Reality) technology, but AR equipment is expensive and cannot be widely adopted. How to reduce the cost of driving an avatar for personalized real-time live streaming is therefore a problem to be solved urgently.
Disclosure of Invention
The embodiments of the present invention provide an avatar driving method, apparatus, device, and medium, which realize real-time driving of avatar expressions and reduce the need for professional equipment.
In a first aspect, an embodiment of the present invention provides an avatar driving method, including:
acquiring a face image in a live video frame in real time;
recognizing an expression base coefficient corresponding to the face image based on a deep learning mode; the expression base coefficient is matched with a preset target avatar expression base group;
and applying the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group.
In a second aspect, an embodiment of the present invention further provides an avatar driving apparatus, including:
the real-time face image acquisition module is used for acquiring a face image in a live video frame in real time;
the facial expression recognition module is used for recognizing an expression base coefficient corresponding to the facial image based on a deep learning mode; the expression base coefficient is matched with a preset target avatar expression base group;
and the avatar driving module is used for applying the expression base coefficient to the target avatar expression base group so as to drive the expression of the avatar corresponding to the target avatar expression base group.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
a memory for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the avatar driving method of any of the embodiments.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the avatar driving method according to any embodiment.
In the technical solution provided by the embodiments of the present invention, for a face image in a live video frame acquired in real time, an expression base coefficient is recognized based on deep learning, and the recognized expression base coefficient is applied to the corresponding target avatar expression base group to drive the expression of the corresponding avatar. With this technical solution, built on deep learning, an inexpensive camera capturing live images in real time is all that is needed for the face image to drive the avatar's expression; no professional equipment is required, which reduces the cost of driving an avatar for personalized real-time live streaming.
Drawings
Fig. 1 is a flowchart of an avatar driving method according to an embodiment of the present invention;
FIG. 2 is a diagram of a target avatar expression base set according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating an effect of preprocessing a face image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a face-driven avatar expression according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a training procedure of a facial expression recognition model according to a second embodiment of the present invention;
fig. 6 is a schematic network structure diagram of a facial expression recognition model according to a second embodiment of the present invention;
fig. 7 is a schematic diagram of a test result of a face-driven avatar expression according to a second embodiment of the present invention;
fig. 8 is a schematic diagram of a test result of a face-driven avatar expression according to a second embodiment of the present invention;
fig. 9 is a schematic block diagram of an avatar driving apparatus according to a third embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for the sake of convenience, the drawings only show some structures related to the present invention, not all structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of an avatar driving method according to an embodiment of the present invention. The embodiment is applicable to the situation where an anchor drives an avatar's expression in a live broadcast room. The method can be executed by an avatar driving apparatus according to any embodiment of the present invention; the apparatus may be composed of hardware and/or software and may generally be integrated in a computer device.
As shown in fig. 1, the avatar driving method provided in the present embodiment includes the steps of:
and S110, acquiring a face image in a live video frame in real time.
Live video frames refer to video frames in video data acquired in a live room. Alternatively, the live video frames may be single-view live video frames acquired using a single camera. The face image in the live video frame refers to the face image of the anchor in the live room.
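As an illustrative sketch only (not part of the patent text), the following snippet shows one way this single-camera acquisition could be wired up; the Haar-cascade face detector and the simple crop are assumptions, since the embodiment does not fix a particular detection method.

```python
import cv2

# Acquire live video frames from an ordinary, inexpensive camera.
cap = cv2.VideoCapture(0)
# Assumed detector; any face detector that yields a face region would do.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

while cap.isOpened():
    ok, frame = cap.read()                    # one live video frame
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        face_image = frame[y:y + h, x:x + w]  # face image of the anchor in this frame
        # ... hand face_image on to expression base coefficient recognition (S120)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # stop key for the sketch
        break
cap.release()
```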
S120: recognizing, based on deep learning, an expression base coefficient corresponding to the face image; the expression base coefficient matches a preset target avatar expression base group.
An expression base group is a combination of a plurality of expression bases. In the expression base group, each expression base identifies one aspect of an expression, such as whether the mouth is open and how wide, or whether the eyes are open and how much.
The target avatar expression base group is the expression base group corresponding to a given avatar; it may be a general avatar expression base group or an individual one. Different avatar expression base groups correspond to different facial expression characteristics. In this embodiment, the avatar corresponding to the preset target avatar expression base group is the avatar that the anchor wishes to drive, that is, the avatar that replaces the anchor's own appearance in the virtual live broadcast.
The target avatar expression base group shown in fig. 2 includes 10 expression bases, each representing a different expression. All expression bases share the same mesh structure. A mesh is a 3D model storage format that generally consists of fixed vertices and edges, and each vertex and edge carries fixed semantic information; for example, the xx vertex always represents the left mouth corner.
Deep learning is a research direction within machine learning: by learning the internal rules and representation levels of sample data, a machine can acquire human-like analysis and learning abilities and recognize data such as text, images and sound. In this embodiment, the expression base coefficients corresponding to the face image are recognized by means of deep learning.
Optionally, the expression base coefficient corresponding to the face image may be recognized by a pre-trained facial expression recognition model. The facial expression recognition model may adopt a lightweight network to ensure real-time performance of face image expression recognition. Illustratively, taking the real-time performance and efficiency of the model into account, the facial expression recognition model may employ a network structure such as MobileNet v2.
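A minimal sketch of such a lightweight coefficient regressor, assuming PyTorch with a recent torchvision and a MobileNet v2 backbone as suggested above; the number of expression bases (52 here) and the sigmoid output layer are illustrative assumptions, not taken from the embodiment.

```python
import torch
import torch.nn as nn
from torchvision import models

N_BASES = 52  # assumed size of the target avatar expression base group

class ExpressionCoefficientNet(nn.Module):
    """MobileNetV2 features followed by a small regression head."""
    def __init__(self, n_bases: int = N_BASES):
        super().__init__()
        backbone = models.mobilenet_v2(weights=None)   # lightweight backbone
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(backbone.last_channel, n_bases)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.features(x)).flatten(1)
        # each expression base coefficient lies in [0, 1]
        return torch.sigmoid(self.head(x))

model = ExpressionCoefficientNet()
coeffs = model(torch.randn(1, 3, 224, 224))            # shape (1, N_BASES)
```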
For a face image in a live video frame acquired in real time, the expression base coefficient corresponding to the face image is learned with respect to the preset target avatar expression base group. Combining the expression bases of the target avatar expression base group according to this coefficient produces an avatar corresponding to the face image, whose expression is consistent with the expression shown in the face image. Note that, since the target avatar expression base group includes a plurality of expression bases, the recognized expression base coefficient likewise consists of a plurality of coefficients, one for each expression base in the group.
As an optional implementation, recognizing an expression base coefficient corresponding to the face image based on deep learning may specifically be:
and inputting the facial image into a pre-trained facial expression recognition model to obtain an expression base coefficient corresponding to the facial image.
In this embodiment, the face image is input into a pre-trained facial expression recognition model, and the model outputs the expression base coefficient corresponding to the face image, that is, a multi-dimensional coefficient vector whose dimensionality matches the number of expression bases in the preset target avatar expression base group.
Before the face image is input into the pre-trained facial expression recognition model, data normalization, face key point detection and matting processing can be performed on the face image, and the method for detecting the face key points is not specifically limited in this embodiment. For example, the effect of face keypoint detection and matting processing on a face image can be seen in fig. 3.
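A rough sketch of this preprocessing chain, assuming OpenCV; detect_landmarks() is a hypothetical stand-in for whichever face key point detector is chosen, which the embodiment deliberately leaves open.

```python
import cv2
import numpy as np

def preprocess_face(face_image: np.ndarray, detect_landmarks) -> np.ndarray:
    """Key point detection, matting of the background, and data normalization."""
    landmarks = detect_landmarks(face_image)               # (K, 2) key points, assumed
    hull = cv2.convexHull(landmarks.astype(np.int32))      # face contour from key points
    mask = np.zeros(face_image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)
    matted = cv2.bitwise_and(face_image, face_image, mask=mask)  # background removed
    resized = cv2.resize(matted, (224, 224))
    return resized.astype(np.float32) / 255.0              # normalize pixel values to [0, 1]
```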
Performing data normalization, face key point detection and matting on the face image before feeding it to the pre-trained facial expression recognition model reduces the influence of background factors in the face image on the recognition result and improves the accuracy of the facial expression recognition result.
In this implementation, the facial expression recognition model resides inside the avatar driving apparatus, and the face image is input to the pre-trained facial expression recognition model in real time to obtain the corresponding expression base coefficient.
As another optional implementation, recognizing an expression base coefficient corresponding to the face image based on deep learning may specifically be:
sending expression base coefficient identification request information to a pre-trained facial expression identification model; wherein the expression base coefficient identification request information at least comprises a face image; and receiving an expression base coefficient corresponding to the face image returned by the face expression recognition model.
In this embodiment, the facial expression recognition model may be located outside the avatar driving apparatus and communicatively connected to the avatar driving apparatus.
When the anchor chooses to drive an avatar for personalized live streaming, the anchor-side device, in response to that operation, sends expression base coefficient recognition request information to the pre-trained facial expression recognition model in real time, where the request information contains the face image in the live video frame acquired in real time.
Optionally, before the expression base coefficient recognition request information is sent to the pre-trained facial expression recognition model in real time, data normalization, face key point detection and matting may be performed on the face image, so that the request information carries the preprocessed face image.
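A sketch of what the request/response exchange with the externally deployed facial expression recognition model might look like; the endpoint URL and the JSON field names are purely illustrative assumptions.

```python
import base64
import requests

RECOGNIZER_URL = "http://recognizer.example/expression-base-coefficients"  # assumed endpoint

def request_expression_coefficients(face_image_bytes: bytes) -> list[float]:
    # The request information carries at least the (preprocessed) face image.
    payload = {"face_image": base64.b64encode(face_image_bytes).decode("ascii")}
    resp = requests.post(RECOGNIZER_URL, json=payload, timeout=1.0)  # short timeout: real time
    resp.raise_for_status()
    # The model returns one coefficient per expression base in the target group.
    return resp.json()["coefficients"]
```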
In this implementation, the facial expression recognition model is located outside the avatar driving apparatus, which makes the model easy to maintain and update in a way that is transparent to the avatar driving apparatus.
S130: applying the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group.
After the expression base coefficient is obtained, it is applied to the target avatar expression base group: the avatar expression bases in the group are combined according to the coefficient, generating an avatar whose expression is consistent with the face image.
In this embodiment, the expression base coefficients corresponding to the face image are recognized based on deep learning and used as the weights with which the expression bases in the corresponding target avatar expression base group are superposed, so that the real person's expression is transferred onto the corresponding avatar.
Taking the face image shown in fig. 4 as an example, the expression base coefficients recognized with respect to the target avatar expression base group of fig. 2 are jawOpen = 0.65 and mouthChannel = 0.14 (the coefficients of the remaining expression bases are all 0); applying this coefficient to the target avatar expression base group yields the avatar expression shown in fig. 4.
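The weighted superposition can be sketched as follows; the mesh arrays are placeholders, and only the two coefficients quoted above are non-zero.

```python
import numpy as np

def drive_avatar(neutral_vertices: np.ndarray,
                 base_vertices: dict[str, np.ndarray],
                 coefficients: dict[str, float]) -> np.ndarray:
    """neutral_vertices and each base_vertices[name] are (V, 3) arrays on one shared mesh topology."""
    driven = neutral_vertices.copy()
    for name, weight in coefficients.items():
        # superpose each expression base, weighted by its coefficient
        driven += weight * (base_vertices[name] - neutral_vertices)
    return driven

# coefficients recognized for the face image of fig. 4 (all other bases are 0)
coeffs = {"jawOpen": 0.65, "mouthChannel": 0.14}
```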
By recognizing, based on deep learning, the expression base coefficients for the face images in consecutive live video frames acquired in real time and applying those coefficients to the target avatar expression base group, the anchor in the live broadcast room drives the expression of the avatar corresponding to that expression base group.
In the technical solution provided by the embodiments of the present invention, for a face image in a live video frame acquired in real time, an expression base coefficient is recognized based on deep learning, and the recognized expression base coefficient is applied to the corresponding target avatar expression base group to drive the expression of the corresponding avatar. With this technical solution, built on deep learning, an inexpensive camera capturing live images in real time is all that is needed for the face image to drive the avatar's expression; no professional equipment is required, which reduces the cost of driving an avatar for personalized real-time live streaming.
This technical solution lowers the technical threshold of virtual live streaming: with only a camera, an anchor in a live broadcast room can drive any avatar in real time to make the corresponding expression, improving the interaction between the virtual anchor and the audience. On this basis, a different face for every anchor becomes reality: each anchor can, according to personal preference, faithfully present his or her own joy, anger, sorrow and happiness on different virtual anchors, which greatly enriches the entertainment and fun of live streaming.
Example two
This embodiment builds on the foregoing embodiment and details the structure and training process of the facial expression recognition model. Before the expression base coefficient corresponding to the face image is recognized based on deep learning, the avatar driving method provided by this embodiment further includes training the facial expression recognition model.
As shown in fig. 5, the training process of the facial expression recognition model includes the following steps:
S210: obtaining a face image sample.
The facial image sample refers to a facial image used for training a facial expression recognition model.
To ensure that the facial expression recognition model learns features that generalize well (i.e. features that are broadly applicable), the collected face image samples must cover comprehensive and rich facial expressions, including independent expressions of key facial parts such as the eyes and mouth as well as linked expressions of several parts. To improve the robustness of the facial expression recognition model, the samples should also include real persons' expressions under different head poses. Optionally, the face image samples may be collected with a face data collection tool such as LiveLinkFace.
In this embodiment, face images may be captured as video, and complete face image samples are obtained by screening the individual frames.
Furthermore, after the face image samples are obtained, they can be preprocessed so that the facial expression recognition model learns features more easily. Specifically, image normalization, face key point detection and face matting can be applied to each sample: the key point information extracted by a face key point detection technique determines the face contour, which enables the matting. For example, the effect of face key point detection and matting on a face image can be seen in fig. 3. In addition, data enhancement such as color augmentation can be applied to enrich the face image samples, again helping the facial expression recognition model learn features.
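A minimal color-augmentation sketch, assuming torchvision; the jitter ranges are illustrative and not taken from the embodiment.

```python
from torchvision import transforms

color_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),   # also scales pixel values to [0, 1]
])
# augmented = color_augment(pil_face_image)   # pil_face_image: a PIL.Image face sample
```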
S220: determining an expression base coefficient expected value corresponding to the face image sample.
In this embodiment, the facial expression recognition model is trained with ground-truth supervision (labels in the model training process). Therefore, when a face image sample is obtained, the expression base coefficient expected value corresponding to that sample must also be determined; it serves as the label that supervises the training of the facial expression recognition model.
Optionally, a preset application program interface may be used to identify the ground-truth expression base coefficient of the face image sample, which is taken as the corresponding expression base coefficient expected value. The expression base coefficient expected value corresponding to the face image sample matches the preset target avatar expression base group. The target avatar expression base group is the expression base group corresponding to a given avatar; it may be a general avatar expression base group or an individual one, which is not specifically limited in this embodiment.
S230: training the facial expression recognition model from the face image sample, with the expression base coefficient expected value corresponding to the face image sample as supervision.
In this embodiment, when the facial expression recognition model is trained, the expression base coefficient expected value corresponding to the facial image sample is used as a supervision, so that the expression base coefficient predicted value output by the facial expression recognition model for the facial image sample is close to the corresponding expression base coefficient expected value.
Optionally, the expression base coefficient expected value corresponding to the face image sample is used as a supervision, and a facial expression recognition model is trained according to the face image sample, which may specifically be:
inputting the facial image sample into an untrained facial expression recognition model to obtain an expression base coefficient prediction value; determining a target loss value according to the difference between the expression base coefficient predicted value and the expression base coefficient expected value; and when the change rate of the target loss value is smaller than a preset threshold value, determining that the facial expression recognition model completes training.
The face image sample is input into the untrained facial expression recognition model and, after processing by each of its processing layers, the model outputs the expression base coefficient predicted value for that sample. A target loss value is then determined from the difference between the expression base coefficient predicted value and the expression base coefficient expected value serving as the label, and the facial expression recognition model is trained by minimizing this target loss value.
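A hedged sketch of such a training loop, stopping once the change rate of the target loss value falls below a preset threshold; the Adam optimizer, learning rate and threshold value are assumptions.

```python
import torch

def train(model, loader, loss_fn, max_epochs=100, threshold=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    previous_loss = None
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, expected_coeffs in loader:        # expected values serve as labels
            predicted_coeffs = model(images)
            loss = loss_fn(predicted_coeffs, expected_coeffs)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        # training is complete when the target loss has levelled off
        if previous_loss is not None and abs(previous_loss - epoch_loss) < threshold:
            break
        previous_loss = epoch_loss
    return model
```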
As an optional implementation, inputting the face image sample into an untrained facial expression recognition model to obtain an expression base coefficient predicted value may specifically be:
inputting the facial image sample into a facial expression recognition model, and sequentially processing through each processing layer according to the setting sequence of the processing layers in the facial expression recognition model to obtain an expression base coefficient predicted value;
The processing layers include convolutional layers, deconvolution layers, a pooling layer and a fully connected layer, and a jump connection is established between at least one pair of equal-sized convolutional and deconvolution layers.
As shown in fig. 6, the processing layers in the facial expression recognition model may include convolutional layers, deconvolution layers, a pooling layer and a fully connected layer; there may be several convolutional layers and several deconvolution layers. As the stride of the convolutional layers increases, the corresponding feature map size (vector size) shrinks; deconvolution layers can then be inserted to enlarge it again, which increases the capacity of the neural network and enhances its learning ability.
Referring to the example shown in fig. 6, in the facial expression recognition model the first seven layers are convolutional layers, the eighth and ninth are deconvolution layers, the tenth and eleventh are convolutional layers, the twelfth is a deconvolution layer, and the thirteenth through fifteenth are convolutional layers. The face image sample is input into the facial expression recognition model, processed in turn by these first fifteen (convolutional or deconvolution) layers, then by the pooling layer, and finally by the fully connected layer, which outputs the expression base coefficient predicted value.
In this embodiment, a jump connection is also established between equal-sized convolutional and deconvolution layers. The outputs of a convolutional layer and a deconvolution layer joined by a jump connection are used together as the input of the next processing layer. For example, if a jump connection is established between the fifth-layer convolutional layer and the ninth-layer deconvolution layer, the outputs of the fifth and ninth layers together form the input of the tenth layer.
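A much-reduced sketch (far fewer than the fifteen layers of fig. 6) of this kind of jump connection between an equal-sized convolutional layer and deconvolution layer, assuming PyTorch; the channel counts, the use of concatenation to combine the two outputs, and the 52 output coefficients are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SkipConvDeconv(nn.Module):
    def __init__(self, n_bases: int = 52):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)              # 224 -> 112
        self.conv2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)             # 112 -> 56
        self.deconv = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)   # 56 -> 112
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)             # takes conv1 + deconv outputs
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, n_bases)

    def forward(self, x):
        c1 = torch.relu(self.conv1(x))       # 16 x 112 x 112
        c2 = torch.relu(self.conv2(c1))      # 32 x 56 x 56
        d1 = torch.relu(self.deconv(c2))     # 16 x 112 x 112, same size as c1
        # jump connection: the equal-sized conv and deconv outputs feed the next layer together
        c3 = torch.relu(self.conv3(torch.cat([c1, d1], dim=1)))
        return torch.sigmoid(self.fc(self.pool(c3).flatten(1)))

coeffs = SkipConvDeconv()(torch.randn(1, 3, 224, 224))                     # shape (1, 52)
```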
In the facial expression recognition model, the jump connections established between convolutional and deconvolution layers allow the gradient to flow directly to other processing layers, making the neural network easier to train and avoiding the problem of the earlier layers of the facial expression recognition model being insufficiently trained because the gradient diminishes too quickly.
In this embodiment, the number of training iterations of the facial expression recognition model may also be set; if, after the corresponding number of iterations, the change rate of the target loss value is smaller than the preset threshold, that is, the target loss value has approached a fixed value, it may be determined that training of the facial expression recognition model is complete.
Optionally, the difference between the expression base coefficient predicted value and the expression base coefficient expected value includes at least one of:
the Euclidean distance between the expression base coefficient predicted value and the expression base coefficient expected value;
a distribution difference between the expression base coefficient predicted value and the expression base coefficient expected value;
a difference between the three-dimensional mesh model determined from the expression base coefficient predicted value and the three-dimensional mesh model determined from the expression base coefficient expected value.
In this embodiment, one or more loss functions may be designed for training the facial expression recognition model; training is then determined to be complete when the change rates of all of the corresponding target loss values are smaller than the preset threshold.
Since the expression base coefficient predicted value output by the facial expression recognition model is an N-dimensional vector, the euclidean distance between the expression base coefficient predicted value and the expression base coefficient expected value can be used as the difference between the expression base coefficient predicted value and the expression base coefficient expected value, and the target loss value can be determined based on the L1 loss function or the L2 loss function.
Since each coefficient in the expression base coefficient predicted value is in the interval [0,1], the distribution difference between the expression base coefficient predicted value and the expression base coefficient expected value can be used as the difference between the expression base coefficient predicted value and the expression base coefficient expected value, and the target loss value is determined based on the cross entropy loss function.
On this basis, the difference between the three-dimensional mesh model determined from the expression base coefficient predicted value and the three-dimensional mesh model (mesh) determined from the expression base coefficient expected value can also be used as the difference between the two coefficient values, and the target loss value is determined from this difference on the mesh structure.
Compared with determining the target loss value directly from the coefficient values, determining it from the difference on the mesh structure has the advantage that the role represented by each vertex on the mesh is fixed and the error between corresponding vertices of different meshes is small, so the target loss value determined in this way is more accurate.
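A hedged sketch of how the three kinds of differences above could be combined into a single target loss; the term weights and the dense per-base vertex offsets are assumptions.

```python
import torch
import torch.nn.functional as F

def target_loss(pred, expected, base_deltas, neutral, w=(1.0, 1.0, 1.0)):
    """pred, expected: (B, N) coefficient vectors in [0, 1].
    base_deltas: (N, V, 3) per-base vertex offsets; neutral: (V, 3) neutral mesh."""
    # 1) Euclidean (L2) distance between the predicted and expected coefficient vectors
    l2 = F.mse_loss(pred, expected)
    # 2) distribution difference, treating each coefficient as a per-base probability
    bce = F.binary_cross_entropy(pred.clamp(1e-6, 1 - 1e-6), expected)
    # 3) difference between the 3D meshes reconstructed from the two coefficient sets
    pred_mesh = neutral + torch.einsum("bn,nvc->bvc", pred, base_deltas)
    expected_mesh = neutral + torch.einsum("bn,nvc->bvc", expected, base_deltas)
    mesh = F.l1_loss(pred_mesh, expected_mesh)
    return w[0] * l2 + w[1] * bce + w[2] * mesh
```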
After training of the facial expression recognition model is complete, videos recorded in different scenes can be used for model testing and inference. During testing and inference, the captured face images can be preprocessed (data normalization, face key point detection and face matting), but augmentation such as image enhancement is not applied. Fig. 7 and fig. 8 show test results of the facial expression recognition model for different people in different scenes.
In this technical solution, to improve the network performance of the facial expression recognition model, the original structure of the convolutional neural network is adjusted, and a small number of deconvolution operations and jump connections improve the network's ability to learn from the input information.
When the facial expression recognition model provided by this embodiment is used for avatar driving, only one camera is needed. Each frame captured by the camera is processed by the convolutional neural network, which outputs the facial expression base coefficients for that frame, and the expression of the 3D avatar is then driven in real time so that it follows the anchor's expression. This ensures the high performance and real-time behavior required for driving 3D avatar expressions, makes live broadcasting with avatars practical, and improves the viewing experience and interactivity for the audience.
EXAMPLE III
Fig. 9 is a schematic block diagram of an avatar driving apparatus according to a third embodiment of the present invention. The embodiment is applicable to the situation where an anchor drives an avatar's expression in a live broadcast room. The apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device.
As shown in fig. 9, the apparatus specifically includes: a real-time facial image acquisition module 310, a facial expression recognition module 320 and an avatar driving module 330.
a real-time facial image acquisition module 310, configured to acquire a facial image in a live video frame in real time;
a facial expression recognition module 320, configured to recognize an expression base coefficient corresponding to the facial image based on a deep learning manner; the expression base coefficient is matched with a preset target avatar expression base group;
and the avatar driving module 330 is configured to apply the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group.
In the technical solution provided by the embodiments of the present invention, for a face image in a live video frame acquired in real time, an expression base coefficient is recognized based on deep learning, and the recognized expression base coefficient is applied to the corresponding target avatar expression base group to drive the expression of the corresponding avatar. With this technical solution, built on deep learning, an inexpensive camera capturing live images in real time is all that is needed for the face image to drive the avatar's expression; no professional equipment is required, which reduces the cost of driving an avatar for personalized real-time live streaming.
Optionally, the facial expression recognition module 320 is specifically configured to input the face image into a pre-trained facial expression recognition model to obtain the expression base coefficient corresponding to the face image.
Optionally, the facial expression recognition module 320 is specifically configured to send expression base coefficient recognition request information to a pre-trained facial expression recognition model, where the request information contains at least the face image, and to receive the expression base coefficient corresponding to the face image returned by the facial expression recognition model.
Optionally, the apparatus further comprises: the facial expression recognition model training module is used for acquiring a facial image sample before recognizing an expression base coefficient corresponding to the facial image based on a deep learning mode; determining an expression base coefficient expected value corresponding to the face image sample; and taking an expression base coefficient expected value corresponding to the facial image sample as supervision, and training a facial expression recognition model according to the facial image sample.
Further, the facial expression recognition model training module is specifically used for inputting the facial image sample into an untrained facial expression recognition model to obtain an expression base coefficient prediction value; determining a target loss value according to the difference between the expression base coefficient predicted value and the expression base coefficient expected value; and when the change rate of the target loss value is smaller than a preset threshold value, determining that the facial expression recognition model completes training.
Optionally, the difference between the expression base coefficient predicted value and the expression base coefficient expected value includes at least one of:
the Euclidean distance between the expression base coefficient predicted value and the expression base coefficient expected value;
a distribution difference between the expression base coefficient predicted value and the expression base coefficient expected value;
a difference between the three-dimensional mesh model determined from the expression base coefficient predicted value and the three-dimensional mesh model determined from the expression base coefficient expected value.
Further, the facial expression recognition model training module is specifically used for inputting the facial image sample into a facial expression recognition model, and sequentially processing the facial image sample through each processing layer according to the setting sequence of the processing layers in the facial expression recognition model to obtain an expression base coefficient predicted value; the processing layer comprises a convolution layer, a deconvolution layer, a pooling layer and a full-connection layer, and jump connection is established between at least one pair of convolution layer and deconvolution layer with equal size.
The avatar driving apparatus provided by the embodiment of the present invention can execute the avatar driving method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to that method.
Example four
Fig. 10 is a schematic structural diagram of a computer apparatus according to a fourth embodiment of the present invention, as shown in fig. 10, the computer apparatus includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in fig. 10; the processor 40, the memory 41, the input device 42 and the output device 43 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 10.
The memory 41 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the avatar driving method in the embodiment of the present invention (for example, the face image real-time acquisition module 310, the facial expression recognition module 320, and the avatar driving module 330 in the avatar driving apparatus in fig. 9). The processor 40 executes various functional applications of the computer device and data processing, i.e., implements the above-described avatar driving method, by running software programs, instructions, and modules stored in the memory 41.
The memory 41 may mainly include a program storage area and a data storage area: the program storage area can store the operating system and the application programs required for at least one function, and the data storage area can store data created through use of the computer device, and the like. Further, the memory 41 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, connected to the computer device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function controls of the computer apparatus. The output device 43 may include a display device such as a display screen.
EXAMPLE five
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program, which when executed by a computer processor is configured to perform an avatar driving method, the method including:
acquiring a face image in a live video frame in real time;
recognizing an expression base coefficient corresponding to the face image based on a deep learning mode; the expression base coefficient is matched with a preset target avatar expression base group;
and using the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present invention is not limited to the method operations described above and may also perform related operations in the avatar driving method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
It should be noted that, in the embodiment of the avatar driving apparatus, the included units and modules are only divided according to functional logic, but not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. An avatar driving method, comprising:
acquiring a face image in a live video frame in real time;
recognizing an expression base coefficient corresponding to the face image based on a deep learning mode; the expression base coefficient is matched with a preset target avatar expression base group;
and using the expression base coefficient to the target avatar expression base group to drive the expression of the avatar corresponding to the target avatar expression base group.
2. The method of claim 1, wherein identifying the expression base coefficients corresponding to the facial image based on a deep learning manner comprises:
and inputting the facial image into a pre-trained facial expression recognition model to obtain an expression base coefficient corresponding to the facial image.
3. The method of claim 1, wherein identifying the expression base coefficients corresponding to the facial image based on a deep learning manner comprises:
sending expression base coefficient identification request information to a pre-trained facial expression identification model; wherein the expression base coefficient identification request information at least comprises a face image;
and receiving an expression base coefficient corresponding to the face image returned by the face expression recognition model.
4. The method according to claim 2 or 3, before identifying the expression base coefficients corresponding to the face image based on a deep learning manner, further comprising:
acquiring a face image sample;
determining an expression base coefficient expected value corresponding to the face image sample;
and taking an expression base coefficient expected value corresponding to the facial image sample as supervision, and training a facial expression recognition model according to the facial image sample.
5. The method of claim 4, wherein training a facial expression recognition model according to the facial image samples with an expression base coefficient expected value corresponding to the facial image samples as a supervision comprises:
inputting the facial image sample into an untrained facial expression recognition model to obtain an expression base coefficient prediction value;
determining a target loss value according to the difference between the expression base coefficient predicted value and the expression base coefficient expected value;
and when the change rate of the target loss value is smaller than a preset threshold value, determining that the facial expression recognition model completes training.
6. The method of claim 5, wherein the difference between the expression base coefficient predicted value and the expression base coefficient expected value comprises at least one of:
the Euclidean distance between the expression base coefficient predicted value and the expression base coefficient expected value;
a distribution difference between the expression base coefficient predicted value and the expression base coefficient expected value;
a difference between the three-dimensional mesh model determined according to the expression base coefficient predicted value and the three-dimensional mesh model determined according to the expression base coefficient expected value.
7. The method of claim 5, wherein inputting the facial image samples into an untrained facial expression recognition model to obtain an expression base coefficient prediction value comprises:
inputting the facial image sample into a facial expression recognition model, and sequentially processing through each processing layer according to the setting sequence of the processing layers in the facial expression recognition model to obtain an expression base coefficient predicted value;
the processing layer comprises a convolution layer, a deconvolution layer, a pooling layer and a full-connection layer, and jump connection is established between at least one pair of convolution layer and deconvolution layer with equal size.
8. An avatar driving apparatus, comprising:
the real-time face image acquisition module is used for acquiring a face image in a live video frame in real time;
the facial expression recognition module is used for recognizing an expression base coefficient corresponding to the facial image based on a deep learning mode; the expression base coefficient is matched with a preset target avatar expression base group;
and the avatar driving module is used for using the expression base coefficient to the target avatar expression base group so as to drive the avatar expression corresponding to the target avatar expression base group.
9. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110800699.XA 2021-07-15 2021-07-15 Avatar driving method, apparatus, device, and medium Pending CN113537056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110800699.XA CN113537056A (en) 2021-07-15 2021-07-15 Avatar driving method, apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110800699.XA CN113537056A (en) 2021-07-15 2021-07-15 Avatar driving method, apparatus, device, and medium

Publications (1)

Publication Number Publication Date
CN113537056A true CN113537056A (en) 2021-10-22

Family

ID=78099441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110800699.XA Pending CN113537056A (en) 2021-07-15 2021-07-15 Avatar driving method, apparatus, device, and medium

Country Status (1)

Country Link
CN (1) CN113537056A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114245155A (en) * 2021-11-30 2022-03-25 北京百度网讯科技有限公司 Live broadcast method and device and electronic equipment
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
CN114422697A (en) * 2022-01-19 2022-04-29 浙江博采传媒有限公司 Virtual shooting method, system and storage medium based on optical capture
WO2024022065A1 (en) * 2022-07-25 2024-02-01 京东方科技集团股份有限公司 Virtual expression generation method and apparatus, and electronic device and storage medium
CN115937372A (en) * 2022-12-19 2023-04-07 北京字跳网络技术有限公司 Facial expression simulation method, device, equipment and storage medium
CN115937372B (en) * 2022-12-19 2023-10-03 北京字跳网络技术有限公司 Facial expression simulation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113537056A (en) Avatar driving method, apparatus, device, and medium
Vazquez et al. Virtual and real world adaptation for pedestrian detection
CN109376603A (en) A kind of video frequency identifying method, device, computer equipment and storage medium
CN110414432A (en) Training method, object identifying method and the corresponding device of Object identifying model
US8503770B2 (en) Information processing apparatus and method, and program
CN109257622A (en) A kind of audio/video processing method, device, equipment and medium
CN109558832A (en) A kind of human body attitude detection method, device, equipment and storage medium
EP2246807A1 (en) Information processing apparatus and method, and program
EP4184927A1 (en) Sound effect adjusting method and apparatus, device, storage medium, and computer program product
CN111491187B (en) Video recommendation method, device, equipment and storage medium
Zhu et al. Efficient action detection in untrimmed videos via multi-task learning
CN112132197A (en) Model training method, image processing method, device, computer equipment and storage medium
CN110619284B (en) Video scene division method, device, equipment and medium
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN111444826A (en) Video detection method and device, storage medium and computer equipment
Zhang et al. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification
CN113870395A (en) Animation video generation method, device, equipment and storage medium
CN110652726A (en) Game auxiliary system based on image recognition and audio recognition
CN113362422A (en) Shadow robust makeup transfer system and method based on decoupling representation
CN114360018A (en) Rendering method and device of three-dimensional facial expression, storage medium and electronic device
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
CN111265851B (en) Data processing method, device, electronic equipment and storage medium
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion
CN111768729A (en) VR scene automatic explanation method, system and storage medium
CN114449362B (en) Video cover selection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination